Open 00krishna opened 6 months ago
@00krishna Thank you for your interest.
I haven't carefully thought about the pros and cons of doing so yet. But the primary reason for LabeledArray to live in this package is to make sure its design accommodates whatever peculiar requirements are encountered for readstat and writestat. There are alternatives such as CategoricalArrays.jl, which is more feature-rich, and PooledArrays.jl, which is very lightweight. There are subtle differences among the three regarding design philosophy and priorities. Hopefully, one always finds something that fits the need best.
Thanks @junyuan-chen, this is helpful. Yeah, I was looking at CategoricalArrays.jl too, and I understand your view on keeping LabeledArray within ReadStatTables.jl. However, I was looking at CategoricalArrays and it does not seem to have a mapping between an index value and a category name, such as 1 => "january". I read through the package docs as well as the DataFrames docs on categorical variables, but I did not see a way to preserve both the index values and categorical/text values at the same time.
Now, I have not used CategoricalArrays before, so I am just depending on what I read. It could be that the docs don't provide an example of this kind of key-value indexing as LabeledArray does. IndirectArrays seems like the closest match, but that package does not seem to have any docs, so I am just going off the README :). So I was just wondering if you had seen a package that supports this kind of key-value structure for categorical data?
CategoricalArray decides by itself how the numerical values are assigned. The encoding process is treated as an internal implementation detail in which users are not supposed to intervene directly. This is actually one of the main reasons why LabeledArray was introduced.
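To make the distinction concrete, here is a minimal sketch in plain Base Julia. It does not call CategoricalArrays.jl or ReadStatTables.jl and does not reproduce their internals; the variable names and the "sorted order" rule for automatic codes are assumptions chosen only to illustrate the two encoding schemes.

```julia
# Illustration only: plain Base Julia, not the internals of any package.

months = ["march", "january", "february", "january"]

# Scheme 1: the container chooses the codes itself.
# Here (an assumption for illustration) the code is the position of the
# label among the sorted unique labels.
auto_levels = sort(unique(months))
auto_code = Dict(l => i for (i, l) in enumerate(auto_levels))
auto_codes = [auto_code[m] for m in months]

# Scheme 2: the user supplies the code => label mapping, e.g. taken from
# external metadata, and the stored codes follow that mapping.
user_labels = Dict(1 => "january", 2 => "february", 3 => "march")
label_to_code = Dict(l => c for (c, l) in user_labels)
user_codes = [label_to_code[m] for m in months]
```

Under the first scheme "january" ends up with code 2 simply because of alphabetical order, whereas under the second scheme it keeps the code 1 that the user specified.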
One possibility is to make LabeledArrays.jl a subpackage that lives inside this repo. This means that it would have its own UUID and be registered with the General registry. However, I still need to think about whether that's a good thing to do and will come back to this later once I figure it out.
Excellent. Yeah, that is totally fair. I appreciate your consideration.
You are 100% correct that the issue seems to be that CategoricalArrays does not allow the user to specify the index values for each category. I am pulling the index values for my data samples from US census metadata, so they have their own elaborate system of categorizations.
But certainly, take your time to consider what you think is possible. Thanks again for your time.
Hello. I was wondering if there has been any consideration of breaking LabeledArrays out into its own package?
The reason is that LabeledArrays provides really nice functionality that could be used in something like DataFrames.jl. Say I have a dataframe that has a categorical column, such as the month of the year. Here is an example.
Using numerical indices for categorical variables like month makes the data harder for users to read. Hence, a more intuitive interface is to swap the view for the month variable to look like:
We could potentially use LabeledArrays in a dataframe, but right now that array library is bundled with the full ReadStatTables.jl package. Breaking out the LabeledArray.jl library could allow some flexibility for using LabeledArrays in other places.
Please let me know if you can consider my request. Thank you.
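The kind of column described in the request above can be sketched roughly as follows. This is a hypothetical toy type written in Base Julia, not the LabeledArray from ReadStatTables.jl; the names MiniLabeledVector and codes, and the sample month data, are assumptions made up for illustration.

```julia
# Hypothetical type for illustration; not the LabeledArray from ReadStatTables.jl.
# It stores the numeric codes and shows the text labels when indexed.
struct MiniLabeledVector{V} <: AbstractVector{String}
    values::Vector{V}          # underlying codes, e.g. 1, 2, 3
    labels::Dict{V,String}     # user-specified code => label mapping
end

Base.size(x::MiniLabeledVector) = size(x.values)

# Indexing returns the label; an unlabeled code falls back to its printed value.
Base.getindex(x::MiniLabeledVector, i::Int) =
    get(x.labels, x.values[i], string(x.values[i]))

# Accessor for the underlying codes, which are preserved unchanged.
codes(x::MiniLabeledVector) = x.values

month_labels = Dict(1 => "january", 2 => "february", 3 => "march")
month = MiniLabeledVector([1, 3, 2, 1], month_labels)

shown = collect(month)   # the labels a reader would see
stored = codes(month)    # the original numeric codes
```

The point of the sketch is only that the codes and the labels coexist: the reader sees month names, while the numeric values supplied by the user remain available.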