junyuan-chen / ReadStatTables.jl

Read and write Stata, SAS and SPSS data files with Julia tables
MIT License

Breaking `LabelledArrays` into a separate package #41

Open 00krishna opened 6 months ago

00krishna commented 6 months ago

Hello. I was wondering whether there has been any consideration of breaking LabeledArray out into its own package?

The reason is that LabeledArray provides really nice functionality that could be used in something like DataFrames.jl. Say I have a dataframe with a categorical column, such as the month of the year. Here is an example.

julia> DataFrame(month = [1, 2, 3], sensor1 = [2.1, 2.4, 5.1])
3×2 DataFrame
 Row │ month  sensor1 
     │ Int64  Float64 
─────┼────────────────
   1 │     1      2.1
   2 │     2      2.4
   3 │     3      5.1

Using numerical indices for categorical variables like month makes the data harder for users to read. A more intuitive interface would swap the view of the month variable so that it looks like this:

 Row │ month     sensor1 
     │ String    Float64 
─────┼───────────────────
   1 │ january       2.1
   2 │ february      2.4
   3 │ march         5.1

We could potentially use LabeledArray in a dataframe, but right now that array type is bundled with the full ReadStatTables.jl package. Breaking LabeledArray out into its own library would allow some flexibility for using it in other places.

Please let me know if you can consider my request. Thank you.

junyuan-chen commented 6 months ago

@00krishna Thank you for your interest.

I haven't carefully thought about the pros and cons of doing so yet. But the primary reason for LabeledArray to live in this package is to make sure its design accommodates whatever peculiar requirements are encountered in readstat and writestat. There are alternatives such as CategoricalArrays.jl, which is more feature-rich, and PooledArrays.jl, which is very lightweight. There are subtle differences among the three in design philosophy and priorities. Hopefully, one can always find something that best fits the need.

00krishna commented 6 months ago

Thanks @junyuan-chen, this is helpful. Yeah, I was looking at CategoricalArrays.jl too, and I understand your view on keeping LabeledArray within ReadStatTables.jl. However, CategoricalArrays does not seem to have a mapping between an index value and a category name, such as 1 => "january". I read through the package docs as well as the DataFrames docs on categorical variables, but I did not see a way to preserve both the index values and the categorical/text values at the same time.

Now, I have not used CategoricalArrays before, so I am just depending on what I read. It could be that the docs don't provide an example of the kind of key-value indexing that LabeledArray does. IndirectArrays seems like the closest match, but that package seems to not have any docs, so I am just going off the README :). So I was just wondering if you had seen a package that supports this kind of key-value structure for categorical data?
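To make the key-value structure being asked about concrete, here is a minimal sketch in plain Julia with no packages. It is not how LabeledArray or any other package implements this; the names `codes`, `labels`, and `labelof` and the sample values are assumptions made up for illustration. The point is only that the stored integer codes stay intact while a separate lookup supplies the text shown to the reader.

```julia
# Hypothetical integer codes as they might appear in a data file.
codes = [1, 2, 3, 2]

# User-specified mapping from each code to its label (illustrative values).
labels = Dict(1 => "january", 2 => "february", 3 => "march")

# Look up the label for a code; fall back to the code itself if unlabeled.
labelof(code) = get(labels, code, string(code))

# The text view is derived on demand; `codes` is never modified.
shown = labelof.(codes)
```

Here `shown` holds the month names for display, while `codes` still carries the original integers for filtering or joining against metadata.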

junyuan-chen commented 6 months ago

CategoricalArray decides by itself how the numerical values are assigned. So, it treats the encoding process as an internal implementation detail that users are not supposed to intervene in directly. This is actually one of the main reasons why LabeledArray was introduced.
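A short sketch of that behavior, assuming CategoricalArrays.jl is installed and using its `categorical`, `levels`, and `levelcode` functions. The month strings are illustrative. By default the levels are sorted, so the integer codes follow alphabetical order rather than anything the user chose; passing `levels` changes the order, but the codes are still consecutive positions starting at 1, not arbitrary user-supplied values.

```julia
using CategoricalArrays

months = ["january", "february", "march"]

# Default: levels are sorted, so codes follow alphabetical order.
auto = categorical(months)
auto_levels = levels(auto)      # ["february", "january", "march"]
auto_codes = levelcode.(auto)   # [2, 1, 3]

# Specifying the level order changes the codes, but they remain 1, 2, 3, ...
ordered = categorical(months; levels=["january", "february", "march"])
ordered_codes = levelcode.(ordered)   # [1, 2, 3]
```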

One possibility is to make LabeledArrays.jl a subpackage that lives inside this repo. This means that it would have its own UUID and be registered with the General registry. However, I still need to think about whether that's a good thing to do, and I will come back to this later once I figure it out.

00krishna commented 6 months ago

Excellent. Yeah, that is totally fair. I appreciate your consideration.

You are 100% correct that the issue seems to be that CategoricalArrays does not allow the user to specify the index values for each category. I am pulling the index values for my data samples from US census metadata, so they have their own elaborate system of categorizations.

But certainly, take your time to consider what you think is possible. Thanks again for your time.