queryverse / ReadStat.jl

Read files from Stata, SAS, and SPSS
MIT License

Reading `.dta` with value labels #74

Open pdeffebach opened 4 years ago

pdeffebach commented 4 years ago

As you know, Stata basically stores value-labeled data as a vector of integers or doubles, not necessarily an ordered sequence starting at 1, and a Dict going from Int => String.

Accessing the string values, which we generally care the most about, is hard with ReadStat. You have to:

  1. Use ReadStat, not StatFiles, to access the internal fields of the Stata file
  2. Construct the DataFrame from the data and header fields
  3. Use the `value_label_dict` field to perform the replacement
  4. Use `get` on the `DataValue` elements of the array
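
For illustration, the replacement step boils down to something like the sketch below. It does not call ReadStat at all: `codes`, `value_labels`, and `apply_labels` are made-up stand-ins for one labeled column and its code-to-label dictionary, and `missing` is used in place of `DataValue` to keep the example dependency-free.

```julia
# Hypothetical stand-ins for one value-labeled Stata column:
# raw codes (not necessarily 1, 2, 3, ...) and a code => label dictionary.
codes = Union{Int,Missing}[1, 5, 9, missing, 5]
value_labels = Dict(1 => "Employed", 5 => "Unemployed", 9 => "Refused")

# Replace each code with its label. Missing values stay missing, and a code
# without a defined label falls back to the code printed as a string.
function apply_labels(codes, labels)
    return map(codes) do c
        ismissing(c) ? missing : get(labels, c, string(c))
    end
end

labeled = apply_labels(codes, value_labels)
```

Note that `labeled` only holds strings, so the original integer codes are gone unless the caller keeps `codes` around separately.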

This is not the most user-friendly thing.

There isn't a great solution for this in Julia, as we don't have a CategoricalArray equivalent whose underlying dict maps arbitrary values to strings. Converting to a categorical array therefore drops the underlying integers, which are useful to keep for interoperability.

haven in R recently changed how this is handled, with the `<dbl+lbl>` vector type, though working with it is a bit of a pain; see here.

I can email a dataset with an MWE to anyone who wants more information.

doriantsolak commented 2 years ago

I would like to work on this, as I have to deal with `.dta` files quite regularly and I know the pain of handling Stata labels (in R or in general). I have also read the issues on adding metadata to dataframes and the discussion regarding metadata in DataAPI. As I believe I come from a similar context (lots of household survey data), I agree with a lot of the points @pdeffebach made there, especially about persistent metadata (like in Stata) being super useful. However, as there does not seem to be a great solution on the horizon, what would be the general idea for implementing a solution that allows for a better workflow with `.dta` files?

Is the idea to create a global dict that allows swapping integers for string labels through some mapping based on column name? Should I look into Metadata.jl as a possible dependency for that? I have not worked with Metadata.jl before, but as far as I understand it seems to use the approach of a global dict.

I might need a lot of guidance, as this is my first open-source contribution; sorry in advance.

pdeffebach commented 2 years ago

I think a custom array type would handle this pretty easily. Something based off of CategoricalArrays.jl. But that might be a big task for someone doing their first open source contribution.
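
A very rough sketch of what such a type could look like is below. Everything here is an assumption for illustration: `LabeledVector`, `valuelabel`, and `valuelabels` are hypothetical names that exist in neither ReadStat.jl nor CategoricalArrays.jl, and missing values are left out to keep it short. The point is only that the array behaves like its underlying codes while carrying the label dictionary along.

```julia
# Hypothetical type: a vector that indexes like its raw codes but also
# stores the value => label mapping read from the file.
struct LabeledVector{T} <: AbstractVector{T}
    codes::Vector{T}          # underlying values as stored in the .dta file
    labels::Dict{T,String}    # value => label mapping
end

# Minimal AbstractArray interface: indexing returns the raw code.
Base.size(v::LabeledVector) = size(v.codes)
Base.getindex(v::LabeledVector, i::Int) = v.codes[i]
Base.IndexStyle(::Type{<:LabeledVector}) = IndexLinear()

# Label lookup, falling back to the code as a string when no label is defined.
valuelabel(v::LabeledVector, i::Int) = get(v.labels, v.codes[i], string(v.codes[i]))
valuelabels(v::LabeledVector) = [valuelabel(v, i) for i in eachindex(v)]

employment = LabeledVector(
    [1, 5, 9, 5],
    Dict(1 => "Employed", 5 => "Unemployed", 9 => "Refused"),
)

employment[2]               # raw code: 5
valuelabel(employment, 2)   # "Unemployed"
sum(employment .== 5)       # comparisons still work on the codes
```

Because indexing returns the codes, anything that works on an ordinary vector keeps working, and the labels are only consulted when asked for. A real implementation would also need to handle missing values, `setindex!`, and copying.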