Open pdeffebach opened 4 years ago
I would like to work on this, as I deal with .dta files quite regularly and I know the pain of handling Stata labels (in R or in general). I have also read the issues on adding metadata to dataframes and the discussion regarding metadata in DataAPI. Since I believe I come from a similar context (lots of household survey data), I agree with a lot of the points @pdeffebach made there, especially about persistent metadata (like in Stata) being super useful. However, as there does not seem to be a great solution on the horizon, what would be the general idea for implementing a solution that allows for a better workflow with .dta files?
Is the idea to create a global dict that allows swapping integers for string labels through some mapping based on column name? Should I look into Metadata.jl as a possible dependency for that? I have not worked with Metadata.jl before, but as far as I understand it seems to use the approach of a global dict.
I might need a lot of guidance, as this is my first open-source contribution. Sorry in advance.
I think a custom array type would handle this pretty easily. Something based off of CategoricalArrays.jl. But that might be a big task for someone doing their first open source contribution.
As you know, Stata basically stores value-labeled data as a vector of integers or doubles, not necessarily an ordered sequence starting at `1`, and a `Dict` going from `Int => String`.

Accessing the string values, which we generally care the most about, is hard with `ReadStat`. You have to

- Construct a `DataFrame` from the data and header fields
- Use the `value_label_dict` field to perform the replacement
- Call `get` on the DataValue elements of the array

This is not the most user-friendly thing.
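
For illustration, the replacement step can be sketched in plain Julia. This is only a sketch under assumptions: `codes`, `value_labels`, and `to_labels` are hypothetical names, not part of `ReadStat`, and `Union{Int,Missing}` stands in for the DataValue elements mentioned above.

```julia
# Hypothetical stand-ins for what a .dta reader hands back: raw codes plus a
# code => label mapping. Neither name comes from ReadStat.
codes = Union{Int,Missing}[1, 5, 9, missing, 5]
value_labels = Dict(1 => "urban", 5 => "rural", 9 => "refused")

# Replace each code with its label. Missing stays missing, and a code without
# a label falls back to its string form instead of throwing a KeyError.
function to_labels(codes, value_labels)
    return map(codes) do c
        ismissing(c) ? missing : get(value_labels, c, string(c))
    end
end

labelled = to_labels(codes, value_labels)
```

Note that the codes here are not a contiguous sequence starting at `1`, which is the case a plain `Dict` lookup handles without trouble. The cost is that `labelled` no longer carries the original integers.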
There isn't a great solution for this in Julia, as we don't have a `CategoricalArray` equivalent where the base dict maps arbitrary types to strings. So converting to a categorical array will drop the underlying integers, which are useful to keep for interoperability.

`haven` in R recently changed how this is handled with the `<dbl+lbl>` vector type, though working with it is a bit of a pain, see here.

I can email a data-set to someone with an MWE for more information.
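
To make the custom array type idea concrete, here is a minimal sketch of a wrapper that keeps the underlying codes and carries the labels alongside them. Everything in it is an assumption for illustration: `LabeledVector`, `label`, and `labels_of` are hypothetical names, it is not based on the CategoricalArrays.jl internals, and it does not handle missing values.

```julia
# Hypothetical wrapper type: stores the raw codes and a code => label mapping.
struct LabeledVector{T} <: AbstractVector{T}
    values::Vector{T}
    labels::Dict{T,String}
end

# Minimal AbstractArray interface: indexing returns the raw stored code, so
# the vector behaves like its underlying integers (or doubles).
Base.size(v::LabeledVector) = size(v.values)
Base.getindex(v::LabeledVector, i::Int) = v.values[i]
Base.IndexStyle(::Type{<:LabeledVector}) = IndexLinear()

# Label of element i; codes without a label fall back to their string form.
label(v::LabeledVector, i::Int) = get(v.labels, v.values[i], string(v.values[i]))

# All labels, in element order.
labels_of(v::LabeledVector) = [label(v, i) for i in eachindex(v)]

income = LabeledVector([10, 20, 10, 99], Dict(10 => "low", 20 => "high"))
```

With this, `income[2]` and `sum(income)` work on the raw codes, while `labels_of(income)` gives the strings. A real implementation would also need to decide how labels behave under `setindex!`, slicing, and joins, which this sketch leaves out.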