Allow stata to read arrow files

opensafely-core / stata-docker

Builds the opensafely stata docker image.

0 stars 0 forks source link

Allow stata to read arrow files #23

Closed rebkwok closed 1 year ago

rebkwok commented 1 year ago

See https://github.com/opensafely-core/databuilder/issues/794

Adds an arrowload command to allow stata to read arrow files, via a python script.

libraries/arrowload.ado provides the arrowload command; all it does is call the python script at python_scripts/load_arrow.py.

It's called with:

. arrowload /path/to/arrow/file

It can optionally be passed a value to use for setting the batch size when writing the records to the stata dataset.

The strategy taken is more or less as suggested by @evansd here: https://github.com/opensafely-core/databuilder/issues/794#issuecomment-1438474592, and borrows heavily from readit, but for arrow files only, using pyarrow.

rebkwok commented 1 year ago

I've tested this reading .feather output from cohort-extractor and .arrow output from databuilder

rebkwok commented 1 year ago

There's a bug that I've just noticed (I've been trying this with arrow files from the initiative 1 studies). Stata has a limit of 32 characters for variable names, and the sfi.SFIToolkit.strToName method I've used to clean the names just truncates them if they're too long. So when a study has lots of iterations over visits/events etc, and has variable names like is_opted_out_of_nhs_data_share_1, is_opted_out_of_nhs_data_share_2 etc the name-cleaning results in duplicate column names, which means it drops some columns.

I'm not sure what the best solution is - it could raise an error if the name-cleaning results in duplicate names, or if any names are >32 characters, and just prompt users to rename their variables to shorter names. Or I can do something to generate unique names, by truncating them in some way and adding an incremental suffix, but depending on how the variables were named in the first place, that could result in variables that aren't easily matched up to the ones in the input file.

rebkwok commented 1 year ago

This is now ready for review again:

made the command take optional arguments properly
option to load an aliases csv file and errors for invalid variable names
added a help file

rebkwok commented 1 year ago

@evansd I think this is ready for review again. I've addressed all of your comments, and done some more re-working after I made some more comprehensive test files and found bugs and slow things (esp the date/timestamp conversions). So I'm afraid it pretty much needs an entire re-review. I've kept all the incremental commits, which I think make reasonably logical sense.