Closed by rebkwok 1 year ago
I've tested this by reading `.feather` output from cohort-extractor and `.arrow` output from databuilder.
There's a bug that I've just noticed (I've been trying this with arrow files from the initiative 1 studies).
Stata has a limit of 32 characters for variable names, and the `sfi.SFIToolkit.strToName` method I've used to clean the names just truncates any that are too long. So when a study has lots of iterations over visits/events etc., and has variable names like `is_opted_out_of_nhs_data_share_1`, `is_opted_out_of_nhs_data_share_2` etc., the name-cleaning produces duplicate column names, which means some columns get dropped.
I'm not sure what the best solution is. We could raise an error if the name-cleaning produces duplicate names, or if any name is longer than 32 characters, and prompt users to rename their variables to something shorter. Alternatively, I could generate unique names by truncating them in some way and adding an incremental suffix, but depending on how the variables were named in the first place, that could produce variables that aren't easily matched back to the ones in the input file.
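For illustration, the truncate-plus-suffix option could look something like this. This is a sketch only, not what the PR implements, and `make_unique_stata_names` is a hypothetical helper name:

```python
def make_unique_stata_names(names, max_len=32):
    """Truncate names to Stata's 32-character limit, disambiguating any
    collisions with an incremental numeric suffix (illustrative sketch)."""
    seen = set()
    result = []
    for name in names:
        candidate = name[:max_len]
        if candidate in seen:
            # Collision after truncation: shorten further to make room
            # for a "_<n>" suffix, incrementing n until the name is unique.
            i = 1
            while True:
                suffix = f"_{i}"
                candidate = name[: max_len - len(suffix)] + suffix
                if candidate not in seen:
                    break
                i += 1
        seen.add(candidate)
        result.append(candidate)
    return result
```

As noted above, the drawback is that the suffixed names no longer match the columns in the input file, so users would need some way to map them back.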
@evansd I think this is ready for review again. I've addressed all of your comments, and done some more re-working after I made some more comprehensive test files and found bugs and slow spots (especially the date/timestamp conversions). So I'm afraid it pretty much needs a full re-review. I've kept all the incremental commits, which I think follow a reasonably logical sequence.
See https://github.com/opensafely-core/databuilder/issues/794
Adds an `arrowload` command to allow Stata to read arrow files, via a Python script.

`libraries/arrowload.ado` provides the `arrowload` command; all it does is call the Python script at `python_scripts/load_arrow.py`. It's called with:
It can optionally be passed a value to set the batch size used when writing the records to the Stata dataset.
The strategy taken is more or less as suggested by @evansd here: https://github.com/opensafely-core/databuilder/issues/794#issuecomment-1438474592, and borrows heavily from readit, but for arrow files only, using pyarrow.