mcaceresb / stata-parquet

Read and write parquet files from Stata
MIT License

Merge support #8

Closed kylebarron closed 5 years ago

kylebarron commented 6 years ago

This is really Dan's feature request. It's just something to put on a possible long-term roadmap.

Here I create 10 million observations of an id variable, and merge the file with itself:

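* Report the elapsed time of each command
set rmsg on
* Build 10 million observations with a unique, sorted id
set obs 10000000
gen id = _n
sort id
save a, replace
* Reload the file and merge it with itself on id
clear
use a
merge 1:1 id using a
exit, clear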

On my system, the use takes 0.05 seconds and the merge takes 1 second. Unsorted, the merge takes 1.7 seconds. Adding more variables improves the ratio a little, but the total time difference gets much larger. I also tried fmerge (from ftools), but that didn't help.

mcaceresb commented 6 years ago

I think there is some merge support in arrow, but the low-level API is so poorly documented that it might take a while.

parquet merge [m:1 or 1:1] using file.parquet

I don't think I could implement 1:m efficiently (and m:m is evil). I suppose this could also work:

parquet use file.parquet, merge([m:1 or 1:1 or 1:m] using file.parquet)
parquet query [SQL-like query? e.g. select * from file.parquet left join file.parquet etc.]

mcaceresb commented 6 years ago

I don't want to re-invent the wheel with merges, though. If I can use some other library to merge and just this to read to Stata once it's done that'd be great.
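
For reference, a minimal sketch of how a parquet file can be merged today without any new command, assuming the `parquet use filename, clear` form of this package. Note that this still relies on Stata's built-in merge rather than an external library, so it does not address the speed concern. The file names master.parquet and lookup.parquet and the key variable id are hypothetical.

```stata
* Assumed inputs: master.parquet and lookup.parquet, both containing id;
* id must uniquely identify the rows of lookup.parquet for m:1 to be valid.
tempfile lookup

* Read the using data and stage it as a temporary .dta file
parquet use lookup.parquet, clear
save `lookup'

* Read the master data, then merge with Stata's built-in command
parquet use master.parquet, clear
merge m:1 id using `lookup'
```

The extra save to a temporary .dta file is the cost of this approach. A dedicated `parquet merge` would presumably avoid that round trip, but that is speculation about a command that does not exist yet.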

kylebarron commented 6 years ago

> I don't want to re-invent the wheel with merges, though. If I can use some other library to merge and just this to read to Stata once it's done that'd be great.

Yes, I agree.

kylebarron commented 5 years ago

Dan pointed me to this: https://github.com/kylebarron/ftools/commit/c8072fa2350f811f5626657350a60c8379353189

kylebarron commented 5 years ago

This can probably be closed.