voltrondata-labs / arrowbench

R package for benchmarking

Harmonize Fannie Mae dataset and sample #88

Closed alistaire47 closed 2 years ago

alistaire47 commented 2 years ago

Per https://github.com/ursacomputing/arrowbench/pull/87#discussion_r865188670, harmonize the full and sample Fannie Mae sources. Currently the sample has many all-null columns (it has 108 columns, of which 61 contain non-null data; the full dataset has 31 columns, of which 29 contain non-null data), and neither file has column names, so operations that rely on names generated from column positions fail.

After this story, the only difference between the two sources should be the number of rows. If we can find some real column names, that would be ideal.
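
To make the acceptance criterion concrete, here is a rough sketch of the kind of check this implies, written in plain Python with only the standard library. It is not how arrowbench does it. The pipe delimiter, the toy data, the helper names (`read_rows`, `non_null_positions`, `harmonize`) and the column names `loan_id` and `channel` are all assumptions made for illustration. The sketch also assumes that once all-null columns are dropped, the remaining columns line up in the same order in both files, which would need to be verified against the real data.

```python
import csv
import io


def read_rows(text, delimiter="|"):
    """Parse header-less delimited text into a list of rows (lists of strings)."""
    return list(csv.reader(io.StringIO(text), delimiter=delimiter))


def non_null_positions(rows):
    """Return 0-based positions of columns with at least one non-empty value."""
    width = max((len(r) for r in rows), default=0)
    return [
        i for i in range(width)
        if any(i < len(r) and r[i] != "" for r in rows)
    ]


def harmonize(rows, keep, names):
    """Keep only the columns listed in `keep` and prepend `names` as a header."""
    if len(keep) != len(names):
        raise ValueError("need exactly one name per kept column")
    body = [[r[i] if i < len(r) else "" for i in keep] for r in rows]
    return [list(names)] + body


# Toy stand-ins for the two sources (hypothetical data, not real Fannie Mae rows).
full_text = "1|A|\n2|B|\n3||\n"        # 3 columns, the last one all-null
sample_text = "1||A||\n2||B||\n"       # 5 columns, three of them all-null

# Hypothetical names; the real ones would have to come from the data's docs.
names = ["loan_id", "channel"]

full_rows = read_rows(full_text)
sample_rows = read_rows(sample_text)

full_h = harmonize(full_rows, non_null_positions(full_rows), names)
sample_h = harmonize(sample_rows, non_null_positions(sample_rows), names)

# After harmonizing, the headers match and only the row counts differ.
same_schema = full_h[0] == sample_h[0]
row_counts = (len(full_h) - 1, len(sample_h) - 1)
```

With explicit names attached to both files, nothing downstream has to depend on names generated from column positions, which is what breaks today.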

jonkeane commented 2 years ago

We'll also want to harmonize with the copy in ursacomputing/benchmarks at https://github.com/ursacomputing/benchmarks/blob/main/benchmarks/data/fanniemae_sample.csv.