Update Fannie Mae dataset for consistency

voltrondata-labs / benchmarks

Language-independent Continuous Benchmarking (CB) for Apache Arrow

MIT License

Update Fannie Mae dataset for consistency #114

Closed alistaire47 closed 2 years ago

alistaire47 commented 2 years ago

This PR adds a schema for the Fannie Mae dataset with improved variable names and types, and replaces the sample dataset. It is being done in parallel with https://github.com/ursacomputing/arrowbench/pull/107. Comments next to the schema document where to find more information about the data, should it be needed.

jonkeane commented 2 years ago

Are the test failures waiting for https://github.com/ursacomputing/arrowbench/pull/107 to be merged? Or is something else going on there?

alistaire47 commented 2 years ago

> Are the test failures waiting for ursacomputing/arrowbench#107 to be merged? Or is something else going on there?

There's something else; I'm debugging. I think types are getting changed in some places differently (for other datasets like nyctaxi_sample), and haven't yet figured out why

alistaire47 commented 2 years ago

> I think types are getting changed in some places differently (for other datasets like nyctaxi_sample), and haven't yet figured out why

Oh, I'm pretty sure this is because I switched from reading CSVs with pandas to reading them with pyarrow, and the type inference is different.

alistaire47 commented 2 years ago

OK, our timestamp precision is not very consistent: when round-tripping, we vary between timestamp[ns], timestamp[us], and sometimes timestamp[s], even when I try to enforce a schema.

alistaire47 commented 2 years ago

@jonkeane This is fixed and ready for review now