transitmatters / mbta-performance

For processing performance data for the data dashboard
MIT License

Assign nullable dtypes to dataframe columns #7

Closed hamima-halim closed 3 months ago

hamima-halim commented 3 months ago

WIP.

A PR to fix https://github.com/transitmatters/mbta-performance/issues/4

Even though parquet files carry explicit per-column dtype metadata, pandas overrides that metadata for nullable integer columns and loads them as floats. Down the line, this causes overflow errors when numpy tries to recast the epoch timestamps into datetimes. More info: https://pandas.pydata.org/docs/user_guide/integer_na.html#nullable-integer-data-type
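A minimal sketch of the behavior described above, not the code in this PR. The column name `event_time` and the sample values are made up for illustration, and the frame is built in memory rather than read from parquet so the example has no extra dependencies.

```python
import pandas as pd

# Hypothetical column of epoch seconds with one missing value.
raw = pd.DataFrame({"event_time": [1700000000, None, 1700000060]})

# Without an explicit dtype, the missing value forces the column to float64.
default_dtype = str(raw["event_time"].dtype)

# Casting to the nullable "Int64" extension dtype keeps the values as
# integers and represents the gap as pd.NA instead of NaN.
fixed = raw.astype({"event_time": "Int64"})
nullable_dtype = str(fixed["event_time"].dtype)

# Epoch seconds can then be converted to datetimes; the gap becomes NaT.
timestamps = pd.to_datetime(fixed["event_time"], unit="s")

print(default_dtype, nullable_dtype)
print(timestamps)
```

When the frame does come from parquet, recent pandas versions (2.0 and later) also accept `dtype_backend="numpy_nullable"` in `pd.read_parquet`, which may avoid the float conversion at load time. Whether that fits this repo's pinned pandas version is an assumption to check.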

Tests to come.

devinmatte commented 3 months ago

I'd be cool with merging this as is and adding tests in a second PR, but I'm also fine re-reviewing later if you want to add the tests in this same PR.