transitmatters / gobble

🦃 Process MBTA events into a format that can be consumed by the Data Dashboard
MIT License

Only pull relevant columns from gtfs stops #24

Closed hamima-halim closed 11 months ago

hamima-halim commented 11 months ago

The GTFS stop times dataframe can take more than 1.2 GB in memory if we pull all of it (varies per GTFS package). The actual .txt it's based on is only ~160 MB 🙃

Reading in only the useful columns takes our usage down from 1.2 GB to ~0.86 GB, and later filtering cuts that to about an eighth.
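For reference, a minimal sketch of what column-limited loading looks like with `pd.read_csv` and its `usecols` argument. The sample data, the `USEFUL_COLUMNS` list, and the `read_stop_times` helper are all made up for illustration; the columns this PR actually keeps may differ.

```python
import io

import pandas as pd

# Tiny stand-in for a GTFS stop_times.txt; a real feed has millions of rows.
SAMPLE_STOP_TIMES = """trip_id,arrival_time,departure_time,stop_id,stop_sequence,stop_headsign,pickup_type,drop_off_type,timepoint,checkpoint_id
t1,08:00:00,08:00:00,70061,1,,0,1,1,alwfe
t1,08:03:00,08:03:00,70063,2,,0,0,0,
t2,09:00:00,09:00:00,70061,1,,0,1,1,alwfe
"""

# Assumption: these are the "useful" columns. The PR text does not list them.
USEFUL_COLUMNS = ["trip_id", "arrival_time", "stop_id", "stop_sequence"]


def read_stop_times(source, columns=USEFUL_COLUMNS):
    """Parse only the requested columns; the rest are never materialized."""
    return pd.read_csv(
        source,
        usecols=columns,
        # Explicit dtypes keep IDs as strings and use a narrower integer.
        dtype={
            "trip_id": str,
            "arrival_time": str,
            "stop_id": str,
            "stop_sequence": "int32",
        },
    )


full = pd.read_csv(io.StringIO(SAMPLE_STOP_TIMES))
slim = read_stop_times(io.StringIO(SAMPLE_STOP_TIMES))

# Compare the in-memory footprint of the two frames, in bytes.
print("all columns:   ", full.memory_usage(deep=True).sum())
print("useful columns:", slim.memory_usage(deep=True).sum())
```

Note that `usecols` returns columns in file order, not in the order of the list passed in, so anything downstream should select by name rather than position.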

We might need to shrink this usage even further later on. If we keep OOM-ing, one approach would be to first stream through the file with a more lightweight file reader, identify the row numbers that contain the trips we care about, then use those row indices to build the skiprows arg. Is this a nightmare? Is pandas a nightmare? Are we in a nightmare?
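A rough sketch of that two-pass idea, not something this PR implements. It assumes the stdlib `csv` module as the lightweight reader, a hypothetical `read_selected_trips` helper, and a file with no blank lines or embedded newlines, so row numbers line up between the two passes. Since `skiprows` names the rows to skip rather than the rows to keep, the sketch passes a callable that skips everything outside the keep set.

```python
import csv
import io

import pandas as pd

# Tiny stand-in for stop_times.txt, trimmed to a few columns.
SAMPLE = """trip_id,arrival_time,stop_id,stop_sequence
t1,08:00:00,70061,1
t2,08:01:00,70061,1
t1,08:03:00,70063,2
t3,08:05:00,70065,1
t2,08:06:00,70063,2
"""


def rows_to_keep(lines, wanted_trip_ids):
    """First pass: stream the file and collect the row numbers worth keeping."""
    reader = csv.reader(lines)
    header = next(reader)
    trip_col = header.index("trip_id")
    keep = {0}  # row 0 is the header, which pandas still needs
    for row_number, row in enumerate(reader, start=1):
        if row and row[trip_col] in wanted_trip_ids:
            keep.add(row_number)
    return keep


def read_selected_trips(open_source, wanted_trip_ids):
    """Second pass: let pandas parse only the rows found in the first pass.

    open_source is a zero-argument callable returning a fresh file-like
    object, because the data has to be read twice.
    """
    with open_source() as first_pass:
        keep = rows_to_keep(first_pass, wanted_trip_ids)
    with open_source() as second_pass:
        return pd.read_csv(
            second_pass,
            # The callable returns True for every row that should be skipped.
            skiprows=lambda i: i not in keep,
            dtype={"trip_id": str, "stop_id": str},
        )


selected = read_selected_trips(lambda: io.StringIO(SAMPLE), {"t1", "t3"})
print(selected)
```

For a file on disk, `open_source` could be something like `lambda: open(path, newline="")`. Whether the extra pass is worth it depends on how small the set of wanted trips is relative to the whole file.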