use CSV for storing per-route state instead of JSON for better performance

youngj commented 4 years ago

When fetching the state from tryn-api to compute arrivals, profiling showed that a significant amount of running time was spent in read_temp_chunk_state, which loaded a cache file in CSV format into memory, converted the data structure, and wrote a file in JSON format. However, the JSON file was only used by eclipses.produce_buses to initialize a DataFrame, which can be built directly from the original CSV file with no JSON step in between.

This PR removes the JSON format and updates CachedState.get_for_route to return the DataFrame that used to be returned by eclipses.produce_buses. eclipses.find_arrivals is updated to take the DataFrame as a parameter instead of a dict.
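To illustrate the idea, here is a minimal sketch comparing a CSV -> dict -> JSON -> DataFrame round trip with reading the CSV straight into a DataFrame. It is not the actual code from read_temp_chunk_state or CachedState.get_for_route: the function names load_via_json and load_direct, the column names (vid, time, lat, lon), and the sample rows are all assumptions made up for the example.

```python
import csv
import io
import json

import pandas as pd

# Hypothetical cached per-route state; the columns and values are
# assumptions for this sketch, not the real cache file layout.
CSV_TEXT = (
    "vid,time,lat,lon\n"
    "1001,1585120000,45.5,-122.75\n"
    "1001,1585120015,45.5,-122.5\n"
    "1002,1585120000,45.25,-122.75\n"
)


def load_via_json(csv_text):
    """Roughly the shape of the removed path: CSV -> dicts -> JSON -> DataFrame."""
    rows = [
        {
            "vid": row["vid"],
            "time": int(row["time"]),
            "lat": float(row["lat"]),
            "lon": float(row["lon"]),
        }
        for row in csv.DictReader(io.StringIO(csv_text))
    ]
    json_text = json.dumps(rows)  # extra serialization step
    return pd.DataFrame(json.loads(json_text))  # extra parsing step


def load_direct(csv_text):
    """The simpler path: let pandas parse the CSV into a DataFrame directly."""
    # Keep vehicle IDs as strings so they match the JSON-based version.
    return pd.read_csv(io.StringIO(csv_text), dtype={"vid": str})


direct = load_direct(CSV_TEXT)
via_json = load_via_json(CSV_TEXT)
print(direct.shape, via_json.shape)
```

Both paths end with a DataFrame holding the same rows and columns; the direct path just skips the intermediate conversion, serialization, and re-parsing.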

youngj commented 4 years ago

For a daily TriMet run, the runtime just to fetch and cache the state was 6m43s for this version compared to 11m46s with the JSON version, so it saved about 5 minutes. (time python get_state.py --agency=trimet --date=2020-03-25)
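As a quick check of those numbers, the two timings can be converted to seconds and compared. The helper parse_duration is a hypothetical name introduced only for this sketch; the two durations are the ones quoted above.

```python
import re


def parse_duration(text):
    """Convert a duration such as '6m43s' to a number of seconds."""
    match = re.fullmatch(r"(\d+)m(\d+)s", text)
    if match is None:
        raise ValueError(f"unrecognized duration: {text!r}")
    minutes, seconds = map(int, match.groups())
    return minutes * 60 + seconds


csv_seconds = parse_duration("6m43s")    # this version (CSV only)
json_seconds = parse_duration("11m46s")  # previous version (CSV + JSON)

saved_seconds = json_seconds - csv_seconds
percent_saved = 100 * saved_seconds / json_seconds
speedup = json_seconds / csv_seconds

print(f"saved {saved_seconds // 60}m{saved_seconds % 60:02d}s "
      f"({percent_saved:.1f}% less time, {speedup:.2f}x faster)")
```

For this single run, that works out to 5m03s saved, or roughly 43% less time spent fetching and caching the state.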

These .csv files are only cached locally within the Docker container and aren't saved to S3 (neither were the .json files), so there is no difference in storage costs. The .json.gz files containing the arrival times saved by compute_arrivals.py should be identical to the previous .json.gz files.

EddyIonescu commented 4 years ago

I see, this just removes the JSON write/read that wasn't needed anymore - so no plans to change the format of anything outputted to S3. Thanks for doing https://github.com/trynmaps/metrics-mvp/issues/609!

trynmaps/metrics-mvp#618