transitmatters / mbta-performance

For processing performance data for the data dashboard
MIT License

Check for Parquet file changes #1

Open JNuss71 opened 3 months ago

JNuss71 commented 3 months ago

Is your feature request related to a problem? Please describe. Currently, the parquet performance data is downloaded and processed on a regular schedule regardless of whether the parquet file has changed. This unnecessarily downloads and processes parquet data that has already been processed.

Describe the solution you'd like Check whether the parquet performance data has changed by inspecting the HTTP response headers for either the ETag or the Last-Modified date/time. If it hasn't changed, skip the download and processing steps. If the file has been updated since it was last processed, download the new parquet file and process it.
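A minimal sketch of that check, using only the Python standard library. The function names (fetch_validators, has_changed) and the dictionary layout are hypothetical, not taken from this repository, and the sketch assumes the server answers HEAD requests with at least one of the two headers.

```python
import urllib.request
from typing import Optional


def fetch_validators(url: str, timeout: float = 30.0) -> dict:
    """Send a HEAD request and return the ETag / Last-Modified headers.

    Either value is None when the server does not send that header.
    """
    request = urllib.request.Request(url, method="HEAD")
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return {
            "etag": response.headers.get("ETag"),
            "last_modified": response.headers.get("Last-Modified"),
        }


def has_changed(previous: Optional[dict], current: dict) -> bool:
    """Decide whether the remote file differs from the one processed last time.

    `previous` holds the validators recorded after the last successful run
    (hypothetical layout: {"etag": ..., "last_modified": ...}).
    """
    if not previous:
        return True  # nothing recorded yet, so process the file
    # Prefer the ETag when both sides have one.
    if previous.get("etag") and current.get("etag"):
        return previous["etag"] != current["etag"]
    # Otherwise fall back to comparing the Last-Modified strings.
    if previous.get("last_modified") and current.get("last_modified"):
        return previous["last_modified"] != current["last_modified"]
    return True  # no usable validator: process to be safe
```

The scheduled job would call fetch_validators on the parquet URL, pass the result to has_changed along with whatever was recorded on the previous run, and skip the download when it returns False. Defaulting to "changed" when no validator is available means a missing header costs a redundant run rather than a missed update.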

devinmatte commented 3 months ago

For this we're going to need a way to keep track of when the file was last processed. Since we're taking one file and processing it into hundreds, I don't think the files on S3 will give us a great idea of when we last processed it without checking all of them (as some files won't be updated every run).
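One way to avoid inspecting every output file is to write a single small marker after each successful run. The sketch below keeps that marker as a local JSON file purely for illustration; the function names, the file layout, and the choice of a local path are all assumptions, and the same JSON could just as well live in one dedicated object or table row.

```python
import json
from datetime import datetime, timezone
from pathlib import Path
from typing import Optional


def load_state(path: Path) -> Optional[dict]:
    """Return the recorded state, or None if no run has been recorded yet."""
    if not path.exists():
        return None
    return json.loads(path.read_text())


def save_state(path: Path, etag: Optional[str], last_modified: Optional[str]) -> dict:
    """Record the validators of the file that was just processed.

    Call this only after processing succeeds, so a failed run is retried.
    """
    state = {
        "etag": etag,
        "last_modified": last_modified,
        # UTC timestamp of this run, in ISO 8601 format
        "processed_at": datetime.now(timezone.utc).isoformat(),
    }
    path.write_text(json.dumps(state))
    return state
```

Because the marker is written once per run, the job only ever reads one item to learn what was processed last, no matter how many output files a run touches.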

JNuss71 commented 2 months ago

Is it worth storing this kind of data in DynamoDB, or would that just be an unnecessary introduction of a database?