transitmatters / data-ingestion

Crontab for data ingestion/processing on AWS Lambda
MIT License

Update how we generate `all_slow.json` #74

Closed: cdwhitt closed this issue 5 months ago

cdwhitt commented 5 months ago

We need to simplify `ByDirection<SlowZonesResponse[]>` to `ByDirection<SlowZoneResponse>`, assuming there's only one active slow zone per track segment. That assumption needs to be verified by reviewing the code that generates `all_slow.json`.

The current `all_slow.json` doesn't contain historical values for every single day; it aggregates over multiple days, which can lead to inaccurate durations for selected days.
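
For context, a rough sketch of the shape change in question, assuming at most one zone can be active per segment at a time. The field names here are illustrative, not the dashboard's actual types:

```python
from typing import TypedDict

class SlowZoneResponse(TypedDict):
    """Illustrative stand-in; the real SlowZoneResponse lives in the dashboard code."""
    fr_id: str      # segment start station
    to_id: str      # segment end station
    delay: float    # seconds of delay versus the baseline travel time
    start: str      # date the zone appeared (YYYY-MM-DD)
    end: str        # date the zone cleared, or the latest service date if still active

# Current: ByDirection<SlowZonesResponse[]> -- a list of zones per direction.
CurrentShape = dict[str, list[SlowZoneResponse]]

# Proposed: ByDirection<SlowZoneResponse> -- a single zone per direction and
# track segment, which only works if at most one zone is active on a segment.
ProposedShape = dict[str, SlowZoneResponse]
```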

cdwhitt commented 5 months ago

@idreyn Please review this when you have a few minutes as this goes back to our discussion from Thursday. Feel free to edit the description.

idreyn commented 5 months ago

I think we should actually consider generating an `all_slow.json` for every day in history; there'd be too much data to squeeze into a single JSON file if we added a time series for every slow zone.
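
As a minimal sketch of that idea, assuming we keep writing to S3 (the bucket name, key layout, and helper are placeholders, not the pipeline's actual code):

```python
import json
from datetime import date

import boto3

s3 = boto3.client("s3")

def write_daily_all_slow(service_date: date, zones_by_direction: dict) -> None:
    """Write one all_slow-style file per service date instead of a single aggregate."""
    # Key layout is hypothetical; the point is one small object per day.
    key = f"slow-zones/daily/{service_date.isoformat()}/all_slow.json"
    s3.put_object(
        Bucket="example-ingestion-bucket",
        Key=key,
        Body=json.dumps(zones_by_direction).encode("utf-8"),
        ContentType="application/json",
    )
```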

austinjpaul commented 5 months ago

I'm missing some context here, but one file per day would require reading a lot of files to do timeseries displays. Better would be one file per slowzone - but it's tricky because they can change in duration as our baseline drifts around. And we'd want 2 weeks context before or after. Maybe we're not talking about per-slowzone timeseries displays though, the traveltime graphs do that pretty well.
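
To make the read-amplification concern concrete, a toy sketch of what a per-day layout forces a client (or a downstream job) to do for a date-range view; the paths and helper here are hypothetical:

```python
import json
from datetime import date, timedelta
from pathlib import Path

def load_slow_zone_series(start: date, end: date, root: Path) -> list[dict]:
    """Assemble a time series by reading one file per service date.

    A 90-day chart means roughly 90 reads, which is the cost being
    pointed out above; the file layout is purely illustrative.
    """
    series = []
    day = start
    while day <= end:
        path = root / f"{day.isoformat()}.json"
        if path.exists():
            series.append(json.loads(path.read_text()))
        day += timedelta(days=1)
    return series
```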

cdwhitt commented 5 months ago

We had discussed creating a dynamo table for this data, though I think that would incur a cost.

idreyn commented 5 months ago

> I'm missing some context here, but one file per day would require reading a lot of files to do timeseries displays.

I think that's true, but I also think we are already doing all of those reads in `analyze_for_slow()`. We'd just need to hold on to the un-averaged travel time and slow time a little longer.
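
Purely as a sketch of "hold on to the un-averaged values a little longer" (the real internals of `analyze_for_slow()` will differ, and these names are assumptions):

```python
from dataclasses import dataclass, field

@dataclass
class SegmentAccumulator:
    """Illustrative accumulator, not the actual structure used by analyze_for_slow()."""
    fr_id: str
    to_id: str
    baseline_travel_time: float  # seconds
    daily_travel_times: dict[str, float] = field(default_factory=dict)
    daily_slow_times: dict[str, float] = field(default_factory=dict)

    def record_day(self, service_date: str, travel_time: float) -> None:
        # Keep the raw per-day travel time and the implied delay around,
        # instead of folding them straight into a rolling average, so a
        # per-day time series can be emitted alongside the aggregate.
        self.daily_travel_times[service_date] = travel_time
        self.daily_slow_times[service_date] = max(0.0, travel_time - self.baseline_travel_time)
```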

> We had discussed creating a dynamo table for this data, though I think that would incur a cost.

We tend to be pretty liberal about spinning up Dynamo tables without worrying about their cost!
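
If the table route were ever taken, a minimal sketch of a per-day row (the table name, key schema, and attributes are assumptions, not an existing table):

```python
from decimal import Decimal

import boto3

dynamodb = boto3.resource("dynamodb")
# Hypothetical table: partition key = segment id, sort key = service date.
table = dynamodb.Table("SlowZonesDaily")

def put_daily_slow_zone(segment_id: str, service_date: str, delay_seconds: float) -> None:
    table.put_item(
        Item={
            "segment_id": segment_id,                      # e.g. "Red:place-davis:place-portr"
            "service_date": service_date,                  # e.g. "2024-02-05"
            "delay_seconds": Decimal(str(delay_seconds)),  # boto3 stores numbers as Decimal
        }
    )
```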

cdwhitt commented 5 months ago

Closing this, since the change is needed here instead: https://github.com/transitmatters/slow-zones/issues/43