Open ShayAdler opened 1 year ago
Might not be related, but as another sanity, I tried a simple comparison of the counts, and the results are weird (sometimes there are more siri rides than gtfs)
-- compare counts only
select date_part('dow', gtfs.date) day_of_week, gtfs.date, gtfs.count, siri.count, gtfs.count - siri.count missing_rides from
(select gtfs_route.date date, count(*) count from gtfs_route
join gtfs_ride gr on gtfs_route.id = gr.gtfs_route_id
where line_ref=7005 and operator_ref=3
and gtfs_route.date > '2022-12-10'
group by gtfs_route.date) AS gtfs
join (
select date_trunc('day', sr.scheduled_start_time) date, count(*) count from siri_route
join siri_ride sr on siri_route.id = sr.siri_route_id
where line_ref=7005 and operator_ref=3
and date_trunc('day', sr.scheduled_start_time) > '2022-12-10'
group by date_trunc('day', sr.scheduled_start_time)) AS siri
on siri.date=gtfs.date;
Output -
Maybe the drivers had some spare time 😅
@OriHoch I'd very much like your help to understand the full flow of this ETL and specifically - what are the differences between
journey_gtfs_ride_id
androute_gtfs_ride_id
and why when we link only when one of them is valid?
regarding linking - we prefer the journey_gtfs_ride_id as it's more exact based on the journey_ref... I don't remember why I added the other method, I guess I found some rides which didn't match based on journey_ref.
this ETL depends on a lot of other data from SIRI and GTFS, so there could be different places which cause the mismatch in rides. I suggest to go back to the source raw SIRI/GTFS data and try to match those rides there.
one more point to take into account when looking for missing data is if all the relevant data was loaded for the given times. If there was an error in the MOT interfaces or in our ETL you might see missing data for rides that did happen..
I started working on adding DB tables that show this but didn't finish.
However, there are some tables which show this data -
siri_snapshots
- each scrape of MOT SIRI is shown there, recently they had some problems so you can see a lot of missing rows between 2022-12-23 11:59:37.199626
and 2022-12-24 15:41:51.211535
gtfs_data
- shows the status of gtfs data processing for each dategtfs_data_task
- shows the status of gtfs etl tasks processing for each date@OriHoch I'd very much like your help to understand the full flow of this ETL and specifically - what are the differences between
journey_gtfs_ride_id
androute_gtfs_ride_id
and why when we link only when one of them is valid?
- journey_gtfs_ride_id - matches gtfs_ride.journey_ref to siri_ride.journey_ref here
- route_gtfs_ride_id - is more complex, it looks for matching operator_ref, line_ref, route_id and date, then looks at siri_ride.scheduled_start_time matched with gtfs_ride.start_time in a few intervals of minutes here
regarding linking - we prefer the journey_gtfs_ride_id as it's more exact based on the journey_ref... I don't remember why I added the other method, I guess I found some rides which didn't match based on journey_ref.
So I took a specific siri ride that didn't match (26074926) and if we look at it's journey_ref (2145383_121222) we can see that it is not unique for some reason (we're looking in the raw data to find if we can find this duplication there as well)
For some reason, the one that's related to the 12.12 siri_ride is the gtfs_ride from the 13.13, even though I see you do filter by date - https://github.com/hasadna/open-bus-stride-etl/blob/6b1fc4a504fc7c927b6ab0b9cd45bb47d41dd920/open_bus_stride_etl/siri/update_rides_gtfs.py#L57 Do you have any idea why would it happen?
If I try to manually join siri_rides to gtfs_ride using the same reformatting you do in the ETL, I see that only 10 out of 39 siri rides in that day link based on the journey_ref. Meaning that the majority are linked based on the route_gtfs_ride_id (which as you said is more heuristic).
select * from siri_route
join siri_ride on siri_route.id = siri_ride.siri_route_id
left join gtfs_ride gr on split_part(siri_ride.journey_ref, '-', 4) || '_' || split_part(siri_ride.journey_ref, '-', 3) || split_part(siri_ride.journey_ref, '-', 2) || substr(split_part(siri_ride.journey_ref, '-', 1), 3) = gr.journey_ref
where line_ref=7005 and operator_ref=3
and date_trunc('day', siri_ride.scheduled_start_time) = '2022-12-12'
If I take of of the journey refs from the siri rides that didn't match, they indeed don't match to any existing gtfs_ride
select * from gtfs_ride where journey_ref like '3871134_121222' -- zero results
very good work! I think it shows a lot of different problems here..
select siri_ride.id siri_ride_id, siri_ride.journey_ref siri_journey_ref, siri_ride.scheduled_start_time siri_scheduled_start_time,
siri_ride.journey_gtfs_ride_id, siri_ride.route_gtfs_ride_id,
gtfs_ride.id gtfs_ride_id, gtfs_ride.journey_ref gtfs_journey_ref, gtfs_ride.start_time gtfs_start_time,
gtfs_route.date gtfs_date
from siri_ride, gtfs_ride, gtfs_route where siri_ride.gtfs_ride_id in (27903105, 28012485)
and siri_ride.gtfs_ride_id = gtfs_ride.id
and gtfs_route.id = gtfs_ride.gtfs_route_id
siri_ride_id | siri_journey_ref | siri_scheduled_start_time | journey_gtfs_ride_id | route_gtfs_ride_id | gtfs_ride_id | gtfs_journey_ref | gtfs_start_time | gtfs_date |
---|---|---|---|---|---|---|---|---|
26179943 | 2022-12-13-2145399 | 2022-12-13 05:20:00.000000 | null | 28012485 | 28012485 | 2145383_121222 | 2022-12-13 05:20:00.000000 | 2022-12-13 |
26074926 | 2022-12-12-2145383 | 2022-12-12 05:20:00.000000 | 28012485 | 27903105 | 28012485 | 2145383_121222 | 2022-12-13 05:20:00.000000 | 2022-12-13 |
2022-12-13-2145399
/ 2145399_121322
- does not exist in the gtfs data (when searching across all dates/rides), so this is strange and worth to check in the raw gtfs data if really such a journey ref doesn't exist2022-12-12-2145383
/ 2145383_121222
- according to siri this ride happened on 12/12, but according to gtfs it happened on 13/12. My guess is this has to do with our dates logic in the gtfs etl, we don't have a specific date defined in the gtfs data, but instead rely on the date in which we asked for the data. So, it's possible we asked for the data on 13/12, but we got back data for 12/12 too? This gtfs_date affects all the gtfs date/time fields, they are all taken from this date which is just the date when we asked for the data.gtfs_route.date = '2022-12-13'
and there is a siri ride with time of 12/12 which matches the journey_ref it will relate to it (this is something that shouldn't happen, as we assume that journey_ref is really a unique id per date, but I guess it's not..)so, a lot of stuff to investigate and fix here :)
Update: after more research, we found out that route_gtfs_ride_id
is more accurate than journey_gtfs_ride_id
(the opposite of the current logic).
I'm currently trying to fix the ETL, these are the changes I'm doing -
gtfs_ride_id
(We'll need to manually set gtfs_ride_id
once to apply this chage).@OriHoch would like your help here regarding how to test this (I will open a PR later on)- Besides having a side script to validate the logic, I tried to load a sample db locally and play with the new queries I'm using but I'm not sure this is hermetic enough for such a big change.. Specifically, I'm worried about edge cases regarding edge-hours because we are filtering by gtfs date
-- a few hundreds every day
select count(1), gr.date from gtfs_ride
join gtfs_route gr on gtfs_ride.gtfs_route_id = gr.id
where date_trunc('day', gtfs_ride.start_time) != gr.date
group by gr.date
order by gr.date desc
I think the gtfs_date filter can be removed, we have other filters on scheduled_start_time / ride start_time which are enough (both contain also the date). The only logical problem I can think of is that we might have the same ride multiple times from different gtfs_dates (possibly with different ride start_times?).
Regarding testing, I don't see a problem to test on production.. given that current data is also not accurate
wdym by "same ride multiple times from different gtfs_dates"? aren't gtfs_rides created daily? If you mean that there could be 2 gtfs rides scheduled to the exact same start_time, I agree that it can happen in reality, and on that case, I think it can be a bit tricky to handle (match each siri ride to a different gtfs and not both to one of the gtfs_rides randomally), but I'm not sure how much it happens and therefore how much we care about this edge case.
Regarding testing, how about first adding the internal column, and then after more validations that I'll do on the entire dataset we'll move the FQ to point on my new column?
BTW, do you know how the current syntax you used for updating works id there is more than 1 match?
wdym by "same ride multiple times from different gtfs_dates"? aren't gtfs_rides created daily?
yes but the process of getting and processing the gtfs data is complex, it's not a simple download for a specific date, so need to try to make as few assumptions as possible about the correctness of the data as possible.. the gtfs date is just the date when we downloaded the data, so what happens if for example the MOT data was not updated at a given day? Or if we missed a download day? We try to handle all these situations but just need to take into account that there might be problems..
Regarding testing, how about first adding the internal column, and then after more validations that I'll do on the entire dataset we'll move the FQ to point on my new column?
sounds good
BTW, do you know how the current syntax you used for updating works if there is more than 1 match?
you need to make sure there is only 1 match.. if there is more then 1 match it will work but you can't know for sure which match will be used for the update
I wanted to use the absence of a link between gtfs ride to a siri ride (
siri_ride.gtfs_ride_id is null
) to categorize this planned ride as a ride that did not happen. As far as I understand from the data model, this hypothesis makes sense (of course that there are other cases in which there is siri data and the ride did not happen \ finish successfully, but I wanted to start with the easiest case).But when verifying the data, I noticed gaps between the siri ride counts to the siri rides I found unlinked. This is the flow I checked on Eged's line 470:
Output -
Now I zoomed in to the 12.12.22 which seemed to have 5 missing rides-
Both returned 39, meaning that probably all the planned rides did happen (maybe in different timings than expected?).
Than, I went looking at the ETL code that updates
siri_ride.gtfs_ride_id
(open-bus-stride-etl/open_bus_stride_etl/siri/update_rides_gtfs.py
) I manually ran the query. A subtraction of all the siri ride ids that did have gtfs relation from all the siri rides that day, gave me the unlinked ids - {‘26074926’, ‘26077170’, ‘26094950’, ‘26100800’, ‘26106606’}ETL Verification
Running the ETL query that links by
route_gtfs_ride_id
returns no results becausesiri_ride.journey_gtfs_ride_id
is not nullRunning the ETL query that links by
journey_gtfs_ride_id
returns no results either because for some reason in that case the gtfs_route.data is '2022-12-13'@OriHoch I'd very much like your help to understand the full flow of this ETL and specifically - what are the differences between
journey_gtfs_ride_id
androute_gtfs_ride_id
and why when we link only when one of them is valid?