Rides that didn't happen - Issue linking siri rides to gtfs rides

ShayAdler commented 1 year ago

I wanted to use the absence of a link between gtfs ride to a siri ride (siri_ride.gtfs_ride_id is null) to categorize this planned ride as a ride that did not happen. As far as I understand from the data model, this hypothesis makes sense (of course that there are other cases in which there is siri data and the ride did not happen \ finish successfully, but I wanted to start with the easiest case).

But when verifying the data, I noticed gaps between the siri ride counts to the siri rides I found unlinked. This is the flow I checked on Eged's line 470:

-- how many rides did and did not happen per day
select date_trunc('day', gr.start_time), sr.id is null didnt_happen, count(*) from gtfs_route
    join gtfs_ride gr on gtfs_route.id = gr.gtfs_route_id
left join  siri_ride sr on gr.id = sr.gtfs_ride_id
where
gr.start_time > '2022-12-10 00:45:00.000000' and gtfs_route.line_ref=7005 and gtfs_route.operator_ref=3
GROUP BY date_trunc('day', gr.start_time), sr.id is null

Output -

Now I zoomed in to the 12.12.22 which seemed to have 5 missing rides-

-- how many gtfs rides happened in a specific date
select count(*) from gtfs_route
         join gtfs_ride gr on gtfs_route.id = gr.gtfs_route_id
where line_ref=7005 and operator_ref=3 and date='2022-12-12'

-- how many siri rides happened in a specific date
select count(*) from siri_route
         join siri_ride sr on siri_route.id = sr.siri_route_id
where line_ref=7005 and operator_ref=3
and date_trunc('day', sr.scheduled_start_time) = '2022-12-12'

Both returned 39, meaning that probably all the planned rides did happen (maybe in different timings than expected?).

Than, I went looking at the ETL code that updates siri_ride.gtfs_ride_id (open-bus-stride-etl/open_bus_stride_etl/siri/update_rides_gtfs.py) I manually ran the query. A subtraction of all the siri ride ids that did have gtfs relation from all the siri rides that day, gave me the unlinked ids - {‘26074926’, ‘26077170’, ‘26094950’, ‘26100800’, ‘26106606’}

ETL Verification

Running the ETL query that links by route_gtfs_ride_id returns no results because siri_ride.journey_gtfs_ride_id is not null

select gtfs_route.date, *
from siri_ride
         -- we search for both options (comment out one when executing) -
         -- join gtfs_ride on gtfs_ride.id = siri_ride.journey_gtfs_ride_id
         join gtfs_ride on gtfs_ride.id = siri_ride.route_gtfs_ride_id
         join gtfs_route on gtfs_route.id = gtfs_ride.gtfs_route_id
    and gtfs_route.date = '2022-12-12'
    and siri_ride.journey_gtfs_ride_id is null
where siri_ride.id in ('26074926', '26077170', '26094950', '26100800', '26106606') -- rides that did not connect well
-- where siri_ride.id in ('26064288', '26065379') -- ride that are connected

Running the ETL query that links by journey_gtfs_ride_id returns no results either because for some reason in that case the gtfs_route.data is '2022-12-13'

select gtfs_route.date, *
from siri_ride
         -- we search for both options (comment out one when executing) -
                join gtfs_ride on gtfs_ride.id = siri_ride.journey_gtfs_ride_id
--          join gtfs_ride on gtfs_ride.id = siri_ride.route_gtfs_ride_id
         join gtfs_route on gtfs_route.id = gtfs_ride.gtfs_route_id
--     and gtfs_route.date = '2022-12-12'
--     and siri_ride.journey_gtfs_ride_id is null
where siri_ride.id in ('26074926', '26077170', '26094950', '26100800', '26106606') -- rides that did not connect well

@OriHoch I'd very much like your help to understand the full flow of this ETL and specifically - what are the differences between journey_gtfs_ride_id and route_gtfs_ride_id and why when we link only when one of them is valid?

ShayAdler commented 1 year ago

Might not be related, but as another sanity, I tried a simple comparison of the counts, and the results are weird (sometimes there are more siri rides than gtfs)

-- compare counts only
select date_part('dow', gtfs.date) day_of_week, gtfs.date, gtfs.count, siri.count, gtfs.count - siri.count missing_rides from
(select gtfs_route.date date, count(*) count from gtfs_route
         join gtfs_ride gr on gtfs_route.id = gr.gtfs_route_id
where line_ref=7005 and operator_ref=3
and gtfs_route.date > '2022-12-10'
group by gtfs_route.date) AS gtfs
join (
select date_trunc('day', sr.scheduled_start_time) date, count(*) count from siri_route
         join siri_ride sr on siri_route.id = sr.siri_route_id
where line_ref=7005 and operator_ref=3
and date_trunc('day', sr.scheduled_start_time) > '2022-12-10'
group by date_trunc('day', sr.scheduled_start_time)) AS siri
on siri.date=gtfs.date;

Output -

Maybe the drivers had some spare time 😅

OriHoch commented 1 year ago

@OriHoch I'd very much like your help to understand the full flow of this ETL and specifically - what are the differences between journey_gtfs_ride_id and route_gtfs_ride_id and why when we link only when one of them is valid?

journey_gtfs_ride_id - matches gtfs_ride.journey_ref to siri_ride.journey_ref here
route_gtfs_ride_id - is more complex, it looks for matching operator_ref, line_ref, route_id and date, then looks at siri_ride.scheduled_start_time matched with gtfs_ride.start_time in a few intervals of minutes here

regarding linking - we prefer the journey_gtfs_ride_id as it's more exact based on the journey_ref... I don't remember why I added the other method, I guess I found some rides which didn't match based on journey_ref.

OriHoch commented 1 year ago

this ETL depends on a lot of other data from SIRI and GTFS, so there could be different places which cause the mismatch in rides. I suggest to go back to the source raw SIRI/GTFS data and try to match those rides there.

OriHoch commented 1 year ago

one more point to take into account when looking for missing data is if all the relevant data was loaded for the given times. If there was an error in the MOT interfaces or in our ETL you might see missing data for rides that did happen..

I started working on adding DB tables that show this but didn't finish.

However, there are some tables which show this data -

siri_snapshots - each scrape of MOT SIRI is shown there, recently they had some problems so you can see a lot of missing rows between 2022-12-23 11:59:37.199626 and 2022-12-24 15:41:51.211535
gtfs_data - shows the status of gtfs data processing for each date
gtfs_data_task - shows the status of gtfs etl tasks processing for each date

ShayAdler commented 1 year ago

@OriHoch I'd very much like your help to understand the full flow of this ETL and specifically - what are the differences between journey_gtfs_ride_id and route_gtfs_ride_id and why when we link only when one of them is valid?

journey_gtfs_ride_id - matches gtfs_ride.journey_ref to siri_ride.journey_ref here

route_gtfs_ride_id - is more complex, it looks for matching operator_ref, line_ref, route_id and date, then looks at siri_ride.scheduled_start_time matched with gtfs_ride.start_time in a few intervals of minutes here

regarding linking - we prefer the journey_gtfs_ride_id as it's more exact based on the journey_ref... I don't remember why I added the other method, I guess I found some rides which didn't match based on journey_ref.

So I took a specific siri ride that didn't match (26074926) and if we look at it's journey_ref (2145383_121222) we can see that it is not unique for some reason (we're looking in the raw data to find if we can find this duplication there as well)

For some reason, the one that's related to the 12.12 siri_ride is the gtfs_ride from the 13.13, even though I see you do filter by date - https://github.com/hasadna/open-bus-stride-etl/blob/6b1fc4a504fc7c927b6ab0b9cd45bb47d41dd920/open_bus_stride_etl/siri/update_rides_gtfs.py#L57 Do you have any idea why would it happen?

ShayAdler commented 1 year ago

If I try to manually join siri_rides to gtfs_ride using the same reformatting you do in the ETL, I see that only 10 out of 39 siri rides in that day link based on the journey_ref. Meaning that the majority are linked based on the route_gtfs_ride_id (which as you said is more heuristic).

select * from siri_route
         join siri_ride on siri_route.id = siri_ride.siri_route_id
         left join gtfs_ride gr on split_part(siri_ride.journey_ref, '-', 4) || '_' || split_part(siri_ride.journey_ref, '-', 3) || split_part(siri_ride.journey_ref, '-', 2) || substr(split_part(siri_ride.journey_ref, '-', 1), 3) = gr.journey_ref
where line_ref=7005 and operator_ref=3
and date_trunc('day', siri_ride.scheduled_start_time) = '2022-12-12'

If I take of of the journey refs from the siri rides that didn't match, they indeed don't match to any existing gtfs_ride select * from gtfs_ride where journey_ref like '3871134_121222' -- zero results

OriHoch commented 1 year ago

very good work! I think it shows a lot of different problems here..

select siri_ride.id siri_ride_id, siri_ride.journey_ref siri_journey_ref, siri_ride.scheduled_start_time siri_scheduled_start_time,
       siri_ride.journey_gtfs_ride_id, siri_ride.route_gtfs_ride_id,
       gtfs_ride.id gtfs_ride_id, gtfs_ride.journey_ref gtfs_journey_ref, gtfs_ride.start_time gtfs_start_time,
       gtfs_route.date gtfs_date
from siri_ride, gtfs_ride, gtfs_route where siri_ride.gtfs_ride_id in (27903105, 28012485)
and siri_ride.gtfs_ride_id = gtfs_ride.id
and gtfs_route.id = gtfs_ride.gtfs_route_id

siri_ride_id	siri_journey_ref	siri_scheduled_start_time	journey_gtfs_ride_id	route_gtfs_ride_id	gtfs_ride_id	gtfs_journey_ref	gtfs_start_time	gtfs_date
26179943	2022-12-13-2145399	2022-12-13 05:20:00.000000	null	28012485	28012485	2145383_121222	2022-12-13 05:20:00.000000	2022-12-13
26074926	2022-12-12-2145383	2022-12-12 05:20:00.000000	28012485	27903105	28012485	2145383_121222	2022-12-13 05:20:00.000000	2022-12-13

we see 2 siri rides here, 1 happened on 13/12 05:20 and one happend on 12/12 05:20, both are related to the same gtfs_ride from date 13/13, we shouldn't have 2 siri rides related to the same gtfs_ride
journey_ref 2022-12-13-2145399 / 2145399_121322 - does not exist in the gtfs data (when searching across all dates/rides), so this is strange and worth to check in the raw gtfs data if really such a journey ref doesn't exist
journey_ref 2022-12-12-2145383 / 2145383_121222 - according to siri this ride happened on 12/12, but according to gtfs it happened on 13/12. My guess is this has to do with our dates logic in the gtfs etl, we don't have a specific date defined in the gtfs data, but instead rely on the date in which we asked for the data. So, it's possible we asked for the data on 13/12, but we got back data for 12/12 too? This gtfs_date affects all the gtfs date/time fields, they are all taken from this date which is just the date when we asked for the data.
regarding how come the siri scheduled start time of 12/12 is related to gtfs date of 13/12 - this is possible because the limit of date is only for the gtfs date. So if for example we run the query with gtfs_route.date = '2022-12-13' and there is a siri ride with time of 12/12 which matches the journey_ref it will relate to it (this is something that shouldn't happen, as we assume that journey_ref is really a unique id per date, but I guess it's not..)

so, a lot of stuff to investigate and fix here :)

ShayAdler commented 1 year ago

Update: after more research, we found out that route_gtfs_ride_id is more accurate than journey_gtfs_ride_id (the opposite of the current logic).

I'm currently trying to fix the ETL, these are the changes I'm doing -

Add a new column (setting the others to deprecated) which will be the only one updated by the ETL and set gtfs_ride_id (We'll need to manually set gtfs_ride_id once to apply this chage).
Removing the interval match as we found out it is unnecessary

@OriHoch would like your help here regarding how to test this (I will open a PR later on)- Besides having a side script to validate the logic, I tried to load a sample db locally and play with the new queries I'm using but I'm not sure this is hermetic enough for such a big change.. Specifically, I'm worried about edge cases regarding edge-hours because we are filtering by gtfs date

-- a few hundreds every day
select count(1), gr.date from gtfs_ride
join gtfs_route gr on gtfs_ride.gtfs_route_id = gr.id
where date_trunc('day', gtfs_ride.start_time) != gr.date
group by gr.date
order by gr.date desc

OriHoch commented 1 year ago

I think the gtfs_date filter can be removed, we have other filters on scheduled_start_time / ride start_time which are enough (both contain also the date). The only logical problem I can think of is that we might have the same ride multiple times from different gtfs_dates (possibly with different ride start_times?).

Regarding testing, I don't see a problem to test on production.. given that current data is also not accurate

ShayAdler commented 1 year ago

wdym by "same ride multiple times from different gtfs_dates"? aren't gtfs_rides created daily? If you mean that there could be 2 gtfs rides scheduled to the exact same start_time, I agree that it can happen in reality, and on that case, I think it can be a bit tricky to handle (match each siri ride to a different gtfs and not both to one of the gtfs_rides randomally), but I'm not sure how much it happens and therefore how much we care about this edge case.

Regarding testing, how about first adding the internal column, and then after more validations that I'll do on the entire dataset we'll move the FQ to point on my new column?

ShayAdler commented 1 year ago

BTW, do you know how the current syntax you used for updating works id there is more than 1 match?

OriHoch commented 1 year ago

wdym by "same ride multiple times from different gtfs_dates"? aren't gtfs_rides created daily?

yes but the process of getting and processing the gtfs data is complex, it's not a simple download for a specific date, so need to try to make as few assumptions as possible about the correctness of the data as possible.. the gtfs date is just the date when we downloaded the data, so what happens if for example the MOT data was not updated at a given day? Or if we missed a download day? We try to handle all these situations but just need to take into account that there might be problems..

Regarding testing, how about first adding the internal column, and then after more validations that I'll do on the entire dataset we'll move the FQ to point on my new column?

sounds good

OriHoch commented 1 year ago

BTW, do you know how the current syntax you used for updating works if there is more than 1 match?

you need to make sure there is only 1 match.. if there is more then 1 match it will work but you can't know for sure which match will be used for the update

hasadna / open-bus-stride-etl

Rides that didn't happen - Issue linking siri rides to gtfs rides #10

ETL Verification