hasadna / open-bus-stride-etl

ETL processing tasks for Stride data enrichment
MIT License
0 stars 1 forks source link

Rides that didn't happen - Issue linking siri rides to gtfs rides #10

Open ShayAdler opened 1 year ago

ShayAdler commented 1 year ago

I wanted to use the absence of a link between gtfs ride to a siri ride (siri_ride.gtfs_ride_id is null) to categorize this planned ride as a ride that did not happen. As far as I understand from the data model, this hypothesis makes sense (of course that there are other cases in which there is siri data and the ride did not happen \ finish successfully, but I wanted to start with the easiest case).

But when verifying the data, I noticed gaps between the siri ride counts to the siri rides I found unlinked. This is the flow I checked on Eged's line 470:

-- how many rides did and did not happen per day
select date_trunc('day', gr.start_time), sr.id is null didnt_happen, count(*) from gtfs_route
    join gtfs_ride gr on gtfs_route.id = gr.gtfs_route_id
left join  siri_ride sr on gr.id = sr.gtfs_ride_id
where
gr.start_time > '2022-12-10 00:45:00.000000' and gtfs_route.line_ref=7005 and gtfs_route.operator_ref=3
GROUP BY date_trunc('day', gr.start_time), sr.id is null

Output -

image

Now I zoomed in to the 12.12.22 which seemed to have 5 missing rides-

-- how many gtfs rides happened in a specific date
select count(*) from gtfs_route
         join gtfs_ride gr on gtfs_route.id = gr.gtfs_route_id
where line_ref=7005 and operator_ref=3 and date='2022-12-12'

-- how many siri rides happened in a specific date
select count(*) from siri_route
         join siri_ride sr on siri_route.id = sr.siri_route_id
where line_ref=7005 and operator_ref=3
and date_trunc('day', sr.scheduled_start_time) = '2022-12-12'

Both returned 39, meaning that probably all the planned rides did happen (maybe in different timings than expected?).

Than, I went looking at the ETL code that updates siri_ride.gtfs_ride_id (open-bus-stride-etl/open_bus_stride_etl/siri/update_rides_gtfs.py) I manually ran the query. A subtraction of all the siri ride ids that did have gtfs relation from all the siri rides that day, gave me the unlinked ids - {‘26074926’, ‘26077170’, ‘26094950’, ‘26100800’, ‘26106606’}

ETL Verification

Running the ETL query that links by route_gtfs_ride_id returns no results because siri_ride.journey_gtfs_ride_id is not null

select gtfs_route.date, *
from siri_ride
         -- we search for both options (comment out one when executing) -
         -- join gtfs_ride on gtfs_ride.id = siri_ride.journey_gtfs_ride_id
         join gtfs_ride on gtfs_ride.id = siri_ride.route_gtfs_ride_id
         join gtfs_route on gtfs_route.id = gtfs_ride.gtfs_route_id
    and gtfs_route.date = '2022-12-12'
    and siri_ride.journey_gtfs_ride_id is null
where siri_ride.id in ('26074926', '26077170', '26094950', '26100800', '26106606') -- rides that did not connect well
-- where siri_ride.id in ('26064288', '26065379') -- ride that are connected

Running the ETL query that links by journey_gtfs_ride_id returns no results either because for some reason in that case the gtfs_route.data is '2022-12-13'

select gtfs_route.date, *
from siri_ride
         -- we search for both options (comment out one when executing) -
                join gtfs_ride on gtfs_ride.id = siri_ride.journey_gtfs_ride_id
--          join gtfs_ride on gtfs_ride.id = siri_ride.route_gtfs_ride_id
         join gtfs_route on gtfs_route.id = gtfs_ride.gtfs_route_id
--     and gtfs_route.date = '2022-12-12'
--     and siri_ride.journey_gtfs_ride_id is null
where siri_ride.id in ('26074926', '26077170', '26094950', '26100800', '26106606') -- rides that did not connect well

@OriHoch I'd very much like your help to understand the full flow of this ETL and specifically - what are the differences between journey_gtfs_ride_id and route_gtfs_ride_id and why when we link only when one of them is valid?

ShayAdler commented 1 year ago

Might not be related, but as another sanity, I tried a simple comparison of the counts, and the results are weird (sometimes there are more siri rides than gtfs)

-- compare counts only
select date_part('dow', gtfs.date) day_of_week, gtfs.date, gtfs.count, siri.count, gtfs.count - siri.count missing_rides from
(select gtfs_route.date date, count(*) count from gtfs_route
         join gtfs_ride gr on gtfs_route.id = gr.gtfs_route_id
where line_ref=7005 and operator_ref=3
and gtfs_route.date > '2022-12-10'
group by gtfs_route.date) AS gtfs
join (
select date_trunc('day', sr.scheduled_start_time) date, count(*) count from siri_route
         join siri_ride sr on siri_route.id = sr.siri_route_id
where line_ref=7005 and operator_ref=3
and date_trunc('day', sr.scheduled_start_time) > '2022-12-10'
group by date_trunc('day', sr.scheduled_start_time)) AS siri
on siri.date=gtfs.date;

Output -

image

Maybe the drivers had some spare time 😅

OriHoch commented 1 year ago

@OriHoch I'd very much like your help to understand the full flow of this ETL and specifically - what are the differences between journey_gtfs_ride_id and route_gtfs_ride_id and why when we link only when one of them is valid?

regarding linking - we prefer the journey_gtfs_ride_id as it's more exact based on the journey_ref... I don't remember why I added the other method, I guess I found some rides which didn't match based on journey_ref.

OriHoch commented 1 year ago

this ETL depends on a lot of other data from SIRI and GTFS, so there could be different places which cause the mismatch in rides. I suggest to go back to the source raw SIRI/GTFS data and try to match those rides there.

OriHoch commented 1 year ago

one more point to take into account when looking for missing data is if all the relevant data was loaded for the given times. If there was an error in the MOT interfaces or in our ETL you might see missing data for rides that did happen..

I started working on adding DB tables that show this but didn't finish.

However, there are some tables which show this data -

ShayAdler commented 1 year ago

@OriHoch I'd very much like your help to understand the full flow of this ETL and specifically - what are the differences between journey_gtfs_ride_id and route_gtfs_ride_id and why when we link only when one of them is valid?

  • journey_gtfs_ride_id - matches gtfs_ride.journey_ref to siri_ride.journey_ref here
  • route_gtfs_ride_id - is more complex, it looks for matching operator_ref, line_ref, route_id and date, then looks at siri_ride.scheduled_start_time matched with gtfs_ride.start_time in a few intervals of minutes here

regarding linking - we prefer the journey_gtfs_ride_id as it's more exact based on the journey_ref... I don't remember why I added the other method, I guess I found some rides which didn't match based on journey_ref.

So I took a specific siri ride that didn't match (26074926) and if we look at it's journey_ref (2145383_121222) we can see that it is not unique for some reason (we're looking in the raw data to find if we can find this duplication there as well)

image

For some reason, the one that's related to the 12.12 siri_ride is the gtfs_ride from the 13.13, even though I see you do filter by date - https://github.com/hasadna/open-bus-stride-etl/blob/6b1fc4a504fc7c927b6ab0b9cd45bb47d41dd920/open_bus_stride_etl/siri/update_rides_gtfs.py#L57 Do you have any idea why would it happen?

ShayAdler commented 1 year ago

If I try to manually join siri_rides to gtfs_ride using the same reformatting you do in the ETL, I see that only 10 out of 39 siri rides in that day link based on the journey_ref. Meaning that the majority are linked based on the route_gtfs_ride_id (which as you said is more heuristic).

select * from siri_route
         join siri_ride on siri_route.id = siri_ride.siri_route_id
         left join gtfs_ride gr on split_part(siri_ride.journey_ref, '-', 4) || '_' || split_part(siri_ride.journey_ref, '-', 3) || split_part(siri_ride.journey_ref, '-', 2) || substr(split_part(siri_ride.journey_ref, '-', 1), 3) = gr.journey_ref
where line_ref=7005 and operator_ref=3
and date_trunc('day', siri_ride.scheduled_start_time) = '2022-12-12'

If I take of of the journey refs from the siri rides that didn't match, they indeed don't match to any existing gtfs_ride select * from gtfs_ride where journey_ref like '3871134_121222' -- zero results

OriHoch commented 1 year ago

very good work! I think it shows a lot of different problems here..

select siri_ride.id siri_ride_id, siri_ride.journey_ref siri_journey_ref, siri_ride.scheduled_start_time siri_scheduled_start_time,
       siri_ride.journey_gtfs_ride_id, siri_ride.route_gtfs_ride_id,
       gtfs_ride.id gtfs_ride_id, gtfs_ride.journey_ref gtfs_journey_ref, gtfs_ride.start_time gtfs_start_time,
       gtfs_route.date gtfs_date
from siri_ride, gtfs_ride, gtfs_route where siri_ride.gtfs_ride_id in (27903105, 28012485)
and siri_ride.gtfs_ride_id = gtfs_ride.id
and gtfs_route.id = gtfs_ride.gtfs_route_id
siri_ride_id siri_journey_ref siri_scheduled_start_time journey_gtfs_ride_id route_gtfs_ride_id gtfs_ride_id gtfs_journey_ref gtfs_start_time gtfs_date
26179943 2022-12-13-2145399 2022-12-13 05:20:00.000000 null 28012485 28012485 2145383_121222 2022-12-13 05:20:00.000000 2022-12-13
26074926 2022-12-12-2145383 2022-12-12 05:20:00.000000 28012485 27903105 28012485 2145383_121222 2022-12-13 05:20:00.000000 2022-12-13

so, a lot of stuff to investigate and fix here :)

ShayAdler commented 1 year ago

Update: after more research, we found out that route_gtfs_ride_id is more accurate than journey_gtfs_ride_id (the opposite of the current logic).

I'm currently trying to fix the ETL, these are the changes I'm doing -

@OriHoch would like your help here regarding how to test this (I will open a PR later on)- Besides having a side script to validate the logic, I tried to load a sample db locally and play with the new queries I'm using but I'm not sure this is hermetic enough for such a big change.. Specifically, I'm worried about edge cases regarding edge-hours because we are filtering by gtfs date

-- a few hundreds every day
select count(1), gr.date from gtfs_ride
join gtfs_route gr on gtfs_ride.gtfs_route_id = gr.id
where date_trunc('day', gtfs_ride.start_time) != gr.date
group by gr.date
order by gr.date desc
OriHoch commented 1 year ago

I think the gtfs_date filter can be removed, we have other filters on scheduled_start_time / ride start_time which are enough (both contain also the date). The only logical problem I can think of is that we might have the same ride multiple times from different gtfs_dates (possibly with different ride start_times?).

Regarding testing, I don't see a problem to test on production.. given that current data is also not accurate

ShayAdler commented 1 year ago

wdym by "same ride multiple times from different gtfs_dates"? aren't gtfs_rides created daily? If you mean that there could be 2 gtfs rides scheduled to the exact same start_time, I agree that it can happen in reality, and on that case, I think it can be a bit tricky to handle (match each siri ride to a different gtfs and not both to one of the gtfs_rides randomally), but I'm not sure how much it happens and therefore how much we care about this edge case.

Regarding testing, how about first adding the internal column, and then after more validations that I'll do on the entire dataset we'll move the FQ to point on my new column?

ShayAdler commented 1 year ago

BTW, do you know how the current syntax you used for updating works id there is more than 1 match?

OriHoch commented 1 year ago

wdym by "same ride multiple times from different gtfs_dates"? aren't gtfs_rides created daily?

yes but the process of getting and processing the gtfs data is complex, it's not a simple download for a specific date, so need to try to make as few assumptions as possible about the correctness of the data as possible.. the gtfs date is just the date when we downloaded the data, so what happens if for example the MOT data was not updated at a given day? Or if we missed a download day? We try to handle all these situations but just need to take into account that there might be problems..

Regarding testing, how about first adding the internal column, and then after more validations that I'll do on the entire dataset we'll move the FQ to point on my new column?

sounds good

OriHoch commented 1 year ago

BTW, do you know how the current syntax you used for updating works if there is more than 1 match?

you need to make sure there is only 1 match.. if there is more then 1 match it will work but you can't know for sure which match will be used for the update