nptscot / npt

Data processing code for the Network Planning Tool; this repo is also used for issue tracking. See https://nptscot.github.io for the development version
https://www.npt.scot/
GNU Affero General Public License v3.0

There are some duplicated route_id #375

Closed: joeytalbot closed this issue 6 months ago

joeytalbot commented 6 months ago

In the utility trips, some route_id values are duplicated.

> tar_load(r_utility_fastest)
> dim(r_utility_fastest)
[1] 44051    21
> length(unique(r_utility_fastest$route_number))
[1] 44051
> length(unique(r_utility_fastest$route_id))
[1] 44037

I think this is because the route_id values are generated separately for shopping, visiting and leisure trips, and when these are combined, some happen to be identical.

I'm not sure why we need to create both route_id and route_number, but it might be better to use route_number instead as a grouping variable.
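To illustrate the suspected cause, here is a minimal sketch on toy data. The three data frames, the `purpose` column and the ID format are all hypothetical and only stand in for per-purpose route sets; the real targets may build `route_id` differently. It shows how IDs that are unique within each purpose can collide once the sets are combined, and how to list the affected rows.

```r
# Toy per-purpose route sets; route_id is unique within each set.
# The ID scheme (origin and destination codes pasted together) is an assumption.
shopping <- data.frame(route_id = c("A_B", "A_C"), purpose = "shopping")
visiting <- data.frame(route_id = c("A_B", "B_C"), purpose = "visiting")
leisure  <- data.frame(route_id = c("C_D", "B_C"), purpose = "leisure")

# Combine the sets, then number the rows as route_number does
utility <- rbind(shopping, visiting, leisure)
utility$route_number <- seq_len(nrow(utility))

# route_number is unique per row, route_id is not
length(unique(utility$route_number))
length(unique(utility$route_id))

# IDs that occur more than once, and the rows that carry them
dup_ids <- unique(utility$route_id[duplicated(utility$route_id)])
utility[utility$route_id %in% dup_ids, ]
```

Running the last two lines against `r_utility_fastest` instead of the toy data should show which purposes the 14 duplicated rows come from.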

joeytalbot commented 6 months ago

For confirmation, in the commute and school data, each row has both a unique route_id and a unique route_number:

> tar_load(r_commute_fastest)
> dim(r_commute_fastest)
[1] 433255     18
> length(unique(r_commute_fastest$route_number))
[1] 433255
> length(unique(r_commute_fastest$route_id))
[1] 433255
> tar_load(r_school_fastest)
> dim(r_school_fastest)
[1] 55975    18
> length(unique(r_school_fastest$route_number))
[1] 55975
> length(unique(r_school_fastest$route_id))
[1] 55975

joeytalbot commented 6 months ago

route_id is used as a grouping variable in a large number of functions. We could either change all of these to use route_number instead, or we could reset route_id so it is unique for every row.
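The two options can be sketched on toy data as follows. This is only an illustration, not necessarily the approach taken in the codebase: the `purpose` column is hypothetical, and prefixing it onto `route_id` only works if IDs are already unique within each purpose.

```r
# Toy combined routes with a colliding route_id ("A_B")
routes <- data.frame(
  route_id = c("A_B", "A_C", "A_B", "B_C"),
  purpose  = c("shopping", "shopping", "visiting", "visiting"),
  stringsAsFactors = FALSE
)
routes$route_number <- seq_len(nrow(routes))

# Option 1: group on route_number, which is already unique per row
groups_by_number <- split(routes, routes$route_number)

# Option 2: reset route_id so it is unique for every row.
# Prefixing the (hypothetical) purpose column keeps the ID readable.
routes$route_id <- paste(routes$purpose, routes$route_id, sep = "_")

# anyDuplicated() returns 0 when every value is unique
stopifnot(anyDuplicated(routes$route_id) == 0)
```

Option 2 leaves every function that groups by route_id unchanged, whereas option 1 requires editing each of them.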

joeytalbot commented 6 months ago

Any preferences? @mem48 @Robinlovelace

joeytalbot commented 6 months ago

This is fixed in #377