r-transit / tidytransit

R package for working with GTFS data
https://r-transit.github.io/tidytransit/

Improve duplicated primary key check #203

Status: Closed (polettif closed 1 year ago)

polettif commented 1 year ago

As it turns out, the current check for duplicated primary keys is very slow on larger feeds. This PR uses data.table::anyDuplicated instead, which cuts the runtime by more than 10x (!) going by the median timings below. Example benchmarks:

With the NYC dataset:

nyc = gtfsio::import_gtfs(system.file("extdata", "google_transit_nyc_subway.zip", package = "tidytransit"))
duplicated_primary_keys(nyc)
Unit: milliseconds
 expr       min       lq      mean    median        uq      max neval
  old 531.28157 624.1847 673.41524 696.80770 714.99092 763.3479    20
  new  26.70495  30.0662  43.68424  33.98627  46.30624 106.0321    20

With the Swiss dataset:

Unit: seconds
 expr       min        lq      mean    median        uq       max neval
  old 16.113029 17.393730 18.626484 17.964645 20.772904 20.888113     5
  new  1.063265  1.117334  1.569431  1.244815  1.675949  2.745791     5

Maybe there are faster ways to check for duplicated keys, but I can't think of one currently. Checking for unique keys in large datasets is bound to come with some cost.
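For readers who have not used the data.table method, the kind of check described above can be sketched as follows. This is a minimal illustration, not the code in this PR: `has_duplicated_key` and the toy `stop_times` table are hypothetical, and the only library behaviour relied on is that `anyDuplicated()` on a data.table accepts a `by` argument and returns the index of the first duplicated row, or 0 if there is none.

```r
library(data.table)

# Hypothetical helper (not tidytransit's implementation):
# TRUE if any combination of `key_cols` occurs more than once in `tbl`.
has_duplicated_key <- function(tbl, key_cols) {
  dt <- as.data.table(tbl)
  # Dispatches to the data.table method; the result is 0 when no
  # row repeats an earlier row on the columns given in `by`.
  anyDuplicated(dt, by = key_cols) > 0
}

# Toy stop_times-like table; in GTFS, (trip_id, stop_sequence)
# is the primary key of stop_times.txt.
stop_times <- data.frame(
  trip_id       = c("t1", "t1", "t2", "t2"),
  stop_sequence = c(1L, 2L, 1L, 1L),
  stop_id       = c("a", "b", "a", "c")
)

# Row 4 repeats the key of row 3, so the full table has a duplicate.
has_duplicated_key(stop_times, c("trip_id", "stop_sequence"))        # TRUE
# Dropping row 4 leaves only unique keys.
has_duplicated_key(stop_times[1:3, ], c("trip_id", "stop_sequence")) # FALSE
```

Because `anyDuplicated()` only has to report whether a duplicate exists, it avoids building the full logical vector that `duplicated()` returns, which is presumably part of why it suits a yes/no validation check.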

codecov-commenter commented 1 year ago

Codecov Report

Merging #203 (d6924d2) into master (18cf9b4) will increase coverage by 0.00%. The diff coverage is 100.00%.

Note: Current head d6924d2 differs from pull request most recent head 40719ac. Consider uploading reports for the commit 40719ac to get more accurate results.

@@           Coverage Diff           @@
##           master     #203   +/-   ##
=======================================
  Coverage   99.91%   99.91%           
=======================================
  Files          16       16           
  Lines        1119     1120    +1     
=======================================
+ Hits         1118     1119    +1     
  Misses          1        1           

Impacted Files       Coverage Δ
R/validate_gtfs.R    100.00% <100.00%> (ø)

... and 1 file with indirect coverage changes