ropensci / stplanr

Sustainable transport planning with R
https://docs.ropensci.org/stplanr
Other
419 stars 66 forks source link

Switch parts of overline to data.table #517

Closed mem48 closed 1 year ago

mem48 commented 1 year ago

@Robinlovelace a very quick look at speeding up overline2. I've replaced some dplyr code with data.table

bench::mark(t1 = stplanr::overline2(r, c("bicycle","car_driver","all")),
            t2 = overline2(r, attrib = c("bicycle","car_driver","all")),
            check = FALSE)

# A tibble: 2 × 13
  expression    min median `itr/sec` mem_alloc `gc/sec` n_itr  n_gc total_time result
  <bch:expr> <bch:> <bch:>     <dbl> <bch:byt>    <dbl> <int> <dbl>   <bch:tm> <list>
1 t1          45.3s  45.3s    0.0221   13.46GB    0.728     1    33      45.3s <NULL>
2 t2          24.6s  24.6s    0.0406    2.06GB    1.50      1    37      24.6s <NULL>

t1 <- t1[order(t1$bicycle, t1$all),]
t2 <- t2[order(t2$bicycle, t2$all),]
identical(t1, t2) # TRUE

Resutls come out in a different order, but I don't think that matters. I've not tested all use cases like large data and use multicore.

Its a modest speed inmprovements but a major reduction in memory usage which should help.

There is another bit of dplyr code I can't work out how to do in data.table

sls <- dplyr::group_by_at(sl, c("1", "2", "3", "4"))
sls <- dplyr::ungroup(dplyr::summarise_all(sls, .funs = fun))

but the fix I have made is to only make one object sls rather than making an slg object that is never used again.

mem48 commented 1 year ago

Test done with

r = readRDS("../../nptscot/npt/outputdata/2023-07-09/routes_max_dist_commute_fastest.Rds")
nrow(r) # 257284

So 257284 segments in Edinburgh

Robinlovelace commented 1 year ago

Sounds promising. If I recall correctly I was seeing more than a 2x speed-up in tests I was doing a few weeks ago: https://github.com/Robinlovelace/overline-tests

Just opened-up, was aiming for the tests to be a bit more ready before.

:+1: to speeding-up overline!

mem48 commented 1 year ago

@Robinlovelace I've now checked this and got the multicore support working, although it seems to matter a lot less with data.table can we get this merged as it would benefit the NPT builds

mem48 commented 1 year ago

Sounds promising. If I recall correctly I was seeing more than a 2x speed-up in tests I was doing a few weeks ago: https://github.com/Robinlovelace/overline-tests

Perhaps I'm not understanding, but I only see small speedups on some of the faster parts of the process?

Robinlovelace commented 1 year ago

Will take a look.

Robinlovelace commented 1 year ago

can we get this merged as it would benefit the NPT builds

You can use this branch with

remotes::install_github("ropensci/stplanr@overline-dt")
Robinlovelace commented 1 year ago

5x reduction in memory use = v. promising.

mpadge commented 1 year ago

@Robinlovelace @mem48 FYI After a heap of CRAN rejections for one of my pkgs with lots of data.table usage, I learnt that new CRAN checks mean that I needed this line at the top of all examples, tests, and vignettes:

data.table::setDTthreads(1L)

Without that, CRAN autochecks always rejected because of "ratio of CPU to elaped time > X". In case that helps.

Robinlovelace commented 1 year ago

Handy. Probably over-due a CRAN release. Thanks Mark!