All the data manipulations performed by ghcnd_splitvars() can be very slow. On my machine, processing a single station id takes almost 2 seconds:
library(rnoaa)
station <- ghcnd_stations()
station <- ghcnd(station$id[1])
system.time(ghcnd_splitvars(station))
#> user system elapsed
#> 1.941 0.004 1.948
The culprit seems to be the many dplyr manipulations and tidyr::gather() calls, some of which are redundant. I experimented a little with data.table::melt() and got dramatically better performance:
library(rnoaa) # Using eliocamp/rnoaa@dt-ghcnd_splitvars
station <- ghcnd_stations()
station <- ghcnd(station$id[1])
bench::mark(new = ghcnd_splitvars(station),
old = rnoaa:::ghcnd_splitvars2(station))
#> Warning: Some expressions had a GC in every iteration; so filtering is disabled.
#> # A tibble: 2 x 6
#> expression min median `itr/sec` mem_alloc `gc/sec`
#> <bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl>
#> 1 new 58.13ms 60.75ms 14.7 4.64MB 1.84
#> 2 old 1.93s 1.93s 0.519 12.16MB 2.08
Created on 2020-06-04 by the reprex package (v0.3.0)
So, from about 2 s to roughly 60 ms per station (median 1.93 s vs. 60.75 ms in the benchmark above)!
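As a rough illustration of the idea, here is a toy sketch of reshaping with a single data.table::melt() call. It is not the code in the branch: the table `wide`, the station id and the values are made up, and only two days and one flag type are included, assuming a layout with one VALUE/MFLAG column per day.

```r
library(data.table)

# Toy stand-in for the wide GHCND layout: one row per station/month/element,
# one VALUE/MFLAG column per day (only 2 days here, for brevity).
wide <- data.table(
  id      = "STATION1",            # hypothetical station id
  year    = 2020L,
  month   = 1L,
  element = c("TMAX", "TMIN"),
  VALUE1  = c(250L, 100L),
  VALUE2  = c(260L, NA),
  MFLAG1  = c(" ", " "),
  MFLAG2  = c(" ", " ")
)

# One melt() call reshapes all VALUE* and MFLAG* columns at once.
long <- melt(
  wide,
  id.vars       = c("id", "year", "month", "element"),
  measure.vars  = patterns("^VALUE", "^MFLAG"),
  variable.name = "day",
  value.name    = c("value", "mflag")
)

# With several patterns, `day` is the position of the column within each
# pattern, which equals the day of month because the columns are in order.
long[, day := as.integer(day)]
long[, date := as.Date(sprintf("%d-%02d-%02d", year, month, day))]
long <- long[!is.na(value)]

# One table per element, similar in spirit to the output of ghcnd_splitvars().
by_element <- split(long, by = "element")
```

The point of the sketch is that the values and flags are reshaped together in one pass, instead of gathering each group of columns separately and joining the results afterwards.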
If you like, I can create a PR with the change (minus ghcnd_splitvars2, of course).