ghcnd_splitvars can be faster

eliocamp commented 4 years ago

The data manipulations performed by ghcnd_splitvars() can be very slow. On my machine, processing a single station ID takes almost 2 seconds:

library(rnoaa)

station <- ghcnd_stations()
station <- ghcnd(station$id[1])

system.time(ghcnd_splitvars(station))
#>    user  system elapsed 
#>   1.941   0.004   1.948

The culprit is the chain of dplyr manipulations and tidyr::gather() calls, some of which are redundant. I experimented with data.table::melt() instead and got dramatically better performance:

library(rnoaa) # Using eliocamp/rnoaa@dt-ghcnd_splitvars

station <- ghcnd_stations()
station <- ghcnd(station$id[1])

bench::mark(new = ghcnd_splitvars(station),
            old = rnoaa:::ghcnd_splitvars2(station))
#> Warning: Some expressions had a GC in every iteration; so filtering is disabled.
#> # A tibble: 2 x 6
#>   expression      min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 new         58.13ms  60.75ms    14.7      4.64MB     1.84
#> 2 old           1.93s    1.93s     0.519   12.16MB     2.08

Created on 2020-06-04 by the reprex package (v0.3.0)

So, from about 2 s to about 60 ms (median), roughly a 30x speedup!
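To illustrate the general idea, here is a minimal sketch of reshaping GHCND-style wide data with a single data.table::melt() call instead of several gather() passes. It is not the code from the branch above: the toy table, the station id "STN0001", and the lower-case output column names are assumptions made for illustration, and the VALUEn/MFLAGn/QFLAGn/SFLAGn layout is assumed to mirror what ghcnd() returns (with 3 days here instead of 31).

```r
library(data.table)

# Toy stand-in for ghcnd() output (assumed layout): one row per
# station/year/month/element, with four columns per day of the month.
wide <- data.table(
  id = "STN0001", year = 2020L, month = 1L,
  element = c("TMAX", "TMIN"),
  VALUE1 = c(250L, 120L),        MFLAG1 = " ", QFLAG1 = " ", SFLAG1 = "S",
  VALUE2 = c(260L, 130L),        MFLAG2 = " ", QFLAG2 = " ", SFLAG2 = "S",
  VALUE3 = c(NA_integer_, 110L), MFLAG3 = " ", QFLAG3 = " ", SFLAG3 = "S"
)

# One melt() call stacks all four column families at once; each pattern
# becomes its own value column, and `day` records the position within
# each family (1, 2, 3, ...).
long <- melt(
  wide,
  id.vars       = c("id", "year", "month", "element"),
  measure.vars  = patterns("^VALUE", "^MFLAG", "^QFLAG", "^SFLAG"),
  value.name    = c("value", "mflag", "qflag", "sflag"),
  variable.name = "day"
)

# `day` comes back as a factor of positions; convert it to an integer.
long[, day := as.integer(as.character(day))]

# One table per element, similar in spirit to a per-variable split.
by_element <- split(long, by = "element")
```

Because melt() handles all the measure columns in one pass, there is no need to gather each flag family separately and join the results back together, which is where I would expect most of the saving to come from.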

If you like, I can create a PR with the change (minus ghcnd_splitvars2, of course).

sckott commented 4 years ago

Thanks @eliocamp - A speed up would be nice. A PR sounds good.

eliocamp commented 4 years ago

I opened the PR. As far as I could check, the output is the same as before, and the only tests that fail are unrelated to this function.

ropensci / rnoaa

ghcnd_splitvars can be faster #352