rOpenGov / eurostat

R tools for Eurostat data
http://ropengov.github.io/eurostat

Better performance with big datasets #277

Closed pitkant closed 8 months ago

pitkant commented 11 months ago

As mentioned in #98, big datasets can cause R to hang or crash. I also experienced this myself when running tests on weekly data (see issue #200), which is obviously bigger than monthly/yearly data.

I have tested replacing some old functions with data.table-based implementations on the migr_asyappctzm dataset and got the following promising benchmarks (best-case scenarios in bold); a sketch of how such timings could be reproduced follows the list:

get_eurostat_raw

tidy_eurostat

saveRDS
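
For reference, a minimal sketch of how per-step timings like these could be reproduced with system.time(). get_eurostat_raw() and tidy_eurostat() are internal package functions, and the arguments shown here are assumptions rather than a definitive API:

```r
# Assumed workflow: time each processing step for one large dataset.
# get_eurostat_raw() and tidy_eurostat() are internal eurostat functions;
# the exact signatures used below are illustrative only.
id <- "migr_asyappctzm"

t_raw  <- system.time(raw  <- eurostat:::get_eurostat_raw(id))
t_tidy <- system.time(tidy <- eurostat:::tidy_eurostat(raw, time_format = "date"))
t_save <- system.time(saveRDS(tidy, file.path(tempdir(), paste0(id, ".rds"))))

rbind(raw = t_raw, tidy = t_tidy, save = t_save)
```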

So in the worst case scenario the whole processing took 4 min 20 sec, whereas in the best case scenario it took 1 min 20 sec. Some seconds could be shaved off by not compressing the cache file or by turning caching off, but I don't think it's worth it. The added benefit of using data.table and its := operations is that no copies of the object are made in memory, as operations are performed in place:

A shallow copy is just a copy of the vector of column pointers (corresponding to the columns in a data.frame or data.table). The actual data is not physically copied in memory.

A deep copy on the other hand copies the entire data to another location in memory.

With data.table’s := operator, absolutely no copies are made in both (1) and (2), irrespective of R version you are using. This is because := operator updates data.table columns in-place (by reference).
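
For illustration, here is a minimal sketch (not the package's actual implementation) of the kind of in-place cleaning step that := allows, e.g. recoding Eurostat's ":" missing-value marker without copying the table; the column names are illustrative only:

```r
library(data.table)

# Toy data mimicking a raw Eurostat column; names are illustrative, not the real schema
dt <- data.table(geo = c("FI", "SE", "DE"), values = c("1 234", ":", "567"))

# := updates columns by reference: dt is modified in place, no deep copy is created
dt[, values := gsub("[[:space:]]", "", values)]   # strip thousands separators
dt[values == ":", values := NA_character_]        # recode ":" as NA
dt[, values := as.numeric(values)]                # convert to numeric in place
```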

I think there might still be some quirks in utilising data.table as opposed to the old method, so I've made the data.table code optional for now. Feedback and suggestions are very much welcome in this issue or in the forthcoming PR that I will link to this issue.

pitkant commented 11 months ago

For even better performance with big datasets, the options would be to use packages like ff (see the bigdata2018 course materials, an example of loading big Kaggle data, an R-bloggers blog post), bigmemory (see the bigmemory overview vignette from 2010), or some database format such as RSQLite (see the RSQLite vignette). I would personally prefer the database approach.
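
As a rough sketch of the database idea (illustrative only; the cache file, table name, and column names are assumptions, not part of the package):

```r
library(DBI)
library(RSQLite)

# Illustrative chunk of tidied data; in practice chunks would come from the parser
chunk <- data.frame(
  geo    = c("FI", "SE"),
  time   = c("2020-01-01", "2020-01-01"),
  values = c(125, 310)
)

# Hypothetical cache file; a real implementation would place this in the eurostat cache dir
con <- dbConnect(RSQLite::SQLite(), file.path(tempdir(), "eurostat_cache.sqlite"))

# Append chunks as they are processed instead of holding the full dataset in RAM
dbWriteTable(con, "migr_asyappctzm", chunk, append = TRUE)

# Later, pull only the slice that is actually needed
res <- dbGetQuery(con, "SELECT * FROM migr_asyappctzm WHERE geo = 'FI'")

dbDisconnect(con)
```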

An experimental branch could be created for testing this approach but I would not include these features in 4.0.0.

pitkant commented 11 months ago

Big datasets with inadequate amounts of RAM seem to be quite torturous for the system. Tested on a cloud-based Windows 10 Enterprise machine with an Intel Xeon 6242R and 6 GB of allocated RAM: downloading and handling the aforementioned dataset without data.table failed after 20 minutes with the error "cannot allocate vector of size 614.9 Mb" (roughly the size of a single numeric column for the dataset's ~80 million values at 8 bytes each). It was actually surprising that the process was able to go on as long as it did, as it seemed to involve quite a lot of swapping to the HDD; reads and writes were pretty high while the function was running. When attempting to use the data.table functions it crashed R as well, but it crashed a lot quicker, so maybe it wastes less of your time...

Some potential improvements:

Some more general notes on the problems I encountered while working with a Virtual Machine / Windows:

Session info
```
R version 4.2.1 (2022-06-23 ucrt)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 19045)

Matrix products: default

locale:
[1] LC_COLLATE=Finnish_Finland.utf8
[2] LC_CTYPE=Finnish_Finland.utf8
[3] LC_MONETARY=Finnish_Finland.utf8
[4] LC_NUMERIC=C
[5] LC_TIME=Finnish_Finland.utf8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods
[7] base

other attached packages:
[1] eurostat_4.0.0.9005

loaded via a namespace (and not attached):
 [1] tidyselect_1.2.0   httr2_0.2.3        remotes_2.4.2.1
 [4] purrr_1.0.2        generics_0.1.3     vctrs_0.6.3
 [7] utf8_1.2.3         rlang_1.1.1        pkgbuild_1.3.1
[10] e1071_1.7-13       pillar_1.9.0       glue_1.6.2
[13] withr_2.5.1        rappdirs_0.3.3     bit64_4.0.5
[16] readxl_1.4.3       lifecycle_1.0.3    plyr_1.8.8
[19] stringr_1.5.0      cellranger_1.1.0   callr_3.7.1
[22] tzdb_0.4.0         ps_1.7.1           parallel_4.2.1
[25] curl_4.3.2         class_7.3-20       fansi_1.0.4
[28] Rcpp_1.0.11        KernSmooth_2.23-20 readr_2.1.4
[31] backports_1.4.1    classInt_0.4-10    vroom_1.6.3
[34] jsonlite_1.8.7     bit_4.0.5          countrycode_1.5.0
[37] hms_1.1.3          digest_0.6.33      stringi_1.7.12
[40] processx_3.7.0     dplyr_1.1.3        rprojroot_2.0.3
[43] bibtex_0.5.1       here_1.0.1         ISOweek_0.6-2
[46] cli_3.6.1          tools_4.2.1        magrittr_2.0.3
[49] proxy_0.4-27       tibble_3.2.1       RefManageR_1.4.0
[52] tidyr_1.3.0        crayon_1.5.2       pkgconfig_2.0.3
[55] regions_0.1.8      data.table_1.14.8  xml2_1.3.5
[58] prettyunits_1.2.0  timechange_0.2.0   lubridate_1.9.3
[61] assertthat_0.2.1   httr_1.4.7         rstudioapi_0.13
[64] R6_2.5.1           compiler_4.2.1
```
pitkant commented 11 months ago

@ake123 and I ran tests with the migr_asyappctzm dataset and were able to handle it (albeit slowly) even on a Mac with 8 GB of RAM, both with use.data.table = TRUE and with use.data.table = FALSE. However, it did not work with stringsAsFactors = FALSE, so keeping that option on is advisable.
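
For reference, a call along these lines was used; the argument names are those of the development version discussed here (4.0.0.9005), so treat the exact interface as provisional:

```r
library(eurostat)

# Provisional arguments from the development version discussed in this issue
dat <- get_eurostat(
  "migr_asyappctzm",
  use.data.table   = TRUE,  # optional data.table-based processing
  stringsAsFactors = TRUE   # factors are far lighter in memory than repeated strings
)
```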

Handling big datasets scales pretty badly: "migr_asyappctzm" has only 80,708,892 (about 80 million) values, whereas the biggest dataset in Eurostat, "ef_lsk_main", has 148,362,539 (about 148 million) values, and my computer with 16 GB of RAM was not able to handle the latter despite it having less than twice as many values. Of course it might have more columns as well; I have not checked the data file contents that thoroughly.

Using packages that allow for partial loading of datasets (such as SQLite and MonetDB) would seem like the next logical step to take. See Laura DeCicco's blog post on the US Geological Survey website for benchmarks and comparisons of different approaches: Working with pretty big data in R.
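
A minimal sketch of what partial loading could look like on top of an SQLite cache like the one sketched above (assuming the dbplyr backend is available; the table and column names are illustrative):

```r
library(DBI)
library(dplyr)   # database backends additionally require dbplyr

con <- dbConnect(RSQLite::SQLite(), file.path(tempdir(), "eurostat_cache.sqlite"))

# Lazy table reference: nothing is read into memory yet
migr <- tbl(con, "migr_asyappctzm")

# Filtering is translated to SQL and executed in the database;
# only the matching rows are pulled into R by collect()
fi_recent <- migr |>
  filter(geo == "FI", time >= "2022-01-01") |>
  collect()

dbDisconnect(con)
```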

pitkant commented 8 months ago

There is still work to be done on this issue in subsequent releases, but it has been solved with a minimum viable solution for now. Further developments will be tracked in a new issue. Closed with the CRAN release of package version 4.0.0.