rOpenGov / eurostat

R tools for Eurostat data
http://ropengov.github.io/eurostat

Better performance with big datasets #277

Closed pitkant closed 8 months ago

pitkant commented 11 months ago

As mentioned in #98, big datasets can cause R to hang or crash. I also experienced this myself when running tests on weekly data (see issue #200), which is obviously bigger than monthly/yearly data.

I have tested replacing some old functions with data.table-based implementations on the migr_asyappctzm dataset and got the following promising benchmarks (best-case scenarios in bold); a sketch of how such timings could be reproduced follows the list:

get_eurostat_raw

tidy_eurostat

saveRDS
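
For reference, a minimal sketch of how per-step timings like these could be reproduced with system.time(). get_eurostat_raw() and tidy_eurostat() are internal package functions, and the arguments shown here are assumptions rather than a definitive API:

```r
# Assumed workflow: time each processing step for one large dataset.
# get_eurostat_raw() and tidy_eurostat() are internal eurostat functions;
# the exact signatures used below are illustrative only.
id <- "migr_asyappctzm"

t_raw  <- system.time(raw  <- eurostat:::get_eurostat_raw(id))
t_tidy <- system.time(tidy <- eurostat:::tidy_eurostat(raw, time_format = "date"))
t_save <- system.time(saveRDS(tidy, file.path(tempdir(), paste0(id, ".rds"))))

rbind(raw = t_raw, tidy = t_tidy, save = t_save)
```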

So in the worst case scenario the whole processing took 4 min 20 sec, whereas in the best case scenario it took 1 min 20 sec. Some seconds could be shaved off by not compressing the cache file or by turning caching off, but I don't think it's worth it. The added benefit of using data.table and its := operations is that no copies of the object are made in memory, as operations are performed in place:

A shallow copy is just a copy of the vector of column pointers (corresponding to the columns in a data.frame or data.table). The actual data is not physically copied in memory.

A deep copy on the other hand copies the entire data to another location in memory.

With data.table’s := operator, absolutely no copies are made in both (1) and (2), irrespective of R version you are using. This is because := operator updates data.table columns in-place (by reference).
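
For illustration, here is a minimal sketch (not the package's actual implementation) of the kind of in-place cleaning step that := allows, e.g. recoding Eurostat's ":" missing-value marker without copying the table; the column names are illustrative only:

```r
library(data.table)

# Toy data mimicking a raw Eurostat column; names are illustrative, not the real schema
dt <- data.table(geo = c("FI", "SE", "DE"), values = c("1 234", ":", "567"))

# := updates columns by reference: dt is modified in place, no deep copy is created
dt[, values := gsub("[[:space:]]", "", values)]   # strip thousands separators
dt[values == ":", values := NA_character_]        # recode ":" as NA
dt[, values := as.numeric(values)]                # convert to numeric in place
```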

I think there might still be some quirks in utilising data.table as opposed to the old method, so I've made the data.table code optional for now. Feedback and suggestions are very much welcome in this issue or in the forthcoming PR that I will link to this issue.

pitkant commented 11 months ago

For even better performance with big datasets, the options would be to use packages like ff (see the bigdata2018 course materials, an example of loading big Kaggle data, an R-bloggers blog post), bigmemory (see the bigmemory overview vignette from 2010), or some database format such as RSQLite (see the RSQLite vignette). I would personally prefer the database approach.
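
As a rough sketch of the database idea (illustrative only; the cache file, table name, and column names are assumptions, not part of the package):

```r
library(DBI)
library(RSQLite)

# Illustrative chunk of tidied data; in practice chunks would come from the parser
chunk <- data.frame(
  geo    = c("FI", "SE"),
  time   = c("2020-01-01", "2020-01-01"),
  values = c(125, 310)
)

# Hypothetical cache file; a real implementation would place this in the eurostat cache dir
con <- dbConnect(RSQLite::SQLite(), file.path(tempdir(), "eurostat_cache.sqlite"))

# Append chunks as they are processed instead of holding the full dataset in RAM
dbWriteTable(con, "migr_asyappctzm", chunk, append = TRUE)

# Later, pull only the slice that is actually needed
res <- dbGetQuery(con, "SELECT * FROM migr_asyappctzm WHERE geo = 'FI'")

dbDisconnect(con)
```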

An experimental branch could be created for testing this approach but I would not include these features in 4.0.0.

pitkant commented 11 months ago

Big datasets with inadequate amounts of RAM seem to be quite torturous for the system. Tested on a cloud-based Windows 10 Enterprise machine with an Intel Xeon 6242R and 6 GB of allocated RAM: downloading and handling the aforementioned dataset without data.table failed after 20 minutes with the error "cannot allocate vector of size 614.9 Mb" (roughly the size of a single numeric column for the dataset's ~80 million values at 8 bytes each). It was actually surprising that the process was able to go on as long as it did, as it seemed to involve quite a lot of swapping to the HDD; reads and writes were pretty high while the function was running. When attempting to use the data.table functions it crashed R as well, but it crashed a lot quicker, so maybe it wastes less of your time...

Some potential improvements:

Some more general notes on the problems I encountered while working with a Virtual Machine / Windows:

Session info
```
R version 4.2.1 (2022-06-23 ucrt)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 19045)

Matrix products: default

locale:
[1] LC_COLLATE=Finnish_Finland.utf8
[2] LC_CTYPE=Finnish_Finland.utf8
[3] LC_MONETARY=Finnish_Finland.utf8
[4] LC_NUMERIC=C
[5] LC_TIME=Finnish_Finland.utf8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods
[7] base

other attached packages:
[1] eurostat_4.0.0.9005

loaded via a namespace (and not attached):
 [1] tidyselect_1.2.0   httr2_0.2.3        remotes_2.4.2.1
 [4] purrr_1.0.2        generics_0.1.3     vctrs_0.6.3
 [7] utf8_1.2.3         rlang_1.1.1        pkgbuild_1.3.1
[10] e1071_1.7-13       pillar_1.9.0       glue_1.6.2
[13] withr_2.5.1        rappdirs_0.3.3     bit64_4.0.5
[16] readxl_1.4.3       lifecycle_1.0.3    plyr_1.8.8
[19] stringr_1.5.0      cellranger_1.1.0   callr_3.7.1
[22] tzdb_0.4.0         ps_1.7.1           parallel_4.2.1
[25] curl_4.3.2         class_7.3-20       fansi_1.0.4
[28] Rcpp_1.0.11        KernSmooth_2.23-20 readr_2.1.4
[31] backports_1.4.1    classInt_0.4-10    vroom_1.6.3
[34] jsonlite_1.8.7     bit_4.0.5          countrycode_1.5.0
[37] hms_1.1.3          digest_0.6.33      stringi_1.7.12
[40] processx_3.7.0     dplyr_1.1.3        rprojroot_2.0.3
[43] bibtex_0.5.1       here_1.0.1         ISOweek_0.6-2
[46] cli_3.6.1          tools_4.2.1        magrittr_2.0.3
[49] proxy_0.4-27       tibble_3.2.1       RefManageR_1.4.0
[52] tidyr_1.3.0        crayon_1.5.2       pkgconfig_2.0.3
[55] regions_0.1.8      data.table_1.14.8  xml2_1.3.5
[58] prettyunits_1.2.0  timechange_0.2.0   lubridate_1.9.3
[61] assertthat_0.2.1   httr_1.4.7         rstudioapi_0.13
[64] R6_2.5.1           compiler_4.2.1
```
pitkant commented 11 months ago

@ake123 and I ran tests with the migr_asyappctzm dataset and were able to handle it (albeit slowly) even on a Mac with 8 GB of RAM, both with use.data.table = TRUE and with use.data.table = FALSE. However, it did not work with stringsAsFactors = FALSE, so keeping that option on is advisable.
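
For reference, a call along these lines was used; the argument names are those of the development version discussed here (4.0.0.9005), so treat the exact interface as provisional:

```r
library(eurostat)

# Provisional arguments from the development version discussed in this issue
dat <- get_eurostat(
  "migr_asyappctzm",
  use.data.table   = TRUE,  # optional data.table-based processing
  stringsAsFactors = TRUE   # factors are far lighter in memory than repeated strings
)
```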

Handling big datasets scales pretty badly: "migr_asyappctzm" has only 80,708,892 (about 80 million) values, whereas the biggest dataset in Eurostat, "ef_lsk_main", has 148,362,539 (about 148 million) values, and my computer with 16 GB of RAM was not able to handle the latter despite it having less than twice as many values. Of course it might have more columns as well; I have not checked the data file contents that thoroughly.

Using packages that allow for partial loading of datasets (such as SQLite and MonetDB) would seem like the next logical step to take. See Laura DeCicco's blog post on the US Geological Survey website for benchmarks and comparisons of different approaches: Working with pretty big data in R.
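
A minimal sketch of what partial loading could look like on top of an SQLite cache like the one sketched above (assuming the dbplyr backend is available; the table and column names are illustrative):

```r
library(DBI)
library(dplyr)   # database backends additionally require dbplyr

con <- dbConnect(RSQLite::SQLite(), file.path(tempdir(), "eurostat_cache.sqlite"))

# Lazy table reference: nothing is read into memory yet
migr <- tbl(con, "migr_asyappctzm")

# Filtering is translated to SQL and executed in the database;
# only the matching rows are pulled into R by collect()
fi_recent <- migr |>
  filter(geo == "FI", time >= "2022-01-01") |>
  collect()

dbDisconnect(con)
```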

pitkant commented 8 months ago

There is still work to be done on this issue in subsequent releases, but it has been solved with a minimum viable solution for now. Further developments will be tracked in a new issue. Closed with the CRAN release of package version 4.0.0.