nickmckay / LiPD-utilities

Input/output and manipulation utilities for LiPD files in Matlab, R and Python
http://nickmckay.github.io/LiPD-utilities/
GNU General Public License v2.0

R: readLipd() fails to read a directory if number of files is large #83

Closed: oliverbothe closed this issue 3 years ago

oliverbothe commented 3 years ago

What: readLipd() fails to read a directory containing a large number of LiPD files, or LiPD files that together contain a large number of CSVs.

Where: R version 3.6.3 (2020-02-29); Platform: x86_64-pc-linux-gnu (64-bit); Running under: Ubuntu 18.04.5 LTS

lipdR_0.2.3 readr_2.0.1 vroom_1.5.4

Why: Deep down, readLipd() uses readr::read_csv(), which in turn uses vroom::vroom(). vroom has a flag that controls whether it uses the ALTREP framework (available since R 3.5, if I remember correctly). If I understand it correctly, with ALTREP enabled, the files opened by readr::read_csv() are only attached (lazily mapped) for later access rather than read fully into R, so their file handles stay open. Reading many files this way can easily exceed the system's limit on open file descriptors.
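To illustrate (folder and file names are made up): with lazy reading in effect, something like the following keeps every CSV open behind its tibble until the data are fully materialised:

library(readr)                                                          # readr 2.0.x reads lazily via vroom by default
files <- list.files("csv_dump", pattern = "\\.csv$", full.names = TRUE) # hypothetical directory of CSVs
tables <- lapply(files, read_csv)                                       # each lazily-read file stays mapped/open
# A few hundred such files can already hit the default soft limit of 1024 descriptors.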

Potential solutions:

Pass lazy = FALSE to readr::read_csv() (or otherwise keep vroom from reading lazily via ALTREP) wherever readLipd() reads CSVs; see the sketch below. Alternatively, raise the user's soft limit on open file descriptors.

I am not sure whether other functions within the R part of the LiPDverse rely on vroom, or on functions that in turn use vroom.
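A minimal sketch of the first option (the file name is hypothetical; lazy is available in readr >= 2.0 and altrep in current vroom versions):

d1 <- readr::read_csv("some_table.csv", lazy = FALSE)  # read eagerly, so the file is closed afterwards
d2 <- vroom::vroom("some_table.csv", altrep = FALSE)   # or, one level lower, disable ALTREP in vroom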

Disclaimer: Everything here is to the best of my understanding - which may be wrong.

nickmckay commented 3 years ago

Thanks for this issue Oliver, I'll look into it. This is a new one. How many files does it take to trigger this issue? I routinely use this function to load many hundreds of files, so it may be a Linux-specific issue. I'd like to try to replicate it on my end if possible.

Also, for a few reasons, I've moved active development of lipdR to nickmckay/lipdR as I move towards a CRAN submission. Changes on the main branch are minor so far, but this part of the repo will be archived and phased out in the near future.

oliverbothe commented 3 years ago

In my case it took about 1021 files (I am not sure exactly how the file connections are counted), since the soft limit on my maximum number of open file descriptors is set to the default of 1024. Raising that limit on one's machine should also work around the problem.

I had two test scenarios where the issue occurred. One of those was the full globalHolocene1_0_0 collection. Using that in a single readLipd() should reproduce the problem (if the user limits are comparable).

nickmckay commented 3 years ago

Hmm,

D <- readLipd("https://lipdverse.org/globalHolocene/current_version/globalHolocene1_0_0.zip")

That successfully loaded 1383 datasets for me on my Mac and on our Linux cluster.

It's certainly possible that it's an OS or user-setting issue, but I'd be surprised if it's an open-file-connections issue, since readLipd() loops through the files and opens them one at a time, and it should close each one after it's loaded. Can you post the error message that you're getting?
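For context, the loading pattern is roughly this (a simplified sketch, not the actual lipdR source; the names are illustrative):

for (f in lpd_files) {                                   # one .lpd archive at a time
  tmp <- file.path(tempdir(), basename(f))
  unzip(f, exdir = tmp)                                  # a .lpd file is a zip container
  csvs <- list.files(tmp, pattern = "\\.csv$", recursive = TRUE, full.names = TRUE)
  tables <- lapply(csvs, readr::read_csv)                # lazy by default in readr 2.0.x
  unlink(tmp, recursive = TRUE)                          # the extracted files are deleted here,
}                                                        # but lazily-read CSVs can remain open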

oliverbothe commented 3 years ago

Interesting, but for me it also fails with the URL. Until now I had only tried it with a downloaded copy.

The error messages varied depending on exactly what I did. The one I get with the URL is:

 "reading: igelsjon..2003.lpd"                                                                                  
[1] "Error: lipd_read: Error in find_data_dir(): Error: Unable to find the 'data' directory in the LiPD file\n"
[1] "reading: iglutalik.Davis.1980.lpd"
[1] "Error: lipd_read: Error in find_data_dir(): Error: Unable to find the 'data' directory in the LiPD file\n"

and so on. When I end the session, I further get

Save workspace image? [y/n/c]: n
rm: Traversal fehlgeschlagen: /home/$name/Rtmp//RtmpSNiVDT/filefca350b3758: Zu viele offene Dateien

which means "traversal failed: too many open files" and pointed me in the direction of my solution. In another case the error message differed, but on ending the session I got sh: error while loading shared libraries: libc.so.6: cannot open shared object file: Error 24, which also means "too many open files" (error 24 is EMFILE).

As I said, it quite likely also hinges on the user limits.

Thus, I assume if you check ulimit -Sn you get a number larger than 1024?

And what do you see when you check lsof -c R (or lsof -c rsession, if you are in RStudio) while the session is still running? I see roughly 1021 open files, most of which have the status (deleted).

That is, yes, readLipd() loops through the files and through the CSVs, but as long as vroom() uses the ALTREP framework it only attaches the connections and does not close them until D is deleted completely. See also this older comment on the vroom GitHub repository.
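A rough Linux-only way to watch this from inside R, by counting the process's open file descriptors much as lsof -c R does (the CSV name is made up):

open_fds <- function() length(list.files("/proc/self/fd"))  # Linux /proc only
open_fds()                                                  # baseline
d <- readr::read_csv("some_large_table.csv")                # lazy read in readr 2.0.x
open_fds()                                                  # should now report one more open descriptor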

nickmckay commented 3 years ago

OK, it seems like you are several steps ahead of me here. I'll try to dig in and find solutions. Interestingly, on my Mac ulimit -Sn returns 256, but it happily loaded >1300 files. More soon.

oliverbothe commented 3 years ago

Thanks. The proposed solution of passing lazy = FALSE to read_csv() worked for me when I stepped through the code.

I am surprised by the differences in behavior. Are your R versions still < 3.5?

nickmckay commented 3 years ago

OK, it looks like the issue arose with the advent of readr 2.0.0. Earlier versions don't have this problem, perhaps because they didn't use vroom. I think this is fixed in version 0.2.4, which is available at github.com/nickmckay/lipdR.
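If it helps, one way to install that version (assuming the remotes package is available; the fix lives on the default branch of nickmckay/lipdR):

remotes::install_github("nickmckay/lipdR")  # installs the development version containing the fix
library(lipdR)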

nickmckay commented 3 years ago

Thanks for bringing this to my attention, providing the solution, and walking me through it!

oliverbothe commented 3 years ago

Thank you for the quick implementation. Looked good on a first test.