r-lib / fs

Provide cross platform file operations based on libuv.
https://fs.r-lib.org/

dir_ls crashes R 4.2.1 when reading very large directories #447

Open arthurgailes opened 6 months ago

arthurgailes commented 6 months ago

Hello,

The following command crashes when reading through a directory with dozens of folders and over 600k total files. I'm not sure what the inflection point is, but it works fine in a similar directory with 100k total files. This problem does not occur in R 4.3.1.

fs::dir_ls(mydir, recurse = T)
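One way to probe the inflection point is to build a synthetic tree of a known size and list it. The sketch below assumes a POSIX shell (on Windows, something like Git Bash or WSL); `make_tree` and the `/tmp/fs-repro` path are hypothetical names, and empty files on a local disk may not reproduce a crash that depends on the real directory contents or the file system.

```shell
# make_tree ROOT N_DIRS FILES_PER_DIR  (hypothetical helper)
# Creates ROOT/dir_0001, ROOT/dir_0002, ... each holding FILES_PER_DIR empty files.
make_tree() {
  root=$1
  n_dirs=$2
  per_dir=$3
  d=1
  while [ "$d" -le "$n_dirs" ]; do
    sub=$(printf '%s/dir_%04d' "$root" "$d")
    mkdir -p "$sub"
    f=1
    while [ "$f" -le "$per_dir" ]; do
      : > "$sub/file_$f.txt"   # create an empty file
      f=$((f + 1))
    done
    d=$((d + 1))
  done
}
```

For example, `make_tree /tmp/fs-repro 60 10000` creates 600,000 files in 60 folders, after which the same `fs::dir_ls(..., recurse = TRUE)` call can be pointed at `/tmp/fs-repro` and the sizes varied.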

gaborcsardi commented 6 months ago

Can you show the output, and also the stack trace after the crash?

arthurgailes commented 6 months ago

Well, I can't get a traceback because it crashes. Here's the Rterm output:

& Rterm --no-save --no-restore --verbose -e "fs::dir_ls('my/dir/path', recurse = TRUE)"
'verbose' and 'quietly' are both true; being verbose then ..
now dyn.load("W:/R/R-4.2.1/library/methods/libs/x64/methods.dll") ...

R version 4.2.1 (2022-06-23 ucrt) -- "Funny-Looking Kid"
Copyright (C) 2022 The R Foundation for Statistical Computing
Platform: x86_64-w64-mingw32/x64 (64-bit)

R is free software and comes with ABSOLUTELY NO WARRANTY.
You are welcome to redistribute it under certain conditions.
Type 'license()' or 'licence()' for distribution details.

Natural language support but running in an English locale

R is a collaborative project with many contributors.
Type 'contributors()' for more information and
'citation()' on how to cite R or R packages in publications.

Type 'demo()' for some demos, 'help()' for on-line help, or
'help.start()' for an HTML browser interface to help.
Type 'q()' to quit R.

'verbose' and 'quietly' are both true; being verbose then ..
'verbose' and 'quietly' are both true; being verbose then ..
Garbage collection 1 = 0+0+1 (level 2) ... 12.1 Mbytes of cons cells used (35%) 2.8 Mbytes of vectors used (4%)
now dyn.load("W:/R/R-4.2.1/library/utils/libs/x64/utils.dll") ...
'verbose' and 'quietly' are both true; being verbose then ..
now dyn.load("W:/R/R-4.2.1/library/grDevices/libs/x64/grDevices.dll") ...
'verbose' and 'quietly' are both true; being verbose then ..
now dyn.load("W:/R/R-4.2.1/library/graphics/libs/x64/graphics.dll") ...
'verbose' and 'quietly' are both true; being verbose then ..
now dyn.load("W:/R/R-4.2.1/library/stats/libs/x64/stats.dll") ...
ending setup_Rmainloop(): R_Interactive = 0 {main.c}

R_ReplConsole(): before "for(;;)" {main.c}
fs::dir_ls('my/dir/path', recurse = TRUE)
now dyn.load("W:/R/R-4.2.1/library/fs/libs/x64/fs.dll") ...
Garbage collection 2 = 1+0+1 (level 0) ... 15.4 Mbytes of cons cells used (45%) 3.5 Mbytes of vectors used (5%)
Garbage collection 3 = 2+0+1 (level 0) ... 17.7 Mbytes of cons cells used (51%) 6.5 Mbytes of vectors used (10%)
Garbage collection 4 = 3+0+1 (level 0) ... 21.4 Mbytes of cons cells used (62%) 11.5 Mbytes of vectors used (18%)
Garbage collection 5 = 4+0+1 (level 0) ... 24.3 Mbytes of cons cells used (71%) 16.0 Mbytes of vectors used (25%)
Garbage collection 6 = 5+0+1 (level 0) ... 26.5 Mbytes of cons cells used (77%) 18.7 Mbytes of vectors used (29%)
Garbage collection 7 = 6+0+1 (level 0) ... 28.3 Mbytes of cons cells used (82%) 20.8 Mbytes of vectors used (32%)
Garbage collection 8 = 6+1+1 (level 1) ... 29.6 Mbytes of cons cells used (86%) 22.7 Mbytes of vectors used (35%)
Garbage collection 9 = 6+1+2 (level 2) ... 30.6 Mbytes of cons cells used (44%) 24.0 Mbytes of vectors used (37%)
Garbage collection 10 = 7+1+2 (level 0) ... 39.4 Mbytes of cons cells used (56%) 34.7 Mbytes of vectors used (54%)
Garbage collection 11 = 8+1+2 (level 0) ... 46.3 Mbytes of cons cells used (66%) 48.0 Mbytes of vectors used (75%)
Garbage collection 12 = 9+1+2 (level 0) ... 51.6 Mbytes of cons cells used (73%) 54.5 Mbytes of vectors used (85%)
Garbage collection 13 = 9+2+2 (level 1) ... 55.8 Mbytes of cons cells used (79%) 59.5 Mbytes of vectors used (93%)
Garbage collection 14 = 9+2+3 (level 2) ... 59.0 Mbytes of cons cells used (47%) 60.9 Mbytes of vectors used (68%)
Garbage collection 15 = 10+2+3 (level 0) ... 73.6 Mbytes of cons cells used (59%) 88.7 Mbytes of vectors used (100%)
Garbage collection 16 = 10+2+4 (level 2) ... 73.9 Mbytes of cons cells used (59%) 84.1 Mbytes of vectors used (72%)
Garbage collection 17 = 11+2+4 (level 0) ... 85.2 Mbytes of cons cells used (68%) 97.8 Mbytes of vectors used (84%)
Garbage collection 18 = 11+3+4 (level 1) ... 94.0 Mbytes of cons cells used (75%) 108.4 Mbytes of vectors used (93%)
Garbage collection 19 = 11+3+5 (level 2) ... 100.8 Mbytes of cons cells used (49%) 116.7 Mbytes of vectors used (75%)
Garbage collection 20 = 12+3+5 (level 0) ... 123.7 Mbytes of cons cells used (61%) 148.5 Mbytes of vectors used (95%)
Garbage collection 21 = 12+3+6 (level 2) ... 15.4 Mbytes of cons cells used (9%) 7.0 Mbytes of vectors used (6%)

gaborcsardi commented 6 months ago

Sorry, I meant a stack trace from a low-level debugger, like gdb or drmingw.
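On a Linux-like system with gdb installed, a native stack trace can be pulled out of a core file without an interactive session. This is a minimal sketch, not a recommended procedure from the fs maintainers: `native_backtrace` is a hypothetical helper, and where the core file ends up (and whether core dumps are enabled at all, e.g. via `ulimit -c unlimited`) depends on the system configuration.

```shell
# native_backtrace BINARY CORE  (hypothetical helper)
# Prints the stack of the crashed process recorded in CORE.
native_backtrace() {
  binary=$1
  core=$2
  if [ ! -r "$core" ]; then
    echo "core file not readable: $core" >&2
    return 1
  fi
  # -batch makes gdb exit after running the -ex commands; "bt" prints the backtrace.
  gdb -batch -ex "bt" "$binary" "$core"
}
```

The binary argument is the actual R executable that crashed rather than the `R` wrapper script. Frames inside R itself only resolve to function names and line numbers if R was built with debugging symbols.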

mdsumner commented 3 months ago

We have a similar problem. It has been occurring since February 2024 but went undetected because the output wasn't so important, and we've now spent a few days trying to find out why it's failing.

My example is on R 4.4.0 with fs_1.6.4; we have also had this issue on a machine running R 4.3.2 (failing since February, afaict).

 f <- fs::dir_ls("/rdsi/PUBLIC/raad/data", recurse = TRUE, type = c("file", "symlink"))

 *** caught segfault ***
address 0x7f376fc11010, cause 'memory not mapped'

Traceback:
 1: dir_map(old, identity, all, recurse, type, fail)
 2: fs::dir_ls("/rdsi/PUBLIC/raad/data", recurse = TRUE, type = c("file",     "symlink"))

Possible actions:
1: abort (with core dump, if enabled)
2: normal R exit
3: exit R without saving workspace
4: exit R saving workspace

The number of results is 1,189,442 (determined with 'find -type f'). The largest number of files in one dir is 20030, and the total size of files is ~33 TB. The output of find saved as a text file is 134 MB.
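For anyone wanting to gather the same figures for their own tree, the counts above can be approximated with standard tools. This is a sketch under assumptions: `count_files` and `largest_dir` are hypothetical helper names, and file names containing newlines would be miscounted.

```shell
# count_files ROOT: total number of regular files below ROOT.
count_files() {
  find "$1" -type f | wc -l | tr -d ' '
}

# largest_dir ROOT: prints "<count> <directory>" for the directory
# that directly contains the most regular files.
largest_dir() {
  find "$1" -type f |
    sed 's|/[^/]*$||' |   # strip the file name, keep the parent directory
    sort | uniq -c |      # count files per directory
    sort -rn | head -n 1 |
    sed 's/^ *//'         # drop the left padding added by uniq -c
}
```

On a tree of this size over a network mount, each call walks the whole directory once, so it may take a while.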

This is an NFS mount. This listing via fs::dir_ls() has been working on this mount for several years, with this path having >1e6 files for at least two years.

Another path, listed the same way just before the segfaulting one, has ~225000 files in total and has always worked.
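To check whether a single subtree is enough to trigger the crash, the listing can be run once per top-level subdirectory, each in its own process so that a segfault only kills that child. A rough sketch, with `each_subdir` as a hypothetical helper; if the crash depends on the total number of entries rather than on one subtree, every subdirectory may well pass.

```shell
# each_subdir ROOT CMD [ARGS...]  (hypothetical helper)
# Runs "CMD ARGS... <subdir>" for every immediate subdirectory of ROOT
# and reports whether the command exited with status zero.
each_subdir() {
  root=$1
  shift
  for sub in "$root"/*/; do
    [ -d "$sub" ] || continue          # skip the literal pattern if nothing matched
    if "$@" "$sub" >/dev/null 2>&1; then
      echo "ok   $sub"
    else
      echo "FAIL $sub"
    fi
  done
}
```

A possible invocation, assuming Rscript is on the PATH: `each_subdir /rdsi/PUBLIC/raad/data Rscript -e 'invisible(fs::dir_ls(commandArgs(TRUE)[1], recurse = TRUE))'`.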

I'm working on a gdb analysis with wch/r-debug 🙏

raymondben commented 3 months ago

@mdsumner with fs v1.6.3, running d <- fs::dir_ls(recurse = TRUE) from my home directory crashes in RStudio but runs fine from R in a terminal console. Updating to the latest GitHub version of fs (1.6.4.9000) fixed it in both.

mdsumner commented 3 months ago

Excellent, that fixes my problem.

raymondben commented 2 months ago

@gaborcsardi this issue has popped up again. It's happening consistently with a particular directory (same as the one @mdsumner gave details on above).

The specific command that causes the error is:

> f <- dir_ls("/rdsi/PUBLIC/raad/data", recurse = TRUE)

 *** caught segfault ***
address 0x7f1a44f59010, cause 'memory not mapped'

Traceback:
 1: dir_map(old, identity, all, recurse, type, fail)
 2: dir_ls("/rdsi/PUBLIC/raad/data", recursive = T)

If I run the whole script via Rscript, it gives a slightly different error:

Error in fs::dir_ls(roots[i], all = TRUE, recurse = TRUE, type = c("file",  :
  Value of SET_STRING_ELT() must be a 'CHARSXP' not a 'NULL'
Calls: <Anonymous> -> <Anonymous>
Execution halted

(but I think that's a red herring: if I run the actual dir_ls command through Rscript I get the first segfault/memory not mapped message again)

> sessionInfo()
R version 4.4.1 (2024-06-14)
Platform: x86_64-pc-linux-gnu
Running under: Ubuntu 22.04.4 LTS

Matrix products: default
BLAS:   /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.10.0
LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.10.0

locale:
 [1] LC_CTYPE=en_AU.UTF-8       LC_NUMERIC=C
 [3] LC_TIME=en_AU.UTF-8        LC_COLLATE=en_AU.UTF-8
 [5] LC_MONETARY=en_AU.UTF-8    LC_MESSAGES=en_AU.UTF-8
 [7] LC_PAPER=en_AU.UTF-8       LC_NAME=C
 [9] LC_ADDRESS=C               LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_AU.UTF-8 LC_IDENTIFICATION=C

time zone: Etc/UTC
tzcode source: system (glibc)

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

other attached packages:
[1] fs_1.6.4.9000

loaded via a namespace (and not attached):
[1] compiler_4.4.1

(The same error also happens on a machine running R 4.3 that mounts the same directory over sshfs rather than as a direct NFS mount.)

The core dump - I realize that this is from a standard R binary, not R-debug, but throwing it in with the hope that it might be useful:

$ gdb /usr/lib/R/bin/exec/R /var/lib/apport/coredump/core._usr_lib_R_bin_exec_R.1000.e1a118ec-6c6a-46cd-a936-12d14c723d41.99330.42401020
GNU gdb (Ubuntu 12.1-0ubuntu1~22.04.2) 12.1
Copyright (C) 2022 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.
Type "show copying" and "show warranty" for details.
This GDB was configured as "x86_64-linux-gnu".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<https://www.gnu.org/software/gdb/bugs/>.
Find the GDB manual and other documentation resources online at:
    <http://www.gnu.org/software/gdb/documentation/>.

For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from /usr/lib/R/bin/exec/R...
(No debugging symbols found in /usr/lib/R/bin/exec/R)
[New LWP 99330]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
Core was generated by `/usr/lib/R/bin/exec/R'.
Program terminated with signal SIGSEGV, Segmentation fault.
#0  __pthread_kill_implementation (no_tid=0, signo=11, threadid=140348209796096) at ./nptl/pthread_kill.c:44
44      ./nptl/pthread_kill.c: No such file or directory.
(gdb) bt
#0  __pthread_kill_implementation (no_tid=0, signo=11, threadid=140348209796096) at ./nptl/pthread_kill.c:44
#1  __pthread_kill_internal (signo=11, threadid=140348209796096) at ./nptl/pthread_kill.c:78
#2  __GI___pthread_kill (threadid=140348209796096, signo=signo@entry=11) at ./nptl/pthread_kill.c:89
#3  0x00007fa55d34c476 in __GI_raise (sig=11) at ../sysdeps/posix/raise.c:26
#4  <signal handler called>
#5  0x00007fa55d5d7de2 in Rf_getAttrib () from /usr/lib/R/lib/libR.so
#6  0x00007fa55d5ecec1 in Rf_xlengthgets () from /usr/lib/R/lib/libR.so
#7  0x00007fa559f9e073 in CollectorList::push_back (x=0x55ae21298270, this=<optimized out>)
    at /tmp/RtmpnG1MnF/R.INSTALL166cd1e1f0cbd/fs/src/CollectorList.h:19
#8  dir_map (fun=0x55ae159a8068,
    path=0x55ae15552600 "/rdsi/PUBLIC/raad/data/n5eil01u.ecs.nsidc.org/PM/NSIDC-0051.002/2023.02.09", all=true,
    file_type=-1, recurse=2147483643, value=<optimized out>, fail=true) at dir.cc:116
#9  0x00007fa559f9daef in dir_map (fun=0x55ae159a8068,
    path=0x55ae1de1c1e0 "/rdsi/PUBLIC/raad/data/n5eil01u.ecs.nsidc.org/PM/NSIDC-0051.002", all=true, file_type=-1,
    recurse=2147483644, value=<optimized out>, fail=true) at /usr/include/c++/11/bits/basic_string.h:194
#10 0x00007fa559f9daef in dir_map (fun=0x55ae159a8068,
    path=0x55ae16ad9250 "/rdsi/PUBLIC/raad/data/n5eil01u.ecs.nsidc.org/PM", all=true, file_type=-1,
    recurse=2147483645, value=<optimized out>, fail=true) at /usr/include/c++/11/bits/basic_string.h:194
#11 0x00007fa559f9daef in dir_map (fun=0x55ae159a8068,
    path=0x55ae1dec5d40 "/rdsi/PUBLIC/raad/data/n5eil01u.ecs.nsidc.org", all=true, file_type=-1, recurse=2147483646,
    value=<optimized out>, fail=true) at /usr/include/c++/11/bits/basic_string.h:194
#12 0x00007fa559f9daef in dir_map (fun=0x55ae159a8068, path=0x55ae17439378 "/rdsi/PUBLIC/raad/data", all=true,
    file_type=-1, recurse=2147483647, value=<optimized out>, fail=true) at /usr/include/c++/11/bits/basic_string.h:194
#13 0x00007fa559f9e40d in fs_dir_map_ (path_sxp=0x55ae17186ac0, fun_sxp=0x55ae159a8068, all_sxp=0x55ae154e1c28,
    type_sxp=0x55ae171869e0, recurse_sxp=0x55ae17186970, fail_sxp=0x55ae17177b60) at dir.cc:148
#14 0x00007fa55d63727a in ?? () from /usr/lib/R/lib/libR.so
#15 0x00007fa55d67a688 in ?? () from /usr/lib/R/lib/libR.so
#16 0x00007fa55d68e19d in ?? () from /usr/lib/R/lib/libR.so
#17 0x00007fa55d68e50b in Rf_eval () from /usr/lib/R/lib/libR.so
#18 0x00007fa55d6906df in ?? () from /usr/lib/R/lib/libR.so
#19 0x00007fa55d6914c7 in ?? () from /usr/lib/R/lib/libR.so
#20 0x00007fa55d68e63c in Rf_eval () from /usr/lib/R/lib/libR.so
#21 0x00007fa55d693752 in ?? () from /usr/lib/R/lib/libR.so
#22 0x00007fa55d68e936 in Rf_eval () from /usr/lib/R/lib/libR.so
#23 0x00007fa55d6c4cea in Rf_ReplIteration () from /usr/lib/R/lib/libR.so
#24 0x00007fa55d6c5080 in ?? () from /usr/lib/R/lib/libR.so
#25 0x00007fa55d6c5140 in run_Rmainloop () from /usr/lib/R/lib/libR.so
#26 0x000055ae13eb409f in main ()
#27 0x00007fa55d333d90 in __libc_start_call_main (main=main@entry=0x55ae13eb4080 <main>, argc=argc@entry=1,
--Type <RET> for more, q to quit, c to continue without paging--