hpc / mpifileutils

File utilities designed for scalability and performance.
https://hpc.github.io/mpifileutils
BSD 3-Clause "New" or "Revised" License

segfault crash when using --sort option for dwalk #552

Closed markmoe19 closed 9 months ago

markmoe19 commented 10 months ago

I am using dwalk (v0.11.1) to walk ~150M files. The walk crashes if I use the "--sort name" option but runs to completion if I don't use --sort. See the backtrace below. This is using the DTCMP that comes with mpifileutils v0.11.1.

It might be a combination of --sort and a large number of files; I am looking into that. The crash also happens when using the .mfu file as input to create a sorted text output.

This looks like it will work well for us, but the "--sort name" option is important for our reporting.

Thanks in advance for your help,

[2023-09-08T08:20:31] Walked 260638549 items in 3446.111 secs (75632.672 items/sec) ...
[2023-09-08T08:20:32] Walked 260638549 items in 3446.560 seconds (75622.803 items/sec)
[luna-0390:555052:0:555052] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x7f5266f50000)
BFD: DWARF error: section .debug_info is larger than its filesize! (0x93ef57 vs 0x530ea0)
BFD: DWARF error: section .debug_info is larger than its filesize! (0x93ef57 vs 0x530ea0)
BFD: DWARF error: section .debug_info is larger than its filesize! (0x93ef57 vs 0x530ea0)
BFD: DWARF error: section .debug_info is larger than its filesize! (0x93ef57 vs 0x530ea0)
BFD: DWARF error: section .debug_info is larger than its filesize! (0x93ef57 vs 0x530ea0)
BFD: DWARF error: section .debug_info is larger than its filesize! (0x93ef57 vs 0x530ea0)
==== backtrace (tid: 555052) ====
 0 0x0000000000043090 killpg() ???:0
 1 0x000000000018b9c9 __nss_database_lookup() ???:0
 2 0x00000000000610a3 dtcmp_merge_local_2way_memcpy() ???:0
 3 0x0000000000063ab1 dtcmp_sort_local_mergesort_scratch() dtcmp_sort_local_mergesort.c:0
 4 0x0000000000063be0 DTCMP_Sort_local_mergesort() ???:0
 5 0x000000000005b919 DTCMP_Sort_local() ???:0
 6 0x00000000000678e9 DTCMP_Sortv_cheng_lwgrp() ???:0
 7 0x0000000000067aba DTCMP_Sortv_cheng() ???:0
 8 0x000000000005bcb6 DTCMP_Sortv() ???:0
 9 0x000000000005beab DTCMP_Sortz() ???:0
10 0x0000000000040996 sort_files_stat() mfu_flist_sort.c:0
11 0x0000000000040bf0 mfu_flist_sort() ???:0
12 0x0000000000003e09 main() ???:0
13 0x0000000000024083 __libc_start_main() ???:0
14 0x00000000000026ee _start() ???:0

[luna-0390:555052] Process received signal
[luna-0390:555052] Signal: Segmentation fault (11)
[luna-0390:555052] Signal code: (-6)
[luna-0390:555052] Failing at address: 0x8782c
[luna-0390:555052] [ 0] /lib/x86_64-linux-gnu/libc.so.6(+0x43090)[0x7f5292b83090]
[luna-0390:555052] [ 1] /lib/x86_64-linux-gnu/libc.so.6(+0x18b9c9)[0x7f5292ccb9c9]
[luna-0390:555052] [ 2] /project/selene-admin/mpifileutils/mpifileutils-v0.11.1/build/libmfu.so.4.0.0(dtcmp_merge_local_2way_memcpy+0x128)[0x7f52930240a3]
[luna-0390:555052] [ 3] /project/selene-admin/mpifileutils/mpifileutils-v0.11.1/build/libmfu.so.4.0.0(+0x63ab1)[0x7f5293026ab1]
[luna-0390:555052] [ 4] /project/selene-admin/mpifileutils/mpifileutils-v0.11.1/build/libmfu.so.4.0.0(DTCMP_Sort_local_mergesort+0xf0)[0x7f5293026be0]
[luna-0390:555052] [ 5] /project/selene-admin/mpifileutils/mpifileutils-v0.11.1/build/libmfu.so.4.0.0(DTCMP_Sort_local+0xf8)[0x7f529301e919]
[luna-0390:555052] [ 6] /project/selene-admin/mpifileutils/mpifileutils-v0.11.1/build/libmfu.so.4.0.0(DTCMP_Sortv_cheng_lwgrp+0x1a1)[0x7f529302a8e9]
[luna-0390:555052] [ 7] /project/selene-admin/mpifileutils/mpifileutils-v0.11.1/build/libmfu.so.4.0.0(DTCMP_Sortv_cheng+0x7e)[0x7f529302aaba]
[luna-0390:555052] [ 8] /project/selene-admin/mpifileutils/mpifileutils-v0.11.1/build/libmfu.so.4.0.0(DTCMP_Sortv+0x1c2)[0x7f529301ecb6]
[luna-0390:555052] [ 9] /project/selene-admin/mpifileutils/mpifileutils-v0.11.1/build/libmfu.so.4.0.0(DTCMP_Sortz+0x1db)[0x7f529301eeab]
[luna-0390:555052] [10] /project/selene-admin/mpifileutils/mpifileutils-v0.11.1/build/libmfu.so.4.0.0(+0x40996)[0x7f5293003996]
[luna-0390:555052] [11] /project/selene-admin/mpifileutils/mpifileutils-v0.11.1/build/libmfu.so.4.0.0(mfu_flist_sort+0xa6)[0x7f5293003bf0]
[luna-0390:555052] [12] /project/selene-admin/mpifileutils/mpifileutils-v0.11.1/build/mpifileutils/src/dwalk/dwalk(main+0xb8f)[0x55d6b0292e09]
[luna-0390:555052] [13] /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf3)[0x7f5292b64083]
[luna-0390:555052] [14] /project/selene-admin/mpifileutils/mpifileutils-v0.11.1/build/mpifileutils/src/dwalk/dwalk(_start+0x2e)[0x55d6b02916ee]
[luna-0390:555052] End of error message

adammoody commented 10 months ago

Thanks, @markmoe19. I'd like to find and fix the underlying problem. It's not immediately clear what the cause is.

Does it fail for other sort options like --sort size, or is it unique to --sort name?

I see it's printing a stack trace at the point of the segfault. It would also help to include line numbers. Does it still fail if you build in debug mode with -DCMAKE_BUILD_TYPE=Debug?

markmoe19 commented 10 months ago

I was able to reproduce the crash with the debug option. It looks like the crash is tied to using 64 CPUs across 2 nodes rather than 64 CPUs on 1 node. See the attached debug.txt file.

markmoe19 commented 10 months ago

Not sure if this matters, but we have some files with \n and/or \r in the actual file name. dwalk seems to output those ok (with the \n causing a line break as expected). That is probably not the issue in this case, but I wanted to mention the wild characters that might be in our filenames.

markmoe19 commented 10 months ago

@adammoody The crash does not happen with --sort size, only with --sort name, as shown in the debug.txt attachment above.

adammoody commented 10 months ago

Thanks, @markmoe19. The line numbers help clarify the problematic code path. I'll see if that's enough. I may come back and ask you to add some printf statements to get more debug info.

adammoody commented 9 months ago

I haven't spotted anything obvious in the code, and I can't get this to segfault in my testing so far.

I'm working up a branch of DTCMP with some printf statements in various spots to get more info. When you have a chance, I'd like to have you run with this debug build. I'll post some instructions on how to build with that next week.

adammoody commented 9 months ago

@markmoe19 , I suspect the problematic code is more likely to be in DTCMP. Before we take that step, can you reproduce the segfault after making the changes below to add a couple printf statements to sort_files_stat() in src/common/mfu_flist_sort.c of mpiFileUtils?

diff --git a/src/common/mfu_flist_sort.c b/src/common/mfu_flist_sort.c
index effb80a..1de69d2 100644
--- a/src/common/mfu_flist_sort.c
+++ b/src/common/mfu_flist_sort.c
@@ -265,6 +265,11 @@ static mfu_flist sort_files_stat(const char* sortfields, mfu_flist flist)
     MPI_Comm_rank(MPI_COMM_WORLD, &rank);
     MPI_Comm_size(MPI_COMM_WORLD, &ranks);

+    uint64_t global_size = mfu_flist_global_size(flist);
+    printf("%d: local_size=%d global_size=%d chars=%d\n",
+        rank, (int)incount, (int)global_size, (int)chars);
+    fflush(stdout);
+
     /* build type for file path */
     MPI_Datatype dt_filepath, dt_user, dt_group;
     MPI_Type_contiguous((int)chars,       MPI_CHAR, &dt_filepath);
@@ -529,6 +534,10 @@ static mfu_flist sort_files_stat(const char* sortfields, mfu_flist flist)
         idx++;
     }

+    printf("%d: key_extent=%d, keysat_extent=%d, bufsize=%d exp=%d\n",
+        rank, (int)key_extent, (int)keysat_extent, (int)(sortptr - (char*)sortbuf), (int)(sortbufsize));
+    fflush(stdout);
+
     /* sort data */
     void* outsortbuf;
     int outsortcount;

With this, each rank should print a couple of messages during a dwalk --sort name run. This is to help verify that the input buffer is sized correctly based on the list and the MPI derived datatypes.

markmoe19 commented 9 months ago

snippet.txt

The new crash output is attached. I happened to run with "--sort size" first and it did not crash (as expected). The attached output, though, is from "--sort name", which did cause the crash, also as expected. Debug mode was enabled and your extra printf statements were added. Thanks.

adammoody commented 9 months ago

Ok, thanks. That all looks reasonable, and in fact, I think it provided a great clue.

I noticed that it's printing some negative values for the size of the buffer. That's because I mistakenly used an int datatype in the debug printf statements. However, that also pointed out that you are using some large input buffers and that DTCMP might also have an overflow bug. That indeed looks to be the case:

https://github.com/LLNL/dtcmp/blob/dfd514b04f9b7fd492aea8a2f8db811a4b314f00/src/dtcmp_merge_2way.c#L47-L53

Are you installing DTCMP by hand or using another method like Spack?

If you are installing by hand, can you edit src/dtcmp_merge_2way.c to replace the int in these two int remainder = ... lines with size_t types, rebuild DTCMP, and try the dwalk --sort again with the modified DTCMP library?

If you are not yet installing by hand, I can provide some instructions on how to do that.

BTW, I've optimistically got a PR ready to go: https://github.com/LLNL/dtcmp/pull/17
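To illustrate the class of bug (a standalone sketch with made-up numbers, not the actual DTCMP source), a byte count that is squeezed into an int wraps once a rank's buffer passes 2 GiB, while a size_t holds the true value:

/* sketch only: why an "int remainder = ..." style byte count breaks for
 * large sort buffers; count and extent below are hypothetical values */
#include <stdio.h>
#include <stddef.h>

int main(void)
{
    size_t count  = 10000000;  /* hypothetical elements on one rank */
    size_t extent = 4096;      /* hypothetical bytes per element    */

    int    bytes_int  = (int)(count * extent); /* truncated: typically wraps negative on 64-bit Linux */
    size_t bytes_size = count * extent;        /* full-width result */

    printf("int:    %d\n",  bytes_int);   /* prints a negative value */
    printf("size_t: %zu\n", bytes_size);  /* prints 40960000000      */
    return 0;
}

The negative bufsize values in your debug output are the same truncation showing up in the printf casts.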

markmoe19 commented 9 months ago

I'm using build instructions from https://mpifileutils.readthedocs.io/en/v0.11.1/build.html

DTCMP is included in the tarball from "wget https://github.com/hpc/mpifileutils/releases/download/v0.11.1/mpifileutils-v0.11.1.tgz" and expands into the folder mpifileutils-v0.11.1/dtcmp.

Just to be sure, are you saying that in the file dtcmp_merge_2way.c, I need to replace "int remainder" with "size_t remainder"? Thanks

adammoody commented 9 months ago

Ok, good. That distribution builds DTCMP and mpiFileUtils all in one shot, so that simplifies things.

Yes, you got it. Go ahead and make those two int --> size_t changes in dtcmp_merge_2way.c and rebuild.

In the meantime, since I now have a better idea of the data sizes involved, I'll try again to reproduce the segfault here.

adammoody commented 9 months ago

It took some trial and error to find a configuration that used enough memory without using so much as to OOM, but I was able to reproduce the segfault (with int) and then verify that the DTCMP fix (with size_t) resolves it in my case. I went ahead and merged https://github.com/LLNL/dtcmp/pull/17 into DTCMP, which will be packaged with the next mpiFileUtils release.

I'd still like to know whether the fix works for you, especially since you could use it as a workaround until the next release is stamped.

markmoe19 commented 9 months ago

I can confirm the size_t fix resolves the --sort name issue for me! Thanks! A snippet of the output is attached (snippet.txt). It takes some 50 minutes and a lot of RAM to complete the sort; it is 540M files and many have really long paths.

markmoe19 commented 9 months ago

It uses 1.8TB of RAM on each of 2 nodes when I sort the data by name! Each node has 2.0TB of RAM, so it just fits. When I don't sort the data, these jobs typically take 266GB of RAM on 1 node.

adammoody commented 9 months ago

Great! Glad that we figured that out.

I'm sure the sort operation in DTCMP could be optimized further -- DTCMP is not intentionally slow, but it was written more for functionality than performance. For one, I think it's doing a bunch of intermediate string copies using the current algorithm. It would probably help to modify the elements to record the pointer to the string rather than a copy of the string itself. The strings could then be rearranged once at the end after fully sorting.
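As a rough local-only illustration of that idea (a plain qsort on a single process, not DTCMP's parallel sort; the paths are made up), sorting an array of pointers and then emitting the strings once at the end avoids shuffling the wide string records around during the sort:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* compare two elements of a char* array by the strings they point to */
static int cmp_path(const void* a, const void* b)
{
    const char* sa = *(const char* const*)a;
    const char* sb = *(const char* const*)b;
    return strcmp(sa, sb);
}

int main(void)
{
    /* hypothetical file paths standing in for flist entries */
    const char* paths[] = { "/data/b", "/data/a/really/long/path", "/data/c" };
    size_t n = sizeof(paths) / sizeof(paths[0]);

    /* sort the small pointers, not copies of the full path strings */
    qsort(paths, n, sizeof(paths[0]), cmp_path);

    /* one pass at the end to write the strings out in sorted order */
    for (size_t i = 0; i < n; i++) {
        printf("%s\n", paths[i]);
    }
    return 0;
}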

Having said that, it is using a parallel sort. If you have access to more resources, it should run faster by using more procs/nodes.

You can go ahead and drop those debug printf statements we added. I don't think we need those any longer.

adammoody commented 9 months ago

And I think you've already mentioned doing this, but for testing, you can break the walk and sort into two steps:

srun -n64 -N2 dwalk --output unsorted.mfu /path/to/walk
srun -n256 -N8 dwalk --input unsorted.mfu --sort name --output sorted.mfu

This lets you try different sort configurations without having to walk again.

markmoe19 commented 9 months ago

Right, I normally do split the dwalk for .mfu file generation from the dwalk that generates the text file from the .mfu file. I keep the .mfu files for 7 days back and rotate them out after that. Useful for future, faster dwalk and dfind runs, thanks!

markmoe19 commented 9 months ago

It scales well; 4 nodes take about half the time.

2 nodes, 32 procs per node = 540M files walked in 7282s, sorted in 3067s, wrote to text output file in 62s
4 nodes, 32 procs per node = 542M files walked in 3755s, sorted in 1246s, wrote to text output file in 39s

The different total file count is just yesterday versus today.

adammoody commented 9 months ago

Ok, looks good. Thanks for sharing the performance numbers. That's quite the set of files to be working with.

I'll go ahead and close this issue out as being resolved by https://github.com/LLNL/dtcmp/pull/17, which will be included in the upcoming v0.12 release of mpiFileUtils.

Thanks again, @markmoe19 , for reporting this issue and for taking the time to work through it with me!

markmoe19 commented 9 months ago

Thanks for the fixes! mpifileutils really helps us quickly manage very large amounts of data!