mhx / dwarfs

A fast high compression read-only file system for Linux, Windows and macOS
GNU General Public License v3.0

mkdwarfs aborted with SIGBUS after around 13 hours of runtime #45

Closed ghost closed 3 years ago

ghost commented 3 years ago

Here's the log:

nabla@satella /media/veracrypt1/squash $ mkdwarfs -i /media/veracrypt1/squash/mp/ -o "/run/media/nabla/General Store/TEMP/everything.dwarfs"
I 17:46:07.266160 scanning /media/veracrypt1/squash/mp/
E 18:14:36.699276 error reading entry: readlink('/media/veracrypt1/squash/mp//raid0array0-2tb-2018.sqsh/Program Files (x86)/Internet Explorer/ExtExport.exe'): Invalid argument
I 19:27:28.763515 assigning directory and link inodes...
I 19:27:29.319281 waiting for background scanners...
⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯
scanning: /media/veracrypt1/squash/mp//pucktop-echidna-dec2020.sqsh/.local/share/Steam/steamapps/common/Half-Life 2/hl2/bin/server.so
694746 dirs, 299340/1488 soft/hard links, 1254591/5749940 files, 0 other
original size: 1.352 TiB, dedupe: 200.6 GiB (364325 files), segment: 0 B
filesystem: 0 B in 0 blocks (0 chunks, 888778/5384127 inodes)
compressed filesystem: 0 blocks/0 B written
▏                                                                                                                                ▏  0% /
*** Aborted at 1619766236 (Unix time, try 'date -d @1619766236') ***
*** Signal 7 (SIGBUS) (0x7fe8f3af8000) received by PID 15018 (pthread TID 0x7fe94c3e8640) (linux TID 15042) (code: nonexistent physical address), stack trace: ***
/usr/lib64/libfolly.so.0.58.0-dev(+0x2b64bf)[0x7fe9599e54bf]
/usr/lib64/libfolly.so.0.58.0-dev(_ZN5folly10symbolizer21SafeStackTracePrinter15printStackTraceEb+0x31)[0x7fe959924471]
/usr/lib64/libfolly.so.0.58.0-dev(+0x1f6112)[0x7fe959925112]
/lib64/libc.so.6(+0x396cf)[0x7fe9592286cf]
/usr/lib64/libxxhash.so.0(XXH3_64bits_update+0x774)[0x7fe958c6d584]
/usr/lib64/libdwarfs.so(+0x788cd)[0x7fe959e4f8cd]
/usr/lib64/libdwarfs.so(_ZN6dwarfs4file4scanERKSt10shared_ptrINS_4mmifEERNS_8progressE+0x95)[0x7fe959e5b525]
/usr/lib64/libdwarfs.so(+0xe9a89)[0x7fe959ec0a89]
/usr/lib64/libdwarfs.so(+0xf7f6b)[0x7fe959ecef6b]
/usr/lib/gcc/x86_64-pc-linux-gnu/10.2.0/libstdc++.so.6(+0xd315f)[0x7fe95949f15f]
/lib64/libpthread.so.0(+0x7fbd)[0x7fe959142fbd]
/lib64/libc.so.6(clone+0x3e)[0x7fe9592ee26e]
(safe mode, symbolizer not available)
Bus error

I left this running while trying to compress over 8 TiB of data, and after about 13 hours of scanning, it just sorta crashed and gave up. I don't really want to run it again to debug it or anything, so I'm just going to leave this here.

Running Gentoo Linux on a Ryzen 5 3600 with 64 GB of memory, if that helps.

Sorry about the lack of information. I'd really like to provide more, and if there's anything you'd like me to try to resolve this, let me know (I really like dwarfs, and was hoping it would work for this obscenely large dataset too!). Just uhhh.. keep in mind that I'm prooobably not going to wait 13 hours again unless I know it works :/

EDIT: Forgot to specify my version number. Whoops. I'm using 0.5.4-r2 from the GURU repository here: https://github.com/gentoo/guru/blob/master/sys-fs/dwarfs/dwarfs-0.5.4-r2.ebuild - I did build with -O3, but dwarfs seems to work just fine with smaller inputs, so I dunno. Specifically, I'm using this: https://github.com/InBetweenNames/gentooLTO

ghost commented 3 years ago

Okay, I tested mkdwarfs on another dataset around 220 GiB in size - it worked just fine. Perhaps I just fed dwarfs enough data to cause an integer overflow or something with the 8 TiB+ run... or maybe I just had it running long enough for a cosmic ray to corrupt memory. Regardless, here's the successfully completed log for the 220 GiB dataset, if it helps:

nabla@satella /media/veracrypt1/squash $ mkdwarfs -i "/run/media/nabla/General Store/TEMP/" -o "/run/media/nabla/General Store/minceraftserverthing.dwarfs" -l9
I 16:47:33.034100 scanning /run/media/nabla/General Store/TEMP/
I 16:47:33.496939 assigning directory and link inodes...
I 16:47:33.524731 waiting for background scanners...
I 18:21:04.351393 scanning CPU time: 1163s
I 18:21:04.351453 finalizing file inodes...
I 18:21:04.439711 saved 114.3 GiB / 219.4 GiB in 73539/157486 duplicate files
I 18:21:04.440218 assigning device inodes...
I 18:21:04.442932 assigning pipe/socket inodes...
I 18:21:04.445074 building metadata...
I 18:21:04.445118 building blocks...
I 18:21:04.445126 saving names and symlinks...
I 18:21:04.445940 using a 4 KiB window at 256 B steps for segment analysis
I 18:21:04.445980 bloom filter size: 512 KiB
I 18:21:04.446369 ordering 83947 inodes using nilsimsa similarity...
I 18:21:04.450168 nilsimsa: depth=20000 (1000), limit=255
I 18:21:04.493646 updating name and link indices...
I 18:21:04.556433 pre-sorted index (660462 name, 2118 path lookups) [105.7ms]
I 18:21:04.651090 83947 inodes ordered [204.7ms, 202.3ms CPU]
I 18:21:04.651144 waiting for segmenting/blockifying to finish...
I 19:21:07.901863 segmenting/blockifying CPU time: 1594s
I 19:21:07.901944 bloom filter reject rate: 89.797% (TPR=0.167%, lookups=109509201638)
I 19:21:07.901976 segmentation matches: good=32324, bad=18851263, total=18993636
I 19:21:07.902009 segmentation collisions: L1=0.440%, L2=0.212% [429002344 hashes]
I 19:21:07.902580 saving chunks...
I 19:21:07.942344 saving directories...
I 19:21:07.987058 saving shared files table...
I 19:21:08.011199 saving names table... [13.83ms]
I 19:21:08.011945 saving symlinks table... [424.9us]
I 19:21:08.127737 waiting for compression to finish...
I 19:21:18.980702 compressed 219.4 GiB to 50.78 GiB (ratio=0.231472)
I 19:21:19.030169 compression CPU time: 2.522e+04s
I 19:21:19.030253 filesystem created without errors [9226s]
⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯
waiting for block compression to finish
133 dirs, 0/0 soft/hard links, 157486/157486 files, 0 other
original size: 219.4 GiB, dedupe: 114.3 GiB (73539 files), segment: 2.77 GiB
filesystem: 102.3 GiB in 1637 blocks (144112 chunks, 83947/83947 inodes)
compressed filesystem: 1637 blocks/50.78 GiB written [depth: 20000]
█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▏100% -
ghost commented 3 years ago

Aaaand I tried it on a larger dataset and it crashed again. This uhh.. isn't particularly stable or predictable... this time it's complaining about LZMA, for some reason.

nabla@satella /media/veracrypt1/squash $ mkdwarfs -i "/run/media/nabla/General Store/TEMP/" -o "/run/media/nabla/General Store/feb2020vms-loose-cmpother-vms-apr2021.dwarfs" -l9
I 08:02:00.845967 scanning /run/media/nabla/General Store/TEMP/
I 08:04:49.002359 assigning directory and link inodes...
I 08:04:49.073915 waiting for background scanners...
I 12:34:42.457547 scanning CPU time: 5943s
I 12:34:42.458132 finalizing file inodes...
I 12:34:43.333456 saved 38.44 GiB / 607.9 GiB in 521740/720692 duplicate files
I 12:34:43.334005 assigning device inodes...
I 12:34:43.351719 assigning pipe/socket inodes...
I 12:34:43.369103 building metadata...
I 12:34:43.369193 building blocks...
I 12:34:43.369217 saving names and symlinks...
I 12:34:43.369366 using a 4 KiB window at 256 B steps for segment analysis
I 12:34:43.369716 bloom filter size: 512 KiB
I 12:34:43.370459 ordering 198952 inodes using nilsimsa similarity...
I 12:34:43.375888 nilsimsa: depth=20000 (1000), limit=255
I 12:34:43.584667 pre-sorted index (742793 name, 26468 path lookups) [202.2ms]
I 12:34:43.593826 updating name and link indices...
I 12:36:04.472356 198952 inodes ordered [81.1s, 58.12s CPU]
I 12:36:04.472416 waiting for segmenting/blockifying to finish...
⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯
writing: /run/media/nabla/General Store/TEMP//Feb2020VMs/vma/vzdump-qemu-120-2020_02_07-19_33_51.vma
181020 dirs, 80616/0 soft/hard links, 720692/720692 files, 0 other
original size: 607.9 GiB, dedupe: 38.44 GiB (521740 files), segment: 7.69 GiB
filesystem: 66.11 GiB in 1057 blocks (2610253 chunks, 1/198952 inodes)
compressed filesystem: 1042 blocks/43.5 GiB written [depth: 20000]
████████████████████████████████████████████████████████████████████████▊                                                        ▏ 56% /
terminate called after throwing an instance of 'dwarfs::runtime_error'
  what():  LZMA: unknown error 0
*** Aborted at 1619871614 (Unix time, try 'date -d @1619871614') ***
*** Signal 6 (SIGABRT) (0x3e800006f0f) received by PID 28431 (pthread TID 0x7f4492ff6640) (linux TID 28441) (maybe from PID 28431, UID 1000) (code: -6), stack trace: ***
/usr/lib64/libfolly.so.0.58.0-dev(+0x2b64bf)[0x7f449960b4bf]
/usr/lib64/libfolly.so.0.58.0-dev(_ZN5folly10symbolizer21SafeStackTracePrinter15printStackTraceEb+0x31)[0x7f449954a471]
/usr/lib64/libfolly.so.0.58.0-dev(+0x1f6112)[0x7f449954b112]
/lib64/libc.so.6(+0x396cf)[0x7f4498e4e6cf]
/lib64/libc.so.6(gsignal+0x141)[0x7f4498e4e651]
/lib64/libc.so.6(abort+0x111)[0x7f4498e37537]
/usr/lib/gcc/x86_64-pc-linux-gnu/10.2.0/libstdc++.so.6(+0x9a8a5)[0x7f449908c8a5]
/usr/lib/gcc/x86_64-pc-linux-gnu/10.2.0/libstdc++.so.6(+0xa6977)[0x7f4499098977]
/usr/lib/gcc/x86_64-pc-linux-gnu/10.2.0/libstdc++.so.6(_ZSt9terminatev+0x12)[0x7f44990989e2]
/usr/lib/gcc/x86_64-pc-linux-gnu/10.2.0/libstdc++.so.6(__cxa_throw+0x42)[0x7f4499098c82]
/usr/lib64/libdwarfs.so(+0x3ae42)[0x7f4499a37e42]
/usr/lib64/libdwarfs.so(_ZNK6dwarfs21lzma_block_compressor8compressERKSt6vectorIhSaIhEE+0x33)[0x7f4499a651c3]
/usr/lib64/libdwarfs.so(+0x938d7)[0x7f4499a908d7]
/usr/lib64/libdwarfs.so(+0xf7f6b)[0x7f4499af4f6b]
/usr/lib/gcc/x86_64-pc-linux-gnu/10.2.0/libstdc++.so.6(+0xd315f)[0x7f44990c515f]
/lib64/libpthread.so.0(+0x7fbd)[0x7f4498d68fbd]
/lib64/libc.so.6(clone+0x3e)[0x7f4498f1426e]
(safe mode, symbolizer not available)
Aborted

It did get past the scanning phase this time, though.

mhx commented 3 years ago

Hi and thanks for your report!

It's entirely possible that you've hit some limit. I don't think I've ever tried input datasets of more than a few hundred GiB. That being said, it looks like you're still a few orders of magnitude below the first limit that I'd expect anyone to hit at some point.

The stack traces do look a little odd, tbh. The first one seems to originate in xxHash, and the second, as you've pointed out, in the LZMA code. Quite honestly, I've no idea what could be causing these.

Could you do me a favour and try this using the binaries from the release page? And when you do, could you enable core dumps (ulimit -S -c unlimited)? The core files should hopefully make these crashes easier to debug.

ghost commented 3 years ago

Alright, I'll give this a go and let you know what happens. Will try this on the last dataset I described as failing, with the same input options. (This is an absolutely gigantic amount of data so this might take a while haha)

By the way, is there any way I can send you the core dumps privately? I have no idea how much identifying information they'll contain - particularly of the data I'm compressing...

mhx commented 3 years ago

By the way, is there any way I can send you the core dumps privately? I have no idea how much identifying information they'll contain - particularly of the data I'm compressing...

Depends on the size of the (compressed) dump, which I'd imagine to be quite large. Can you put it somewhere I can download it and send me a link via dwarfs(at)mhxnet(dot)de?

ghost commented 3 years ago

Yeah, that's fine. I'll leave this running overnight, and if it crashes again I'll upload the compressed dump to MEGA or something and email you the link. It's still scanning right now and hasn't crashed yet. If this run finishes successfully in the morning, I'll re-run it with the binary I built through portage and send you that core dump instead - if that happens, it might genuinely be some bug caused by building with -O3 or something, although those are pretty rare in my experience.

mhx commented 3 years ago

WRT -O3: gcc actually produces much worse code for mkdwarfs in particular with -O3 compared to -O2, see #14. I've tried to work around the problem by forcing -O2 when building with gcc; I'm not sure if that was overridden in your build, though.

mhx commented 3 years ago

Quick question: is the source directory tree completely static while you're running mkdwarfs? Or is there a chance of anything in the tree being changed? I remember there was an issue in the past where mkdwarfs would crash if files were removed from the source tree while it was running. This is something I haven't really put much thought into as it doesn't seem like a very common use case (yet).

That being said, I'm just trying myself to pack several terabytes of data, we'll see how it goes. I've definitely successfully packed more files before (I think something around 14 million files), albeit much smaller ones (around 200 GiB total).

ghost commented 3 years ago

Okay, so mkdwarfs has crashed again at exactly the same point and with exactly the same error/output as before (the only difference being that there's no stack trace and a core dump was produced instead), this time using the latest release binary instead of my own build. As for the source directory tree being completely static: yes, it is. Nothing changes while mkdwarfs is running; the source data is read-only.

I am currently uploading the compressed core dump to MEGA - ETA is around 2 hours at most. I'll email you the link and any additional info when it's ready.

In the meantime, here's the output again:

nabla@satella /media/veracrypt1/squash $ "/home/nabla/Downloads/dwarfs/dwarfs-0.5.4-Linux/bin/mkdwarfs" -i "/run/media/nabla/General Store/TEMP/" -o "/run/media/nabla/General Store/feb2020vms-loose-cmpother-vms-apr2021.dwarfs" -l9
I 15:31:42.852920 scanning /run/media/nabla/General Store/TEMP/
I 15:34:29.601186 assigning directory and link inodes...
I 15:34:29.682169 waiting for background scanners...
I 20:13:53.793678 scanning CPU time: 5617s
I 20:13:53.795357 finalizing file inodes...
I 20:13:54.470505 saved 38.44 GiB / 607.9 GiB in 521740/720692 duplicate files
I 20:13:54.471414 assigning device inodes...
I 20:13:54.487971 assigning pipe/socket inodes...
I 20:13:54.504635 building metadata...
I 20:13:54.504668 building blocks...
I 20:13:54.504705 saving names and symlinks...
I 20:13:54.505205 using a 4 KiB window at 256 B steps for segment analysis
I 20:13:54.505235 bloom filter size: 512 KiB
I 20:13:54.506249 ordering 198952 inodes using nilsimsa similarity...
I 20:13:54.512030 nilsimsa: depth=20000 (1000), limit=255
I 20:13:54.650058 updating name and link indices...
I 20:13:54.780504 pre-sorted index (743986 name, 26348 path lookups) [268.4ms]
I 20:15:06.648352 198952 inodes ordered [72.14s, 64.77s CPU]
I 20:15:06.648712 waiting for segmenting/blockifying to finish...
⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯
writing: /run/media/nabla/General Store/TEMP//Feb2020VMs/vma/vzdump-qemu-120-2020_02_07-19_33_51.vma
181020 dirs, 80616/0 soft/hard links, 720692/720692 files, 0 other
original size: 607.9 GiB, dedupe: 38.44 GiB (521740 files), segment: 7.69 GiB
filesystem: 66.31 GiB in 1060 blocks (2610257 chunks, 1/198952 inodes)
compressed filesystem: 1042 blocks/43.5 GiB written [depth: 20000]
████████████████████████████████████████████████████████████████████████▊                                                        ▏ 56% /
terminate called after throwing an instance of 'dwarfs::runtime_error'
  what():  LZMA: unknown error 0
*** Aborted at 1619898660 (Unix time, try 'date -d @1619898660') ***
*** Signal 6 (SIGABRT) (0x3e800002788) received by PID 10120 (pthread TID 0x7ff51ddfc700) (linux TID 10124) (maybe from PID 10120, UID 1000) (code: -6), stack trace: ***
Aborted (core dumped)

The file listed as "writing" (/run/media/nabla/General Store/TEMP//Feb2020VMs/vma/vzdump-qemu-120-2020_02_07-19_33_51.vma) is ~79 GiB.

ghost commented 3 years ago

Okay, the core dump has been uploaded, I've emailed you everything.

mhx commented 3 years ago

Good news: the core dump gave me enough insight that I can now reproduce the problem!

This has nothing to do at all with the size of your input data. Instead, it looks like (some of) the data has incredibly high entropy and it's giving lzma a really bad time:

(gdb) p s
$3 = {next_in = 0x7ff4c2800f40 "", avail_in = 0, total_in = 67108864, next_out = 0x7ff398401910 "", avail_out = 0, total_out = 67112080, allocator = 0x0, internal = 0x0, 
  reserved_ptr1 = 0x0, reserved_ptr2 = 0x0, reserved_ptr3 = 0x0, reserved_ptr4 = 0x0, reserved_int1 = 0, reserved_int2 = 0, reserved_int3 = 0, reserved_int4 = 0, 
  reserved_enum1 = LZMA_RESERVED_ENUM, reserved_enum2 = LZMA_RESERVED_ENUM}

You can see that lzma, while trying to compress a 67108864-byte (64 MiB) block, has actually used up every single byte of the 67112080-byte (64 MiB + 3216 bytes) output buffer. That buffer size is the worst-case estimate that lzma had determined upfront.

I can reproduce the problem by creating a single 64 MiB file from /dev/urandom and trying to run mkdwarfs on it. Interestingly, the problem doesn't reproduce with smaller block sizes.
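
If you want to see the liblzma side of this in isolation, here's a tiny standalone C++ sketch (illustration only, not the mkdwarfs code; the buffer sizing and preset here are just assumptions for the demo). It shows that for essentially incompressible input, the compressed stream comes out about as large as, or slightly larger than, the input, which is why the output buffer needs to be sized via lzma_stream_buffer_bound() rather than via the input size:

// lzma_bound_demo.cpp - illustration only, not the mkdwarfs implementation.
// Build with: g++ -O2 lzma_bound_demo.cpp -llzma
#include <lzma.h>
#include <cstdint>
#include <cstdio>
#include <cstdlib>
#include <vector>

int main() {
  // 64 MiB of pseudo-random, i.e. essentially incompressible, data,
  // mimicking the block that exhausted the output buffer above.
  std::vector<uint8_t> in(64 * 1024 * 1024);
  for (auto& b : in) b = static_cast<uint8_t>(std::rand());

  // Worst-case compressed size as computed by liblzma itself.
  size_t const bound = lzma_stream_buffer_bound(in.size());
  std::vector<uint8_t> out(bound);
  size_t out_pos = 0;

  lzma_ret rc = lzma_easy_buffer_encode(
      9 | LZMA_PRESET_EXTREME, LZMA_CHECK_CRC32, nullptr,
      in.data(), in.size(), out.data(), &out_pos, out.size());

  // For random data, out_pos ends up very close to (or above) in.size(),
  // so any output buffer sized smaller than `bound` can run out of space.
  std::printf("rc=%d in=%zu bound=%zu compressed=%zu\n",
              static_cast<int>(rc), in.size(), bound, out_pos);
  return rc == LZMA_OK ? 0 : 1;
}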

I'll work on a fix for this, but can't promise it's going to happen in the next few days.

In any case, your core dump was incredibly helpful, so thanks again!

ghost commented 3 years ago

Oh okay, that's great news. I wonder what happened with the initial crash that I reported though... That one had nothing to do with LZMA, and occurred during the scanning phase, returning a different error. In that particular scenario, the input size really was pretty spectacular. At some point I could try re-running my first command on the super large dataset again using the latest release binary and with core dumps enabled, although that dataset has been modified since my first comment here, so the results may vary :/

mhx commented 3 years ago

Yeah, the bus error is still a bit of a mystery.

My gut feeling is that something funky is happening with the input data. For example, I typically see bus errors when running binaries over NFS. mkdwarfs uses mmap all over the place, and I wouldn't be surprised if that bus error was triggered by a memory access into an mmap'd segment that (for whatever reason) couldn't be satisfied by the OS.
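
To illustrate what I mean, here's a minimal standalone C++ sketch (not the actual dwarfs mmap wrapper) of reading a file through mmap. If the kernel can't supply one of the mapped pages, for example because the file was truncated behind our back or the backing device returned an I/O error, the failure doesn't show up as an error code from a read call; the process gets a SIGBUS right at the memory access, which would look exactly like the hash loop dying in your first stack trace:

// mmap_sigbus_demo.cpp - minimal illustration, not dwarfs code.
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>
#include <cstdio>

int main(int argc, char** argv) {
  if (argc != 2) {
    std::fprintf(stderr, "usage: %s <file>\n", argv[0]);
    return 1;
  }

  int fd = ::open(argv[1], O_RDONLY);
  if (fd < 0) { std::perror("open"); return 1; }

  struct stat st;
  if (::fstat(fd, &st) != 0) { std::perror("fstat"); return 1; }

  // Map the whole file; pages are faulted in on demand as we touch them.
  void* addr = ::mmap(nullptr, st.st_size, PROT_READ, MAP_SHARED, fd, 0);
  if (addr == MAP_FAILED) { std::perror("mmap"); return 1; }
  auto const* p = static_cast<unsigned char const*>(addr);

  // If the file shrinks underneath us or a page can't be read from the
  // device, the access below raises SIGBUS instead of returning an error.
  unsigned long sum = 0;
  for (off_t i = 0; i < st.st_size; ++i) {
    sum += p[i];
  }

  std::printf("touched %lld bytes, sum=%lu\n",
              static_cast<long long>(st.st_size), sum);
  ::munmap(addr, st.st_size);
  ::close(fd);
  return 0;
}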

mhx commented 3 years ago

This line is actually suspicious as well:

E 18:14:36.699276 error reading entry: readlink('/media/veracrypt1/squash/mp//raid0array0-2tb-2018.sqsh/Program Files (x86)/Internet Explorer/ExtExport.exe'): Invalid argument

That error would come from the code path that reads the contents of a symbolic link. However, the call fails because the file isn't actually a symbolic link. Yet that code path is only ever executed if the file's mode indicates a symbolic link (S_ISLNK(mode)). So that bit is also somewhat odd and makes me wonder whether the source data is consistent.
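
For reference, here's the shape of that code path as a standalone C++ sketch (just the POSIX calls involved, not the actual dwarfs scanner). readlink(2) fails with EINVAL precisely when the path isn't a symbolic link, so seeing that error after S_ISLNK() was true means the mode bits from the scan and what the kernel sees at readlink() time don't agree:

// readlink_check.cpp - illustration of the inconsistency, not dwarfs code.
#include <sys/stat.h>
#include <unistd.h>
#include <cerrno>
#include <cstdio>
#include <cstring>

int main(int argc, char** argv) {
  if (argc != 2) {
    std::fprintf(stderr, "usage: %s <path>\n", argv[0]);
    return 1;
  }

  struct stat st;
  if (::lstat(argv[1], &st) != 0) { std::perror("lstat"); return 1; }

  if (S_ISLNK(st.st_mode)) {
    char buf[4096];
    ssize_t n = ::readlink(argv[1], buf, sizeof(buf) - 1);
    if (n < 0) {
      // EINVAL here means "not a symbolic link", i.e. the mode we just
      // looked at and what the kernel reports now are inconsistent.
      std::fprintf(stderr, "readlink('%s'): %s\n", argv[1], std::strerror(errno));
      return 1;
    }
    buf[n] = '\0';
    std::printf("symlink -> %s\n", buf);
  } else {
    std::printf("not a symlink (mode: %o)\n",
                static_cast<unsigned>(st.st_mode & S_IFMT));
  }
  return 0;
}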

ghost commented 3 years ago

Oh, the source data is a real ugly mess. There's so much horribly mangled stuff in there that I wouldn't be surprised if a lot of it drives mkdwarfs right up the wall. For context, the data I was trying to compress with my first command was a series of backups of all of my storage devices, made over the course of almost an entire decade; the number of things that could go wrong with that is actually what made me afraid to use DwarFS for this in the first place. SquashFS has worked for everything so far, but I presume SquashFS also just generally gets a lot more testing, so people have probably picked up on the kinds of issues my data would've caused beforehand.

Currently, all of the source data is stored in a series of SquashFS archives. My goal with DwarFS was to take all of the data in these separate SquashFS archives, dedupe it, and produce a single DwarFS archive out of the multiple SquashFS ones.

mhx commented 3 years ago

The second issue has been fixed with the latest release.

For context, the data I was trying to compress with my first command was a series of backups of all of my storage devices, made over the course of almost an entire decade; the number of things that could go wrong with that is actually what made me afraid to use DwarFS for this in the first place. SquashFS has worked for everything so far, but I presume SquashFS also just generally gets a lot more testing, so people have probably picked up on the kinds of issues my data would've caused beforehand.

Yeah, that's for sure. I'm pretty certain there's no way that DwarFS would ever corrupt your source data, but I wouldn't be surprised to see more bugs triggered by source data.

mhx commented 3 years ago

WRT the first issue, I've been running mkdwarfs on 26 TiB of data in the meantime without any problems. Given that the stack trace points to the bus error happening in a hashing library, I'm inclined to think it's either something that would reproduce with a simple test case or a hardware issue.

ghost commented 3 years ago

I'm not so sure what specific property of the source data caused the first issue. I re-ran mkdwarfs on /media/veracrypt1/squash/mp//pucktop-echidna-dec2020.sqsh/.local/share/Steam/steamapps/common/, one of the parent directories specified in the "scanning" line just before it crashed, and the process finished successfully with both the release binary and my own build. server.so, the file it supposedly crashed on, and everything else in the same directory were scanned just fine both times. I'm really not sure how to reproduce the first issue without making mkdwarfs scan eeeeeeeeeeeeeeeeeverything again... and the directory I fed it this time was already massive!

mhx commented 3 years ago

server.so isn't necessarily what caused the crash. That path is only updated once every 200 ms. Also, mkdwarfs typically scans as many files in parallel as you have (virtual) cores, but during the scanning phase the input files are processed more or less in the order of discovery, so the one that presumably caused the issue should be nearby.

I'm going to close this for now as I have no idea what could have caused this other than a low-level "disappearance" of the input data. If you happen to discover this again, please reopen the issue.