Closed: tmsd2001 closed this issue 3 years ago.
This may be related to the memory shortage; I'll need to do some debugging.
Also, please post the output of the scan with the additional option "--debug-size 180457". This will show state transition details only for the files of size 180457, which caused the crash above.
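For example, that is just the usual scan command with the extra option appended (the path here is only a placeholder for whatever path was scanned originally):
./dupd scan --path /your/scan/path --debug-size 180457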
The same error happens with my files. With the command ./dupd scan --path /rpool/data/nas/ and an unchanged dataset, the error is reproducible on every attempt.
Log file of ./dupd scan --path /rpool/data/nas/ --debug-size 1495743, with the "Files scanned <num>" and "Processed <num>" lines removed:
dupd.log
That's a strange log, as none of the three files appear to have been identified as unique yet. I've been changing the state processing in the development branch and it clearly still has some bugs. Will continue working on it. Do you see any problems with the release branch (1.7)?
The 1.7 release does a great job.
Same here. If there's anything I can do to help troubleshoot, do let me know. I downgraded to 1.7 and am running it now. It seems to work well so far.
Definitely related to memory limits - if I didn't set them for a specific set of files, the process would just take up all system memory and get killed. When I set a memory limit (1500M, for instance), it exits like that.
Thanks for putting this together, though - what an awesome tool.
How much RAM do you have? How large does the resident size get with default memory limits vs. when you set a --buflimit?
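(If it helps to gather those numbers, one rough way is to sample ps from another terminal while the scan runs - a sketch only, assuming a Linux procps ps and that the binary is named dupd; rss/vsz are reported in KB:)
while pgrep -x dupd > /dev/null; do ps -o rss=,vsz= -C dupd; sleep 10; done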
A problem is that --buflimit only applies to the read buffers. If RAM is tight and file count is high, it is possible to use up a significant percentage of memory before starting to fill the buffers, so total memory consumption becomes excessive. I plan to start accounting for all the data structures so that it is more aware of total memory usage.
Actually, I tried removing all large programs from memory so I could run it without buflimit. I have 4GB of RAM, on Linux. The app reaches around 2.5-3GB, the buffer seems to reach 100%, and it crashes like this immediately afterwards.
I'm happy to provide more logs, if I get some pointers. 1.7 didn't seem to work either, but I'm back to running it now with more memory, and hopefully it'll go well now.
This is the status on 1.7 so far - 2GB of memory taken by the app, still stable:
Files: 102667 0 errors 35 sms Sets : 1589/ 24779 4262213K ( 24355K/s) 0q 94%B 4222f 175 sms
Fingers crossed that now it works until the end :)
Actually no, it finished at:
Files: 102667 0 errors 35 sms Sets : 24375/ 24779 244811845K ( 45444K/s) 0q 74%b 846f 5387 serror: unable to allocate read buffer, sorry!
I'll try 1.7 with buflimit and see how it goes.
The uppercase B in "94%B" means that at that point it had already hit the buflimit and was trying to free up memory. The %b goes up as the buffers fill up. If it reaches 100%b it'll switch to %B and should start going down (slowly).
From many data points, it does seem 4GB isn't quite enough to run it unless the file/set count is low (it's more about how scattered the data is, not strictly file count, but that works as an approximation).
If you feel like experimenting, try version 1.6 with the --ssd option. I took it out in 1.7 to avoid maintaining duplicate code paths, but the --ssd scan is always sequential so it needs a lot less buffer space. It may be that I need to reintroduce a form of linear scan for when memory is tight.
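For example, something like this (a sketch only - the path is hypothetical, and since --ssd was removed in 1.7 this applies only to a 1.6 build):
./dupd scan --path /some/data --ssd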
Certainly, I shall!
To be clear, this has worked perfectly fine in most of the cases I had used it for so far, and the latest version's performance is just awesome. I'm just raising this here in case I can help troubleshoot in any way - I am very thankful for what you put together here, and how impressive it is!
Thank you.
Just to update that 1.6 with --ssd option was what worked for me. Thank you!
Master and 1.7, even with more memory, ended up failing for some reason - at least memory usage never got even close to the max. I am using a 32-bit OS; I'm unsure if that'd be a factor.
Thanks once again and have a great weekend! If there's anything I can do to help let me know, I'm game.
Another data point:
Built from https://github.com/jvirkki/dupd/commit/62f2798743b0acc0424a1b8c33603f023df6fb4b
10GB RAM free (of 32GB), 29GB of swap free
Linux version 4.9.0-8-amd64 ... SMP Debian 4.9.130-2 (2018-10-27)
Debian 9.13 (yes, I know, upgrade it already)
2.4TB folder, 219000 files
1 CPU, 6 cores
ZFS
# ./dupd scan -p /dir1
Files: 207520 0 errors 21 s
Sets : 15166/ 62814 204354689K ( 59561K/s) 0q 88%B 97062f 3431 serror: set entry state FS_UNIQUE but current state is FS_BUFFER_READY
----- dump path block list for size 1287306 -----
bad state
head: 0x7feff163a989
last_elem: 0x7feff163aa0b
list_size: 2
wanted_bufsize: 524288
buffer_ready: 1
state: PLS_NEED_DATA
hash_passes: 1
have_cached_hashes: 0
sizelist back ptr: 0x7fefe4a41250
forward ptr back to me: 0x7feff163a989
first_elem: 0x7feff163a9a9
--entry 1
file state: FS_NEED_DATA
filename_size: 18
dir: 0x7feff257d864
fd: 52322
next: 0x7feff163aa0b
buffer: 0x7fef09e41e70
bufsize: 65536
data_in_buffer: 0
file_pos: 65536
next_read_byte: 65536
next_buffer_pos: 0
next_read_block: 0
blocks: 0x7fefe4995d10
hash_ctx: 0x7fefd822f900
BLOCK LIST: count=1
[0] start_pos: 0 , len: 1287306 , block: 134778
filename (direct read): [file-with-18-character-filename.jpg]
built path: [/dir1/1/file-with-75-character-filename-and-path.jpg]
--entry 2
file state: FS_BUFFER_READY
filename_size: 18
dir: 0x7feff2580eab
fd: 8123
next: (nil)
buffer: 0x7fed26a00570
bufsize: 524288
data_in_buffer: 524288
file_pos: 589824
next_read_byte: 589824
next_buffer_pos: 524288
next_read_block: 0
blocks: 0x7fefe4a41e70
hash_ctx: 0x7fefd822f960
BLOCK LIST: count=1
[0] start_pos: 0 , len: 1287306 , block: 13617
filename (direct read): [file-with-18-character-filename.jpg]
built path: [/dir1/2/file-with-56-character-filename-and-path.jpg]
counted entries: 2
valid entries: 2
-----
For now please use the most recent released version (build from the 1.7 tag).
The dev code is a work in progress, and while I generally try to keep it in usable shape, it can have bugs. Right now 2.0-dev (the master branch) is in the middle of some state management refactoring and has the bug described in this issue.
(Good news is I ran into this on one data set on one of my machines, so I can finally reproduce it locally. Been too busy for months to have time to do so, but will debug it when I have a moment. Meanwhile it is best to use the 1.7 release.)
Thanks jvirkki - 1.7 is working. It's at 20% after ~40 minutes, using >16GB of RAM, so I'm not sure this program fits this use case (I'm sure I'll find other uses for it). It was using some swap (SSD), and less CPU than jdupes (probably related).
# ./dupd scan -p /smallset
Files: 207812 0 errors 20 s
Sets : 22201/ 62824 234327418K ( 98955K/s) 0q 38%b 43310f 2368 s
It's a 2.4TB data set (is the reported size in KB, or in units of 10 bytes?) - something seems off with that number.
jdupes, on the other hand, took 143 minutes but topped out at ~50MB of RAM:
# time jdupes -r /smallset > dupes
Scanning: 216672 files, 2625 items (in 1 specified)
real 143m3.863s
user 12m54.680s
sys 20m2.492s
USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
me 32110 23.3 0.1 49996 47016 pts/4 D+ 07:45 30:15 jdupes -r /smallset
Thanks again,
Yes, dupd uses a lot of RAM for buffer caches (up to 50% of RAM by default for the cache, not including other internal data structures). So over 16GB is expected on your 32GB machine.
You can use --buflimit (e.g. --buflimit 8G) to constrain the buffer cache size, but unless you're memory constrained it's more often best to let it use the memory.
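For example, on the scan shown above that would look something like this (the 8G value is just the example figure from above):
./dupd scan -p /smallset --buflimit 8G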
For anyone who has experienced the "set entry state FS_UNIQUE but current state is FS_BUFFER_READY" error, I just added a change which fixes it for (at least) the data set where I was able to reproduce the behavior. If you see this error on your data sets going forward please update this bug.
I believe this is fixed as noted in previous comment (in dev code for now, not in a released version yet). If anyone encounters it in the future please file new bug with details.
Version 1.7 works fine.
Version 2.0-dev gives the error: set entry state FS_UNIQUE but current state is FS_BUFFER_READY