pkolaczk / fclones

Efficient Duplicate File Finder

High CPU usage while hardlinking files #138

Closed. hans-helmut closed this issue 2 years ago

hans-helmut commented 2 years ago

Hello,

While fclones was one of the few duplicate finders that managed to find all duplicates in my backup in a reasonable amount of time (under 2 days), hardlinking is very slow.

# fclones --version
fclones 0.25.0

top -H -c shows 3 threads with high CPU usage:

# top -H -c

top - 15:43:21 up 5 days, 30 min,  5 users,  load average: 6,45, 7,32, 7,28
Threads: 346 total,   5 running, 341 sleeping,   0 stopped,   0 zombie
%CPU(s): 32,8 us, 43,6 sy,  0,0 ni,  0,0 id, 23,5 wa,  0,0 hi,  0,1 si,  0,0 st
MiB Mem :  32055,4 total,    372,4 free,   6103,3 used,  25579,7 buff/cache
MiB Swap:  10240,0 total,   5483,6 free,   4756,4 used.  25498,2 avail Mem

    PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
  39209 root      20   0  346956  24516   4856 R  99,9   0,1 102:39.10 fclones link
  39210 root      20   0  346956  24516   4856 R  99,9   0,1 106:14.93 fclones link
  39211 root      20   0  346956  24516   4856 R  99,9   0,1  95:29.37 fclones link
[...]

Attaching strace -p <PID> to the threads shows that one thread is calling

statx(AT_FDCWD, "/filename/...", AT_STATX_SYNC_AS_STAT, STATX_ALL, {stx_mask=STATX_ALL|STATX_MNT_ID, stx_attributes=0, stx_mode=S_IFREG|0600, stx_size=47320, ...}) = 0

for different versions of a file, while all other threads are calling

sched_yield()                           = 0
sched_yield()                           = 0
sched_yield()                           = 0

in a loop. So there seems to be some active waiting (busy waiting) going on while the files are being deleted.

Environment

pkolaczk commented 2 years ago

This is quite likely caused by the fact that the internal sequence of commands is generated in a single thread, while the commands are processed in parallel. Generating the stream of commands becomes the bottleneck, and the deduplication threads are simply fighting for work, actively spinning (this is how rayon works).
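To illustrate the effect, here is a minimal standalone sketch (not the actual fclones code; the delay and command strings are made up): a single-threaded producer feeding rayon workers, e.g. through par_bridge(), leaves the workers spinning whenever producing an item is slower than consuming it, which matches the sched_yield() loops seen in strace.

```rust
use rayon::prelude::*;
use std::{thread, time::Duration};

fn main() {
    // Sequential producer: stands in for the single thread that walks the
    // duplicate groups and stats files (the statx calls seen in strace).
    let commands = (0..10_000u64).map(|i| {
        thread::sleep(Duration::from_micros(50)); // simulated per-file cost (made up)
        format!("link dup_{i} -> original_{i}")   // made-up command text
    });

    // Parallel consumers: rayon worker threads pull items off the bridge.
    // When the producer cannot keep up, the idle workers spin/yield waiting
    // for work, showing up as sched_yield() loops and high CPU usage.
    commands.par_bridge().for_each(|cmd| {
        let _ = cmd.len(); // per-command work is cheap compared to producing it
    });
}
```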

Generating commands was made single-threaded due to another feature request about making the stream of commands match the order of the input files. I need to find a different way.

Two ideas here:

  1. Use rayon's ParallelIterator to generate the commands, but record the original position in each item and then put the commands back into the correct order when printing them (see the sketch after this list).
  2. Switch from rayon to async, which would allow a lot more flexibility.
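A rough sketch of idea 1 (hypothetical names, not the actual fclones code; generate_command and the input list are placeholders): tag each result with its input index during the parallel map, then restore the order before printing. For an indexed source such as a Vec, rayon's collect() already preserves order, so the explicit index mainly matters if results are streamed or buffered out of order.

```rust
use rayon::prelude::*;

// Hypothetical stand-in for the real per-file command generation.
fn generate_command(path: &str) -> String {
    format!("ln -f {path}")
}

fn main() {
    let inputs = vec!["a.bin", "b.bin", "c.bin", "d.bin"];

    // Generate commands in parallel, tagging each result with its input position.
    let mut commands: Vec<(usize, String)> = inputs
        .par_iter()
        .enumerate()
        .map(|(idx, path)| (idx, generate_command(path)))
        .collect();

    // Put the commands back into input order before printing them.
    commands.sort_by_key(|(idx, _)| *idx);
    for (_, cmd) in commands {
        println!("{cmd}");
    }
}
```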
hans-helmut commented 2 years ago

This seems to be the matching rayon issue.