pauldreik / rdfind

find duplicate files utility

Ways to speed up rdfind runs: Parallelization #113

Closed: PoC-dev closed this issue 1 year ago

PoC-dev commented 2 years ago

Besides other issues mentioning the wish for caching of checksum data, I see the possibility to speed up rdfind checks by either introducing threads or (probably easier to implement) forking a child process for each non-option argument given on the command line.

With that, modern multi-core CPUs and fast SSDs can be made more useful for rdfind. The trade-off is increased complexity to spawn sub-tasks and to get and merge result data from them, in memory or in temporary files.
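A rough sketch of the fork-per-argument idea might look like the following (not rdfind code; `scan_one_tree` is a hypothetical placeholder, and getting result data back from the children is left out entirely):

```cpp
#include <sys/wait.h>
#include <unistd.h>

// Hypothetical per-tree scan; rdfind's real scanning code differs.
void scan_one_tree(const char* root);

// One child process per directory tree given on the command line,
// then wait for all of them to finish.
void scan_trees_in_children(int argc, char** argv) {
    for (int i = 1; i < argc; ++i) {
        pid_t pid = fork();
        if (pid == 0) {               // child: scan one tree, then exit
            scan_one_tree(argv[i]);
            _exit(0);
        }
    }
    while (wait(nullptr) > 0) {       // parent: reap all children
    }
}
```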

This won't help for one-tree runs of rdfind, though. A possible alternative approach might be to distribute the work across n parallel workers:

Here, n might be a command line parameter -numcpu *n*, defaulting to the number of CPU cores detected. On OS flavors where detection isn't possible (or it's unknown how to do it), a sensible default might be 1, effectively reverting to the current behavior while preventing unnecessary slowdown from CPU cache thrashing when there is just a single CPU.
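As a minimal sketch of how that default could be derived (standard C++, not rdfind's actual code), `std::thread::hardware_concurrency()` already reports 0 when the core count cannot be determined:

```cpp
#include <thread>

// Default for a hypothetical -numcpu option: the detected core count,
// or 1 when detection fails (hardware_concurrency() returns 0 then),
// which keeps the current single-threaded behavior.
unsigned default_numcpu() {
    const unsigned n = std::thread::hardware_concurrency();
    return n > 0 ? n : 1;
}
```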

Opinions?

fziglio commented 2 years ago

Personally, for storage I'm still using rotational disks. In that case parallelism on disks would increase seeks, which is not good. But there's another possibility for using parallelism: separate disk activity from computation. Read from one thread and compute hashes in another thread.

Even if you are using an SSD, are you sure the bottleneck is the CPU and not disk speed? Maybe a different way of reading files could help in some cases. It may seem absurd, but in some cases (like large directories and files) the use of O_DIRECT (so basically no cache) could help.
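For illustration, a minimal O_DIRECT read on Linux could look like this (a sketch only; it assumes a 4096-byte logical block size, which is common but not guaranteed for every device):

```cpp
#include <fcntl.h>
#include <unistd.h>
#include <cstdlib>

// O_DIRECT bypasses the page cache and requires the buffer, offset
// and size to be aligned to the device's logical block size
// (assumed 4096 here).
ssize_t read_one_block_direct(const char* path) {
    const int fd = open(path, O_RDONLY | O_DIRECT);
    if (fd < 0)
        return -1;
    constexpr size_t alignment = 4096;
    void* buf = nullptr;
    if (posix_memalign(&buf, alignment, alignment) != 0) {
        close(fd);
        return -1;
    }
    const ssize_t n = read(fd, buf, alignment); // aligned read
    free(buf);
    close(fd);
    return n;
}
```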

But I think that without multiple use cases and a lot of benchmarking, these optimizations are mainly speculation.

PoC-dev commented 2 years ago

About rotating disks: I suggest(ed) having a command line parameter to (re)set parallelism to 1 as a single-disk "optimization".

I'm not so sure about the possibility of parallelizing disk and computation work. Computation needs the result of disk I/O. The easiest way to compute is to wait until each class of checks is finished and then apply the results to the remaining files in the big list. There's a trade-off between ease of implementation and clear code vs. rather complex algorithms that squeeze out even more parallelism.

I think that implementing any means of parallelism would make rdfind finish its work faster than it currently does. Further optimizations might be considered once real-world data shows how a parallelized rdfind behaves in different environments. You have a single-disk environment; I often run it in big server environments with many CPU cores, a lot of RAM, and disk arrays allowing many I/Os per second. Some spinning rust, some SSD. All in all, serialized programs like rm -r or rdfind simply can't generate I/Os fast enough to exploit the performance of modern storage in a server environment.

Rdfind has multiple stages of work. Some are clearly I/O bound, some are more CPU bound. My recommendations aren't only about I/O, but also about running more threads/child tasks to exploit parallelism for CPU-bound work, such as full checksum calculation. The bottleneck shifts depending on the class of check being done.
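For the CPU-bound stage, a rough sketch could cap the number of hashing tasks in flight at numcpu (`checksum_file` is a hypothetical stand-in for rdfind's real checksum code; scheduling and merging the results back into rdfind's file list would need more care):

```cpp
#include <cstdint>
#include <future>
#include <string>
#include <vector>

// Hypothetical stand-in for rdfind's real checksum code.
uint64_t checksum_file(const std::string& path);

// Hash many candidate files concurrently, with at most numcpu
// tasks running at once.
std::vector<uint64_t> checksum_all(const std::vector<std::string>& paths,
                                   unsigned numcpu) {
    std::vector<uint64_t> out(paths.size());
    std::vector<std::future<void>> inflight;
    for (std::size_t i = 0; i < paths.size(); ++i) {
        if (inflight.size() >= numcpu) {
            inflight.front().get();            // wait for the oldest task
            inflight.erase(inflight.begin());
        }
        inflight.push_back(std::async(std::launch::async,
            [&out, &paths, i] { out[i] = checksum_file(paths[i]); }));
    }
    for (auto& f : inflight)
        f.get();                               // drain the rest
    return out;
}
```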

So I don't understand how you came to the conclusion that my suggestions are moot. You seem to be assuming a single disk as the sole use case. If parallelism were moot, why is it so ubiquitous in so many applications? Think of the unix pipe (|). Think of Apache. And many more.

fziglio commented 2 years ago

I think you took my comments the wrong way.

I don't see how you got the idea that I concluded parallelism is moot; I even suggested a different way to implement it in rdfind. I work at a company implementing network programs dealing with hundreds of connections, so parallelism is a must. But that does not mean it's the solution to every problem.

In your initial message you spoke about "multi-core CPUs and fast SSDs", but that also applies to any recent laptop; your use case with plenty of disks and cores is far from a laptop and obviously needs different approaches. Personally, I would retain a default similar to the current behaviour and add an option to increase resource usage, instead of changing the default and adding an option to get the old behaviour back. Not everyone has dozens of cores and disks. There's also a -sleep option in rdfind to slow down the scan, which means other people have different use cases. I suppose not everyone's first goal is to run as fast as possible; maybe they want to keep working while rdfind is running.

If, in your use cases, one thread/core can't reach full speed and you want rdfind to finish as fast as possible, then yes, parallelism is the way to go. And possibly, if you have a very powerful machine, any parallelism will make the program faster. But that does not mean it's the best solution. What I tried to say is: try to understand where the bottleneck is, even in your case. Parallelism increases performance only as long as the tasks don't starve the shared resources too much; after that, performance starts to decrease. So understanding whether it's more a CPU issue or a disk issue would help in deciding how to divide the tasks and resource usage without starving any specific resource.

About separating disk and CPU activities: I don't think it's particularly complicated. It's mostly a producer/consumer with some buffer pool; it's a pretty standard pattern and you can find plenty of example implementations, as in the sketch below.
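A bare-bones version of that pattern (a sketch under obvious assumptions; `read_whole_file` and `checksum_bytes` are hypothetical placeholders):

```cpp
#include <condition_variable>
#include <cstdint>
#include <mutex>
#include <queue>
#include <string>
#include <utility>
#include <vector>

// Hypothetical placeholders for the real I/O and hashing code.
std::vector<char> read_whole_file(const std::string& path);
uint64_t checksum_bytes(const std::vector<char>& data);

struct Chunk { std::string path; std::vector<char> data; };

std::queue<Chunk> queue_;
std::mutex m_;
std::condition_variable cv_;
bool done_ = false;

// Producer: the disk-bound side, reads files and queues their contents.
void producer(const std::vector<std::string>& paths) {
    for (const auto& p : paths) {
        Chunk c{p, read_whole_file(p)};        // read outside the lock
        std::lock_guard<std::mutex> lk(m_);
        queue_.push(std::move(c));
        cv_.notify_one();
    }
    { std::lock_guard<std::mutex> lk(m_); done_ = true; }
    cv_.notify_one();
}

// Consumer: the CPU-bound side, hashes whatever the producer queued.
void consumer() {
    for (;;) {
        std::unique_lock<std::mutex> lk(m_);
        cv_.wait(lk, [] { return !queue_.empty() || done_; });
        if (queue_.empty())
            break;                             // producer done, queue drained
        Chunk c = std::move(queue_.front());
        queue_.pop();
        lk.unlock();
        checksum_bytes(c.data);                // hash outside the lock
    }
}
```

A real version would bound the queue (that's where the buffer pool comes in), so the reader can't run arbitrarily far ahead of the hasher and exhaust memory.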

PoC-dev commented 1 year ago

Apparently the project has no interest in introducing parallelization in rdfind to provide faster results. No comments, apart from one person constantly questioning my arguments. Closing.