cehteh closed this issue 1 year ago.
Which version of fd are you using?
$ fd --version
fd 8.3.0
installed by cargo install
Ah, I thought that we had set a limit on the channel size in #885, but it looks like that part got removed. So, yes, there is an unbounded queue of files and directories to process, and that could cause high memory usage if the terminal can't keep up. Maybe we should revisit adding a bound to that. I believe with the std mpsc crate there is a performance cost to doing that, I'm not sure about crossbeam.
but it looks like that part got removed
Right. I removed it because you wrote "I switch back to using channel instead of sync_channel than it is, at least not any slower". And I believe I was able to confirm this in my own benchmarks.
Maybe we should revisit adding a bound to that. I believe with the std mpsc crate there is a performance cost to doing that, I'm not sure about crossbeam.
Yes, let's do this in a dedicated PR with new benchmark results (and memory profiles). A quick fix could be to use a much higher limit than the one suggested in #885.
hint: in my own directory traversal code I found that about 4k entries per thread (for memory calculation, not one channel per thread) is something of a sweet spot. Actually the limit isn't terribly important as long as it doesn't explode; even with a 100MB limit it should be okay-ish.
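The per-thread sizing described above could be sketched with std's sync_channel. Everything here (the run_bounded helper, the thread count, the item type) is illustrative, not fd's actual code:

```rust
use std::sync::mpsc;
use std::thread;

/// Send `n` items through a bounded channel sized at ~4096 entries per
/// worker thread, and count what arrives on the other end.
fn run_bounded(threads: usize, n: usize) -> usize {
    // ~4k queued entries per thread before senders start blocking
    let capacity = 4096 * threads;
    let (tx, rx) = mpsc::sync_channel::<usize>(capacity);

    let producer = thread::spawn(move || {
        for i in 0..n {
            // Blocks once the queue holds `capacity` items: this is the
            // backpressure an unbounded channel lacks.
            tx.send(i).unwrap();
        }
        // `tx` is dropped here, closing the channel so `rx` iteration ends.
    });

    let count = rx.into_iter().count();
    producer.join().unwrap();
    count
}

fn main() {
    let received = run_bounded(8, 100_000);
    println!("received {received} items");
}
```

With capacity = 4096 * 8 threads and 8-byte items, the queue itself is bounded to a few hundred kilobytes rather than growing without limit.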
Is this the cause of fd outputting "Killed" and then quitting when I search for all files with a specific extension from the root (Arch Linux, ~9/~14TB used space to search through across 5 mounted partitions)?
@mwaitzman It might be, do you see anything about the OOM killer in dmesg?
Yes, I see this when running a similar command:
[ 4279.148487] Out of memory: Killed process 7330 (fd) total-vm:61310736kB, anon-rss:24569660kB, file-rss:0kB, shmem-rss:0kB, UID:0 pgtables:111272kB oom_score_adj:0
[ 4285.522887] oom_reaper: reaped process 7330 (fd), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
Okay yeah, I think this is probably due to the lack of backpressure with the unbounded channel. It's unfortunate that std::sync::mpsc::sync_channel() (the bounded queue) performs so much worse than channel() (unbounded). The SPSC case (-j1) is 10-40% worse with bounded queues, and the performance is worse for higher capacities, so tuning the capacity is crucial. With a capacity of 4096 * threads, bounded queues match unbounded queues at -j8 but not with fewer threads.
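The bounded-vs-unbounded gap can be reproduced with a rough SPSC micro-benchmark along these lines. This is a sketch, not fd's benchmark suite; absolute numbers will vary by machine:

```rust
use std::sync::mpsc;
use std::thread;
use std::time::{Duration, Instant};

/// Time pushing `n` items through an unbounded channel (SPSC, like -j1).
fn time_unbounded(n: usize) -> Duration {
    let (tx, rx) = mpsc::channel::<usize>();
    let start = Instant::now();
    let producer = thread::spawn(move || {
        for i in 0..n {
            tx.send(i).unwrap();
        }
    });
    let received = rx.into_iter().count();
    producer.join().unwrap();
    assert_eq!(received, n);
    start.elapsed()
}

/// Same workload through a bounded channel with the given capacity.
fn time_bounded(n: usize, capacity: usize) -> Duration {
    let (tx, rx) = mpsc::sync_channel::<usize>(capacity);
    let start = Instant::now();
    let producer = thread::spawn(move || {
        for i in 0..n {
            tx.send(i).unwrap();
        }
    });
    let received = rx.into_iter().count();
    producer.join().unwrap();
    assert_eq!(received, n);
    start.elapsed()
}

fn main() {
    let n = 1_000_000;
    println!("unbounded:          {:?}", time_unbounded(n));
    println!("bounded (cap 4096): {:?}", time_bounded(n, 4096));
}
```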
I'd like to figure out https://github.com/sharkdp/fd/issues/933.
Just an observation I did with other code:
Walking directories can be very fast when they are in RAM (obviously), but otherwise it is pretty much I/O-bound. And here lies the trick: since you are not CPU-bound but I/O-bound, you can have a lot more threads than you have cores, which in turn gives the kernel the opportunity to schedule requests in a much better way. I've tested up to 128 threads on a 12/24-core machine and still seen improvements (in my code, not the fd tree-walking code). This can by far outweigh the performance degradation you get from bounded channels.
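The oversubscription effect described above is easy to demonstrate with simulated blocking I/O. The sleep stand-in and the thread counts here are illustrative only:

```rust
use std::thread;
use std::time::{Duration, Instant};

/// Run `tasks` simulated blocking I/O operations split across `threads`
/// worker threads, returning the wall-clock time for the whole batch.
fn run_with_threads(tasks: usize, threads: usize) -> Duration {
    let start = Instant::now();
    let per_thread = tasks / threads;
    let handles: Vec<_> = (0..threads)
        .map(|_| {
            thread::spawn(move || {
                for _ in 0..per_thread {
                    // Stand-in for a blocking stat()/readdir() call
                    thread::sleep(Duration::from_millis(1));
                }
            })
        })
        .collect();
    for h in handles {
        h.join().unwrap();
    }
    start.elapsed()
}

fn main() {
    // Because the tasks block rather than burn CPU, 64 threads finish the
    // same work far faster than 4, even on a machine with few cores.
    println!("4 threads:  {:?}", run_with_threads(256, 4));
    println!("64 threads: {:?}", run_with_threads(256, 64));
}
```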
Experiencing the exact same issue, but with --exec. I'm trying to run fd "png$" / --exec optipng -o 7 (I know that this usage of optipng may break stuff, I know what I'm doing). I've tried leaving different variations running on my laptop for the night, finding it absolutely frozen every time I wake up, arriving at fd "png$" --threads=1 --exec sh -c 'optipng -o 7 "{}"; sync; sleep 1m' to try to minimize impact. At this point I've checked with a task manager, and it's indeed fd ballooning.
Currently on fd 8.4.0 from the Arch Linux repositories. Is there anything I can do on my end to do what I want, besides resorting to find?
@Architector4 you might have some luck filtering out some directories. It looks like you are running on /, so excluding /proc, /sys, /tmp, etc., would likely make a difference (if you are on a Mac, those paths might be different). If the files you care about are all on the same filesystem, you could try the --one-file-system option.
@tmccombs I don't think it would make any meaningful difference in my case -- I'm looking for all files that end with "png", of which there would not be any in such special directories. I imagine (I hope?) that fd does not put files not matching the search terms into that ballooning cache.
I'd also prefer to go through all filesystems (my SSD mounted on / and my HDD mounted in /media), so I don't want to use that option either. But yeah, those are good ways to help alleviate the problem for other cases where applicable.
In my case I decided to resort to find and have had good success with it. This is what I ended up running to achieve the same task without running out of memory, with GNU find and parallel:
find / -iname "*.png" -print0 | parallel -0 optipng -o 7 {}
Oh, I think you're right. In your case, what is probably happening is that the queues are filling up because running optipng is blocking the processing of additional entries, and there isn't any backpressure to stop new entries being added to the queue. If you were to use fd piped to parallel, my guess is it would probably work better. What we really need here is a bounded queue. The question is how to do that without hurting performance in cases where backpressure isn't needed.
I feel like the searching thread's performance would be I/O-bound most of the time (input being filesystem traversal and output being to --exec or stdout or whatnot), and hence some simple "if over 10000 items in queue then block adding to it until the next item is consumed" kind of statement slapped right before the "add to queue" part would probably not introduce too much of a performance drawback.
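The "block right before adding to the queue" idea could be sketched as a counter-gated wrapper around an unbounded channel, using a Mutex and Condvar to make senders wait. All the names and the limit here are hypothetical, not from fd:

```rust
use std::sync::mpsc::{self, Receiver, Sender};
use std::sync::{Arc, Condvar, Mutex};

// Shared pending-item counter that gates senders once `limit` is reached.
struct BackpressureGate {
    len: Mutex<usize>,
    not_full: Condvar,
    limit: usize,
}

struct GatedSender<T> {
    tx: Sender<T>,
    gate: Arc<BackpressureGate>,
}

struct GatedReceiver<T> {
    rx: Receiver<T>,
    gate: Arc<BackpressureGate>,
}

fn gated_channel<T>(limit: usize) -> (GatedSender<T>, GatedReceiver<T>) {
    let (tx, rx) = mpsc::channel();
    let gate = Arc::new(BackpressureGate {
        len: Mutex::new(0),
        not_full: Condvar::new(),
        limit,
    });
    (
        GatedSender { tx, gate: Arc::clone(&gate) },
        GatedReceiver { rx, gate },
    )
}

impl<T> GatedSender<T> {
    fn send(&self, item: T) {
        let mut len = self.gate.len.lock().unwrap();
        // Block right before "add to queue" once `limit` items are pending.
        while *len >= self.gate.limit {
            len = self.gate.not_full.wait(len).unwrap();
        }
        *len += 1;
        self.tx.send(item).unwrap();
    }
}

impl<T> GatedReceiver<T> {
    fn recv(&self) -> Option<T> {
        let item = self.rx.recv().ok()?;
        // Consuming an item frees a slot and wakes one blocked sender.
        let mut len = self.gate.len.lock().unwrap();
        *len -= 1;
        self.gate.not_full.notify_one();
        Some(item)
    }
}

fn main() {
    let (tx, rx) = gated_channel::<usize>(10_000);
    std::thread::spawn(move || {
        for i in 0..50_000 {
            tx.send(i); // blocks whenever 10,000 items are pending
        }
    });
    let mut count = 0;
    while rx.recv().is_some() {
        count += 1;
    }
    println!("received {count} items");
}
```

This is essentially what sync_channel does internally, so in practice one would benchmark against sync_channel rather than hand-roll the gate; the sketch just shows that the scheme is simple.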
In any case, this feels like a necessary feature for the wellbeing of the host machine, because it seems like this excessive memory usage applies to any general case where such output is slower than such input. To demonstrate, I have a silly script, named delaystream.sh:
#!/bin/sh
while read -r line; do
  echo "$line"
  sleep 1
done
This script prints each line it receives on stdin to stdout, but waits 1 second between them, causing artificially slow I/O. So I piped fd to it:
fd / | delaystream.sh
This reliably makes fd balloon in memory usage.
I argue some kind of a limit is absolutely necessary here, and is more important than the performance from not having to account for it.
Maybe it would be possible to introduce a command line switch that disables the limit for those who wish to have the performance benefits and know ballooning wouldn't happen?
I feel like the searching thread's performance would be I/O-bound most of the time (input being filesystem traversal and output being to --exec or stdout or whatnot), and hence some simple "if over 10000 items in queue then block adding to it until the next item is consumed" kind of statement slapped right before the "add to queue" part would probably not introduce too much of a performance drawback.
Ideally yes, but: https://github.com/sharkdp/fd/issues/918#issuecomment-1146299136
In any case, this feels like a necessary feature for the wellbeing of the host machine
I agree, we need to do something. I can try switching to crossbeam-channels again.
Oh, sorry, missed that.
Would it be possible to slap something like Either<channel, sync_channel> in the code and operate on whichever one it is, have the user pick between them via command-line switches, and default to the bounded queue for safety, or at least let the user pick the bounded one if they wish? This feels like an acceptable strategy until a better solution is found, and at least makes it possible to use fd in use cases like mine.
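That Either idea could be sketched as an enum over std's two sender types, so the bound can be chosen at runtime (e.g. from a CLI flag). The names here are hypothetical, not fd's internals:

```rust
use std::sync::mpsc::{self, Receiver, SendError, Sender, SyncSender};

// Wraps either an unbounded or a bounded sender behind one interface.
enum EitherSender<T> {
    Unbounded(Sender<T>),
    Bounded(SyncSender<T>),
}

impl<T> EitherSender<T> {
    fn send(&self, item: T) -> Result<(), SendError<T>> {
        match self {
            // Never blocks; the queue can grow without limit.
            EitherSender::Unbounded(tx) => tx.send(item),
            // Blocks when the queue is full, providing backpressure.
            EitherSender::Bounded(tx) => tx.send(item),
        }
    }
}

/// `bound: None` selects the unbounded channel, `Some(n)` the bounded one.
fn make_channel<T>(bound: Option<usize>) -> (EitherSender<T>, Receiver<T>) {
    match bound {
        None => {
            let (tx, rx) = mpsc::channel();
            (EitherSender::Unbounded(tx), rx)
        }
        Some(n) => {
            let (tx, rx) = mpsc::sync_channel(n);
            (EitherSender::Bounded(tx), rx)
        }
    }
}

fn main() {
    // Defaulting to the bounded queue for safety, as suggested above.
    let (tx, rx) = make_channel::<u32>(Some(1024));
    for i in 0..10 {
        tx.send(i).unwrap();
    }
    drop(tx);
    let total: u32 = rx.iter().sum();
    println!("sum = {total}");
}
```

The receiving side is unchanged either way, since both constructors hand back the same Receiver type; only the sending half needs the enum.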
When traversing a huge directory tree (5TB of data / 200 million entries) I noticed that fd will use excessive amounts of memory (I killed it at 35GB RSS). Some investigation showed that 'fd | wc -l' or 'fd >/dev/null' does not have this problem, while 'fd | less' again shows the same problem.
So my guess is that fd uses some unbounded queues to send the output, which pile up because the terminal emulator is too slow to print 200 million entries in time.