sharkdp / fd

A simple, fast and user-friendly alternative to 'find'

excessive memory usage on huge trees #918

Closed: cehteh closed this issue 1 year ago

cehteh commented 2 years ago

When traversing some huge directory tree (5TB of data / 200 million entries) I noticed that 'fd' will use excessive amounts of memory (I killed it at 35GB RSS). Some investigation showed that 'fd | wc -l' or 'fd >/dev/null' does not have this problem, while 'fd | less' again shows the same problem.

So my guess is that fd uses some unbounded queues to send the output, which pile up because the terminal emulator is too slow to print 200 million entries in time.
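
For illustration, here is a minimal standalone Rust sketch (not fd's actual code) of the suspected behaviour: with an unbounded std::sync::mpsc channel, the producer never blocks, so every entry the slow consumer has not yet drained sits in memory. Paths and counts are made up.

use std::sync::mpsc;
use std::thread;
use std::time::Duration;

fn main() {
    // Unbounded channel: send() never blocks.
    let (tx, rx) = mpsc::channel::<String>();

    // Producer: stands in for the directory walker, emitting results as fast as it can.
    let producer = thread::spawn(move || {
        for i in 0..1_000_000u64 {
            tx.send(format!("/some/path/file-{}", i)).unwrap();
        }
    });

    // Consumer: stands in for a slow terminal or pager. While it dawdles,
    // the pending entries pile up in the channel and RSS keeps growing.
    for path in rx {
        println!("{}", path);
        thread::sleep(Duration::from_millis(1));
    }

    producer.join().unwrap();
}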

tmccombs commented 2 years ago

Which version of fd are you using?

cehteh commented 2 years ago

$ fd --version
fd 8.3.0

installed by cargo install

tmccombs commented 2 years ago

Ah, I thought that we had set a limit on the channel size in #885, but it looks like that part got removed. So, yes, there is an unbounded queue of files and directories to process, and that could cause high memory usage if the terminal can't keep up. Maybe we should revisit adding a bound to that. I believe with the std mpsc crate there is a performance cost to doing that, I'm not sure about crossbeam.
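
For reference, a rough sketch of the two bounded options being weighed here; the capacity is illustrative, and the crossbeam variant assumes the crossbeam-channel crate as a dependency. This is not fd's actual code.

use std::sync::mpsc;

fn main() {
    // std: send() on a SyncSender blocks once 4096 entries are queued,
    // which is exactly the backpressure an unbounded channel() lacks.
    let (tx, rx) = mpsc::sync_channel::<String>(4096);
    tx.send("some/path".to_string()).unwrap();
    drop(tx);
    for entry in rx {
        println!("{}", entry);
    }

    // crossbeam: same blocking semantics, often faster under contention.
    let (tx, rx) = crossbeam_channel::bounded::<String>(4096);
    tx.send("another/path".to_string()).unwrap();
    drop(tx);
    for entry in rx {
        println!("{}", entry);
    }
}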

sharkdp commented 2 years ago

but it looks like that part got removed

Right. I removed it because you wrote "I switch back to using channel instead of sync_channel than it is, at least not any slower". And I believe I was able to confirm this in my own benchmarks.

Maybe we should revisit adding a bound to that. I believe with the std mpsc crate there is a performance cost to doing that, I'm not sure about crossbeam.

Yes, let's do this in a dedicated PR with new benchmark results (and memory profiles). A quick fix could be to use a much higher limit than the one suggested in #885.

cehteh commented 2 years ago

Hint: in my own directory traversal code I found that about 4k entries per thread (for the memory calculation; not one channel per thread) is somewhat of a sweet spot. Actually, the limit isn't terribly important as long as it doesn't explode; even with a 100MB limit it should be OK-ish.
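
A tiny sketch of that sizing rule (the function name and numbers are illustrative, not anything fd currently does):

use std::sync::mpsc;

// Roughly 4k queued entries per worker thread; even a much larger bound
// would be fine, the point is only that the queue cannot grow without limit.
fn channel_capacity(num_threads: usize) -> usize {
    4096 * num_threads
}

fn main() {
    let threads = 8;
    let (tx, rx) = mpsc::sync_channel::<String>(channel_capacity(threads));
    tx.send("example entry".to_string()).unwrap();
    drop(tx);
    for entry in rx {
        println!("{}", entry);
    }
}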

mwaitzman commented 2 years ago

Is this the cause of fd outputting "Killed" and then quitting when I search for all files with a specific extension from the root (Arch Linux, ~9/~14TB used space to search through across 5 mounted partitions)?

tavianator commented 2 years ago

@mwaitzman It might be, do you see anything about the OOM killer in dmesg?

mwaitzman commented 2 years ago

Yes, I see this when running a similar command:

[ 4279.148487] Out of memory: Killed process 7330 (fd) total-vm:61310736kB, anon-rss:24569660kB, file-rss:0kB, shmem-rss:0kB, UID:0 pgtables:111272kB oom_score_adj:0
[ 4285.522887] oom_reaper: reaped process 7330 (fd), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB

tavianator commented 2 years ago

Okay yeah, I think this is probably due to the lack of backpressure with the unbounded channel. It's unfortunate that std::sync::mpsc::sync_channel() (the bounded queue) performs so much worse than channel() (unbounded).

The SPSC case (-j1) is 10-40% worse with bounded queues, and the performance is worse for higher capacities, so tuning the capacity is crucial. With a capacity of 4096 * threads, bounded queues match unbounded queues at -j8 but not with fewer threads.

I'd like to figure out https://github.com/sharkdp/fd/issues/933.

cehteh commented 2 years ago

On 2022-06-03 12:37, Tavian Barnes wrote:

Okay yeah, I think this is probably due to the lack of backpressure with the unbounded channel. It's unfortunate that std::sync::mpsc::sync_channel() (the bounded queue) performs so much worse than channel() (unbounded).

The SPSC case (-j1) is 10-40% worse with bounded queues, and the performance is worse for higher capacities, so tuning the capacity is crucial. With a capacity of 4096 * threads, bounded queues match unbounded queues at -j8 but not with fewer threads.

Just an observation I did with other code:

Walking directories can be very fast when they are cached in RAM (obviously), but otherwise it is pretty much I/O-bound. And here lies the trick: since you are not CPU-bound but I/O-bound, you can have a lot more threads than you have cores, which in turn gives the kernel the opportunity to schedule requests in a much better way. I've tested up to 128 threads on a 12/24 core machine and still seen improvements (in my code, not the 'fd' tree walking code). This can by far outweigh the performance degradation you get from bounded channels. (A rough sketch of this idea follows below.)

I'd like to figure out https://github.com/sharkdp/fd/issues/933.

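A rough sketch of the oversubscription idea described above (not fd's walker): give each top-level subdirectory its own thread, which on a wide tree easily exceeds the core count; for I/O-bound traversal that is usually fine. A real implementation would cap the thread count, and the root path here is a placeholder.

use std::fs;
use std::path::{Path, PathBuf};
use std::thread;

// Recursively print every file beneath `dir` (errors are silently skipped).
fn walk(dir: &Path) {
    if let Ok(entries) = fs::read_dir(dir) {
        for entry in entries.flatten() {
            let path = entry.path();
            if path.is_dir() {
                walk(&path);
            } else {
                println!("{}", path.display());
            }
        }
    }
}

fn main() {
    let root = PathBuf::from("/some/root"); // placeholder root
    let mut handles = Vec::new();

    if let Ok(entries) = fs::read_dir(&root) {
        for entry in entries.flatten() {
            let path = entry.path();
            if path.is_dir() {
                // One thread per top-level subdirectory: far more threads than
                // cores on a wide tree, letting the kernel overlap the disk I/O.
                handles.push(thread::spawn(move || walk(&path)));
            } else {
                println!("{}", path.display());
            }
        }
    }

    for handle in handles {
        handle.join().unwrap();
    }
}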

Architector4 commented 1 year ago

Experiencing the exact same issue, but with --exec. I'm trying to run fd "png$" / --exec optipng -o 7 (I know that this usage of optipng may break stuff, I know what I'm doing).

I've tried leaving different variations running on my laptop for the night, finding it absolutely frozen every time I wake up, and arrived at fd "png$" --threads=1 --exec sh -c 'optipng -o 7 "{}"; sync; sleep 1m' to try to minimize the impact. At this point I've checked with a task manager, and it's indeed fd ballooning.

Currently on fd 8.4.0 from Arch Linux repositories. Is there anything I can do on my end to do what I want, besides resorting to find?

tmccombs commented 1 year ago

@Architector4 you might have some luck filtering out some directories. It looks like you are running on /, so excluding /proc, /sys, /tmp, etc. would likely make a difference (if you are on macOS those paths might be different). If the files you care about are all on the same filesystem, you could try the --one-file-system option.

Architector4 commented 1 year ago

@tmccombs I don't think it would make any meaningful difference in my case -- I'm looking for all files that end with "png", of which there would not be any in such special directories. I imagine (I hope?) that fd does not put files not matching the search terms into that ballooning cache.

I'd also prefer to go through all filesystems (my SSD mounted on / and my HDD mounted in /media), so I don't want to use that option either. But yeah, those are good ways to help alleviate the problem for other cases where applicable.

In my case I decided to resort to find and have had good success with it. This is what I ended up running to achieve the same task without running out of memory, with GNU find and parallel:

find / -iname "*.png" -print0 | parallel -0 optipng -o 7 {}

tmccombs commented 1 year ago

Oh, I think you're right. In your case what is probably happening is that the queues are filling up because running optipng blocks the processing of additional entries, and there isn't any backpressure to stop new entries from being added to the queue. If you were to pipe fd to parallel, my guess is it would probably work better. What we really need here is a bounded queue. The question is how to do that without hurting performance in cases where backpressure isn't needed.

Architector4 commented 1 year ago

I feel like the searching thread's performance would be I/O-bound most of the time (input being filesystem traversal and output going to --exec or stdout or whatnot), and hence some simple "if over 10000 items in queue then block adding to it until the next item is consumed" kind of statement slapped right before the "add to queue" part would probably not introduce too much of a performance drawback. (A rough sketch of this follows at the end of this comment.)

In any case, this feels like a necessary feature for the well-being of the host machine, because this excessive memory usage seems to apply to any case where the output is slower than the input. To demonstrate, I have a silly script, named delaystream.sh:

#!/bin/sh
# Echo each line of stdin, one per second, to simulate a slow consumer.
while read -r line; do
    echo "$line"
    sleep 1
done

This script prints each line it receives on stdin to stdout, but waits 1 second between them, causing artificially slow I/O. So I piped fd to it:

fd / | delaystream.sh

This reliably makes fd balloon in memory usage.

I argue that some kind of limit is absolutely necessary here, and that it is more important than the performance gained by not having to account for it.

Maybe it would be possible to introduce a command line switch that disables the limit for those who wish to have the performance benefits and know ballooning wouldn't happen?
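
As an illustration of that "block when the queue is full" idea, a minimal sketch (not fd's code) using std's bounded sync_channel: once the queue holds the proposed 10000 entries, the producer simply blocks until the slow consumer catches up, so memory stays bounded.

use std::sync::mpsc;
use std::thread;
use std::time::Duration;

fn main() {
    // Bounded channel: at most 10_000 entries can be queued at once.
    let (tx, rx) = mpsc::sync_channel::<String>(10_000);

    // Producer: once the queue is full, send() blocks until the consumer
    // takes an entry, so memory stays bounded however slow the output is.
    let producer = thread::spawn(move || {
        for i in 0..1_000_000u64 {
            tx.send(format!("/some/path/file-{}", i)).unwrap();
        }
    });

    // Consumer: deliberately slow, like a terminal, pager, or --exec command.
    for path in rx {
        println!("{}", path);
        thread::sleep(Duration::from_millis(1));
    }

    producer.join().unwrap();
}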

tavianator commented 1 year ago

I feel like the searching thread's performance would be I/O-bound most of the time (input being filesystem traversal and output going to --exec or stdout or whatnot), and hence some simple "if over 10000 items in queue then block adding to it until the next item is consumed" kind of statement slapped right before the "add to queue" part would probably not introduce too much of a performance drawback.

Ideally yes, but: https://github.com/sharkdp/fd/issues/918#issuecomment-1146299136

In any case, this feels like a necessary feature for wellbeing of the host machine

I agree, we need to do something. I can try switching to crossbeam-channels again.

Architector4 commented 1 year ago

Oh, sorry, missed that.

Would it be possible to slap something like Either<channel, sync_channel> into the code and operate on whichever one it is, have the user pick between them via a command-line switch, and default to the bounded queue for safety (or at least let the user pick the bounded one if they wish)? This feels like an acceptable strategy until a better solution is found, and it would at least make it possible to use fd in use cases like mine.
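
A rough sketch of that Either-style approach with std channels; the runtime flag and names are hypothetical and this is not fd's actual code.

use std::sync::mpsc::{self, Receiver, SendError, Sender, SyncSender};

// Wraps either an unbounded Sender or a bounded SyncSender behind one type.
enum AnySender<T> {
    Unbounded(Sender<T>),
    Bounded(SyncSender<T>),
}

impl<T> AnySender<T> {
    fn send(&self, value: T) -> Result<(), SendError<T>> {
        match self {
            AnySender::Unbounded(tx) => tx.send(value),
            // Blocks when the queue is full, providing backpressure.
            AnySender::Bounded(tx) => tx.send(value),
        }
    }
}

// Both channel flavours share the same Receiver type, so only the sender
// needs to be wrapped.
fn make_channel<T>(bounded: bool, capacity: usize) -> (AnySender<T>, Receiver<T>) {
    if bounded {
        let (tx, rx) = mpsc::sync_channel(capacity);
        (AnySender::Bounded(tx), rx)
    } else {
        let (tx, rx) = mpsc::channel();
        (AnySender::Unbounded(tx), rx)
    }
}

fn main() {
    // Default to the bounded variant for safety, as proposed above; a
    // hypothetical command-line switch could flip this to `false`.
    let (tx, rx) = make_channel::<String>(true, 4096);
    tx.send("entry".to_string()).unwrap();
    drop(tx);
    for entry in rx {
        println!("{}", entry);
    }
}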