uutils / coreutils

Cross-platform Rust rewrite of the GNU coreutils
https://uutils.github.io/
MIT License

sort opens too many files #5714

Open sylvestre opened 10 months ago

sylvestre commented 10 months ago

With our implementation:

mkdir -p in

for i in $(seq 5); do
  echo $i >in/$i
done

(seq 6 && echo 6) >exp
echo 6 >out
(exec 3<&- 4<&- 5<&- 6<&- 7</dev/null 8<&7 9<&7 &&
 ulimit -n 10 &&
 sort -n -m --batch-size=7 -o out out in/1 in/2 in/3 in/4 in/5 out
)

returns:

sort: failed to open temporary file: Too many open files

GNU sort works without issue.

Tested by GNU in tests/sort/sort-merge-fdlimit.sh

4t8dd commented 9 months ago

I would like to fix this. Please assign this issue to me. Thanks.

cakebaker commented 9 months ago

@4t8dd sure, go ahead :) While @allaboutevemirolive opened a PR, they are currently not working on it according to https://github.com/uutils/coreutils/pull/5738#issuecomment-1870965171:

I have an exam coming up in the first month of 2024. Please feel free to create a new PR and close this one. If no new PR is issued, I will investigate this matter in my spare time. :)

4t8dd commented 9 months ago

OK. Thanks for your reply. I will try it. @allaboutevemirolive if you got the solution, let me know.

allaboutevemirolive commented 9 months ago

Here are my thoughts on this problem:

The issue is in coreutils/src/uu/sort/src/merge.rs, specifically in this line:

std::fs::copy(file_path, &copy_path).map_err(|error| SortError::OpenTmpFileFailed { error })?;

...because in the bash script above, we pass the -m flag to merge the files.

I've archived the text below because it turned out to be unrelated: the source and destination paths are different. However, the note in it could still be useful in the future.

If you look up `fs::copy` on Google, you'll find that people have trouble copying to the `same file` because it also copies the file's metadata. Yet, if you copy the file information to a `different file`, there are no issues. That's why the bash script works but still gives an error about opening a temporary file. But, actually, it's not an issue with opening a temporary file. I think the way `fs::copy` acts, throwing an error when copying to the same target, is deliberate, to avoid data corruption. I believe the correct approach is to handle copying file information to both:

- different file targets, and
- the same file target.

You can try looking at cp.rs to see how copying is done. I can't test it now, but I will when I have time.

That's first.

The second thought is that just by addressing the problem above, you may also solve the next problem, the Too many open files error, since sort throws two errors here.

sort: failed to open temporary file: Too many open files

Another thing is that my thoughts above might be wrong since increasing ulimit also works, but it is not a clean way to address this issue.
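
As an aside, here is a minimal Rust sketch (hypothetical, not the actual merge.rs code) of why the copy itself costs descriptors: std::fs::copy holds both the source and the destination open at the same time, so under a tight ulimit -n the copy can be the call that runs out of file descriptors.

use std::fs;
use std::io;
use std::path::Path;

// Hypothetical helper: stash a copy of the output file in a scratch location.
// fs::copy opens `output` for reading and `scratch` for writing at the same
// time; both descriptors stay open until the call returns, which is where
// EMFILE ("Too many open files") can surface when the fd budget is tight.
fn stash_output_copy(output: &Path, scratch: &Path) -> io::Result<u64> {
    fs::copy(output, scratch)
}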

4t8dd commented 9 months ago

@allaboutevemirolive Thanks for sharing.

I have not yet started to read the code, but I do have some findings about ulimit:

I ran some tests on Linux and macOS and also compared the system sort from both platforms. I found some interesting things.

macOS: ulimit -n 14 works. In my tests, 14 is the smallest value that works for batch size = 7. But when I set the batch size to 700000, it also works. This is crazy! And if I set n >= 14, the batch size can be any number and it will always work.

Linux: by default there are 25 open files. When I set the batch size to 7, n must be at least 10, or I get

sort: --batch-size argument ‘7’ too large
sort: maximum --batch-size argument with current rlimit is 6

This is interesting. If I set the batch size to 6, then n can be 9. So there is a formula here: n = batch_size + 3. When I try a large batch size, the behaviour is deterministic and follows this formula strictly. It not only gives the error but also reports the maximum batch size, for example:

sort: --batch-size argument ‘700000’ too large
sort: maximum --batch-size argument with current rlimit is 397

Here I set n to 400 and batch size to 700000.

For uutils coreutils on Linux and macOS:

It behaves the same on both platforms: n must be at least 16, but then the batch size can be any number. As long as n >= 16, you can set the batch size to anything. This is very bad.

#!/bin/bash

mkdir -p in

for i in $(seq 5); do
    echo $i >in/$i
done

(seq 6 && echo 6) >exp
echo 6 >out
(
    exec 3<&- 4<&- 5<&- 6<&- 7</dev/null 8<&7 9<&7 &&
        ulimit -n 16 &&
        lsof -p $$ | wc -l &&
        sort -n -m --batch-size=7000000 -o out out in/1 in/2 in/3 in/4 in/5 out
)

So the Linux (GNU) sort behaviour should be the expected one, even though I do not yet know why the formula holds there. The behaviour should be deterministic in any case. I will read the code to find out how it works.
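
For comparison, here is a rough sketch of the kind of up-front check that would reproduce GNU's behaviour (my own guess, assuming the libc crate; the function names and the reservation of exactly three descriptors are assumptions based on the observed n = batch_size + 3, not GNU's or uutils' actual code):

use libc::{getrlimit, rlimit, RLIMIT_NOFILE};

// Estimate the largest batch size the current rlimit allows, keeping three
// descriptors (stdin, stdout, stderr) in reserve as the observed formula
// suggests. The real reservation in GNU sort may differ.
fn max_batch_size_for_rlimit() -> u64 {
    let mut rl = rlimit { rlim_cur: 0, rlim_max: 0 };
    let rc = unsafe { getrlimit(RLIMIT_NOFILE, &mut rl) };
    let open_files = if rc == 0 { rl.rlim_cur } else { 1024 };
    open_files.saturating_sub(3)
}

// Validate --batch-size up front instead of failing in the middle of a merge.
fn check_batch_size(requested: u64) -> Result<u64, String> {
    let max = max_batch_size_for_rlimit();
    if requested > max {
        Err(format!(
            "--batch-size argument '{requested}' too large; \
             maximum --batch-size argument with current rlimit is {max}"
        ))
    } else {
        Ok(requested)
    }
}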

allaboutevemirolive commented 9 months ago

I looked again at the source code that creates a temp file to perform the copy operation, and I read the tempfile documentation. I think the current tempfile wrapper in coreutils (it uses the tempdir operation to create a temporary file, not the tempfile operation) has a resource leak, so the temp file is not closed properly after going out of scope.

That's why the Too many open files error shows up at runtime:

sort: failed to open temporary file: Too many open files

I'm not sure why a tempfile is needed in the first place to perform merging or copying. Maybe, if you have time, you can try tweaking the code so it uses proper tempfile operations.

Quote from tempfile docs:

tempfile will (almost) never fail to cleanup temporary resources. 
However TempDir and NamedTempFile will fail if their destructors don’t run. 
This is because tempfile relies on the OS to cleanup the underlying file, while TempDir and NamedTempFile rely on rust destructors to do so. 
Destructors may fail to run if the process exits through an unhandled signal interrupt (like SIGINT), or if the instance is declared statically (like with [lazy_static](https://github.com/rust-lang-nursery/lazy-static.rs/issues/62)), among other possible reasons.
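
As a small illustration of the distinction the docs draw (a sketch assuming the tempfile crate quoted above, not the coreutils code): an unnamed temp file is unlinked immediately, so the OS reclaims it as soon as its descriptor closes, while NamedTempFile and TempDir are only removed when their destructors run.

use std::io::Write;

fn demo() -> std::io::Result<()> {
    // Unnamed temp file: already unlinked, so the OS reclaims it as soon as
    // the descriptor closes, even if no destructor ever runs.
    let mut anon = tempfile::tempfile()?;
    writeln!(anon, "scratch data")?;

    // Named temp file: the path stays on disk until `named` is dropped and
    // its destructor deletes it.
    let mut named = tempfile::NamedTempFile::new()?;
    writeln!(named, "scratch data")?;
    Ok(())
}
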
4t8dd commented 8 months ago

I think I figured this out. In GNU sort:

merge at most NMERGE inputs at once;\n\
                            for more use temp files\n\

But here, we use temp files in more places:

  1. If the output file appears in the list of input files, it is replaced with a temp file and the content is cloned.
  2. When processing a batch, a temp file is used as an intermediate file; the smaller the batch size, the more intermediate files are needed.
  3. All the input files are opened for processing.

If we want to align with GNU, more code would need to change. Also, sort on macOS behaves differently from GNU, while our sort behaves the same on both platforms.

So my proposal is: leave the ulimit and batch-size tuning to the user; we just need to tell them whether it is safe to proceed, which is currently missing.

WesleyBlancoYuan commented 5 months ago

n = batch_size + 3

I guess the 3 already-open files are stdin, stdout, and stderr.

lavafroth commented 1 month ago

Hello there, after investigating the test tests/sort/sort-merge-fdlimit.sh, it appears that the error arises from replace_output_file_in_input_files in merge.rs.

We should try chunking the iterator of files first, then call this function on each chunk to ensure at least one temp file can be created per chunk.
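
A loose sketch of that idea (the helper names and the fd_budget parameter are hypothetical, not the real merge.rs API): split the inputs into chunks sized below the descriptor budget so each chunk still leaves room to create a temporary file.

use std::path::{Path, PathBuf};

// Placeholder for the per-chunk work that replace_output_file_in_input_files
// does in merge.rs: swap any occurrence of the output path for a temp copy.
fn replace_output_in_chunk(_chunk: &mut [PathBuf], _output: &Path) {
    // Hypothetical stand-in; the actual implementation lives in merge.rs.
}

// Walk the inputs in chunks small enough that each chunk still leaves at
// least one descriptor free for creating a temporary file.
fn process_in_chunks(inputs: &mut [PathBuf], output: &Path, fd_budget: usize) {
    let chunk_size = fd_budget.saturating_sub(1).max(1);
    for chunk in inputs.chunks_mut(chunk_size) {
        replace_output_in_chunk(chunk, output);
    }
}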