sharkdp / bat

A cat(1) clone with wings.
Apache License 2.0
48.72k stars 1.23k forks source link

files of infinite length cause bat to crash #636

Open rgreenblatt opened 5 years ago

rgreenblatt commented 5 years ago
bat /dev/zero
memory allocation of 34359738368 bytes failed
sharkdp commented 5 years ago

Thank you for reporting this. I can reproduce this.

I guess this is caused by the fact that we read every single line into a Vec<u8> buffer without any size restriction: https://github.com/sharkdp/bat/blob/187a3b9341f3965425e64c84b349769cf1eaffc6/src/inputfile.rs#L38-L40

treere commented 4 years ago

Hi, I'm working on a fix. The possible solutions are:

Which solution is better? What I should do with the highlight?

sharkdp commented 4 years ago

I'm working on a fix.

Sounds great.

Which solution is better? What I should do with the highlight?

I would have to experiment with this myself to answer your questions. I could imagine that this is not easy to fix. For example, we need to make sure to still support the non-buffering/"immediate-response" behavior of bat when reading from STDIN, etc.

split a line longer then X into small pieces

Without having thought too much about it, I believe that would be my first try. Not sure if we can feed a non-finished line to the highlighter.

Enselic commented 2 years ago

Anyone looking to fix this issue should take a look at this PR which was an attempt at doing just that: https://github.com/sharkdp/bat/pull/712

I am explicitly mentioning that so we can close the above PR to keep the PR inbox clean.

aarondill commented 1 year ago

I'd like to point out in case it has not yet been realized, this also affect when bat is piped into another program (ie bat /dev/zero | tr '\0' '0'). This is not the effect that the standard cat has when reading a file, breaking the drop-in replaceability of cat.

""" Whenever the output of bat goes to a non-interactive terminal, i.e. when the output is piped into another process or into a file, bat will act as a drop-in replacement for cat(1) and fall back to printing the plain file contents. """

deboard commented 8 months ago

Please assign this issue to me.

deboard commented 8 months ago

Interesting... With the machine I am on right now I get "memory allocation of 68719476736 bytes failed" then a core dump. I see this type in the rust docs: https://docs.rs/typenum/latest/i686-unknown-linux-gnu/typenum/consts/type.N68719476736.html pub type N68719476736 = NInt The original number was 34359738368, this is also a type: https://docs.rs/typenum/latest/i686-unknown-linux-gnu/typenum/consts/type.U34359738368.html

It looks like rust has expanded the max amount of memory that can fit inside a Vec since this was first reported (isn't this a Vec memory problem?).

einfachIrgendwer0815 commented 8 months ago

It looks like rust has expanded the max amount of memory that can fit inside a Vec since this was first reported (isn't this a Vec memory problem?).

I don't think it's a Vec memory problem because a line in an infinitely long file might as well be infinitely long (which is definitely true for /dev/zero). An infinitely long line would require an infinite amount of memory, no matter how much actually fits in a Vec (which likely depends on how much memory the system has?).

deboard commented 8 months ago

True.

It looks like if you call std::fs::metadata("/dev/zero").unwrap().len(), the len value is zero. It might be a good fix to check the length of a (any) file and if it is 0 don't try to read it in.

deboard commented 8 months ago

Hmm, well i have tried several different things. The latest being

                    if !path.is_file() {
                        // File does not exist or is a dir?
                        let fname = String::from(path.to_string_lossy());
                        return Err(format!("'{fname}' does not exist.").into());
                    }

                    let fname = String::from(path.to_string_lossy());
                    let meta = std::fs::metadata(&fname).unwrap();
                    if !meta.is_symlink() {
                        if  meta.len() == 0 {
                            // If the file is zero length do not attempt to open it.
                            // Files like /dev/zero will cause bat to crash or be killed.
                            return Err(format!("'{fname}' is a zero length file.").into());
                        }
                    }

But that breaks the test below bacause a zero length error is returned. But I test is the file exists and if it is a symlink

assets::tests::syntax_detection_for_symlinked_file' panicked at src/assets.rs

What I don't get is the file passed into Input::ordinary_file in that test seems to be a symlink... Any ideas?

einfachIrgendwer0815 commented 8 months ago

std::fs::metadata follows symlinks. So, when the file is a real file, std::fs::metadata returns that file's metadata. But when the file is a symlink, std::fs::metadata returns the metadata of the file (or directory) the symlink points to. std::fs::symlink_metadata should do the trick.

However, I'm not certain a zero-size check is sufficient since files under /proc report to have a size of 0 without being an infinite file. Additionally, real empty files will also have a size of 0 and bat shouldn't mind reading those. Though, apparently /dev/zero doesn't claim to be a real file, but a character device instead.

einfachIrgendwer0815 commented 8 months ago

A different idea could be the following: When the length of a line exceeds a certain limit, all it's content gets discarded, but reading continues until the end of line. On the next line, reading continues as usual. The line being too long could be replaced by a short note.

Imagine line 52 being too long to read:

 51   │ ...
 52 ! │ <line too long>
 53   │ ...

(the '!' would have the purpose of making clear that <line too long> is not actual content)

Pros:

Cons:

To not make bat hang forever, a second limit could be added. When that one is reached, bat would abort immediately.

What is your opinion on that approach?

deboard commented 8 months ago

Interseting. Right now I am looking at doing this, but that may be a better way, I'll see how your suggestion works. Because, just dont crash:

                    #[cfg(unix)]
                    {
                        let fname = String::from(path.to_string_lossy());
                        if fname == "/dev/zero" {
                            // /dev/zero seems to be a special case file for bat on unix.
                            // It is an empty infinate length char file.
                            // This makes it imposible to read in the first line.
                            // Reading in the first line is the first thing
                            // InputReader::new does.
                            // I hate hard coding things like this, hack...
                            // Files like /dev/tty cause bat to hang but
                            // not crash like /dev/zero.
                            return Err(format!("'{fname}' <EMPTY>.").into());
                        }
                    }
aarondill commented 8 months ago

if you choose to go the hard coding route, would it be possible to use std::os::unix::fs::MetadataExt.rdev to get the device number and check if it's 1:5 instead?

this would protect against files created using mknod, as well as permitting the (bad idea, but allowed) use case of /dev/zero not being the zero character file.

einfachIrgendwer0815 commented 8 months ago

if you choose to go the hard coding route, would it be possible to use std::os::unix::fs::MetadataExt.rdev to get the device number and check if it's 1:5 instead?

That should be reliable (https://www.kernel.org/doc/html/latest/admin-guide/devices.html), but this is a very specialized solution. There are other such files, /dev/random (1:8), /dev/urandom (1:9), or any symbolic link to one of these.

To my understanding, the title of this issue and the author's initial comment just point out a very extreme case of the actual problem. In fact, you don't need an infinite file to let bat run out of memory, it's enough to use a regular file that is bigger than the available memory. Therefore, this problem is not limited to Linux and not to just a handful of special files. That's why I think this issue should be tackled with a more general and platform independent solution.

deboard commented 8 months ago

It would be better to have the fix based on the actual memory problem rather than special cases. I'll keep poking at it.

deboard commented 7 months ago

Is there a requirement or is it the desired behaviour to page binary data? If not this is a possible fix:

        // try to read in 100 bytes, see what the content type is
        let mut hundred_bytes = vec![0; 100 as usize];
        reader.read(&mut hundred_bytes).ok();
        let content_type = if hundred_bytes.is_empty() {
            None
        } else {
            Some(content_inspector::inspect(&hundred_bytes))
        };

        let mut first_line = vec![];

        match content_type {
            Some(ContentType::BINARY) => {
                // NO-OP on binary data
                // /dev/zero falls under this category
                // So does a .jpeg file...:w
                // 
            },
            Some(ContentType::UTF_8) |
            Some(ContentType::UTF_8_BOM) => {
                reader.read_until(b'\n', &mut first_line).ok();
            },
            Some(ContentType::UTF_16LE) |
            Some(ContentType::UTF_16BE)  |
            Some(ContentType::UTF_32LE) |
            Some(ContentType::UTF_32BE) |
            None => /* Not sure about reading on None... */ {
                reader.read_until(0x00, &mut first_line).ok();
            },
        };
deboard commented 7 months ago

Actually, I think I can make an improvement on the code above.

deboard commented 7 months ago

I am seeing an issue with with the idea of creating the InputReader from any "OrdinaryFile", just read in a "first line" and expect everything will be OK. More analysis on the files attributes and its content type need to happen before reading in a first line. For example, some files do not have a "line" in the text file sort of context.