jorgecarleitao / parquet2

Fastest and safest Rust implementation of parquet. `unsafe` free. Integration-tested against pyarrow
Other
356 stars 59 forks source link

(fuzzing) infinite loop when reading corrupted file #185

Closed evanrichter closed 2 years ago

evanrichter commented 2 years ago

my fuzzer found a timeout that seemed like the Reader was not making progress. Debugging the root cause of that is difficult, usually involving perf and flamegraph to see what is executing. So I updated the harness to detect when Read or Seek operations have not changed the cursor for 8 times in a row, and turned that into a panic with a nice backtrace:

Running: artifacts/parse_parquet/minimized-from-b7067ad7c9c1eccf07c13b9c4bc7e2632a81b52f
thread '<unnamed>' panicked at 'no progress is being made! stuck at position 8294', fuzz_targets/parse_parquet.rs:57:14
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
==514077== ERROR: libFuzzer: deadly signal
    #16 0x55e0bed57a3a in parse_parquet::FuzzInput::update_pos::hdb1533ba951fdc69 parquet2/fuzz/fuzz_targets/parse_parquet.rs:49:13
    #17 0x55e0beceee91 in _$LT$parse_parquet..FuzzInput$u20$as$u20$std..io..Read$GT$::read::had9c967c82aff563 parquet2/fuzz/fuzz_targets/parse_parquet.rs:57:9
    #18 0x55e0beceee91 in std::io::default_read_exact::h6c7beff976e5d5c9 /rustc/29e4a9ee0253cd39e552a77f51f11f9a5f1c41e6/library/std/src/io/mod.rs:447:15
    #19 0x55e0beceee91 in std::io::Read::read_exact::h030b98de46ab9674 /rustc/29e4a9ee0253cd39e552a77f51f11f9a5f1c41e6/library/std/src/io/mod.rs:801:9
    #20 0x55e0beceee91 in std::io::impls::_$LT$impl$u20$std..io..Read$u20$for$u20$$RF$mut$u20$R$GT$::read_exact::h6de2fa17a867acf8 /rustc/29e4a9ee0253cd39e552a77f51f11f9a5f1c41e6/library/std/src/io/impls.rs:50:9
    #21 0x55e0beceee91 in std::io::impls::_$LT$impl$u20$std..io..Read$u20$for$u20$$RF$mut$u20$R$GT$::read_exact::hf247100c3f89b399 /rustc/29e4a9ee0253cd39e552a77f51f11f9a5f1c41e6/library/std/src/io/impls.rs:50:9
    #22 0x55e0beceee91 in _$LT$parquet_format_safe..thrift..protocol..compact..TCompactInputProtocol$LT$R$GT$$u20$as$u20$parquet_format_safe..thrift..protocol..TInputProtocol$GT$::read_byte::h0ac58e010e50e0bc /home/evan/.cargo/registry/src/github.com-1ecc6299db9ec823/parquet-format-safe-0.2.3/src/thrift/protocol/compact.rs:296:9
    #23 0x55e0bece934e in _$LT$parquet_format_safe..thrift..protocol..compact..TCompactInputProtocol$LT$R$GT$$u20$as$u20$parquet_format_safe..thrift..protocol..TInputProtocol$GT$::read_field_begin::h5aa01a533b92a341 /home/evan/.cargo/registry/src/github.com-1ecc6299db9ec823/parquet-format-safe-0.2.3/src/thrift/protocol/compact.rs:150:26
    #24 0x55e0bec89822 in _$LT$parquet_format_safe..parquet_format..PageHeader$u20$as$u20$parquet_format_safe..thrift..protocol..ReadThrift$GT$::read_from_in_protocol::h6514bce0c219fea8 /home/evan/.cargo/registry/src/github.com-1ecc6299db9ec823/parquet-format-safe-0.2.3/src/parquet_format.rs:5359:25
    #25 0x55e0bed524bf in parquet_format_safe::parquet_format::PageHeader::read_from_in_protocol::hb133d86822c81bf1 /home/evan/.cargo/registry/src/github.com-1ecc6299db9ec823/parquet-format-safe-0.2.3/src/parquet_format.rs:5252:5
    #26 0x55e0bed524bf in parquet2::read::page::reader::read_page_header::h0ca07523446e3c38 parquet2/src/read/page/reader.rs:174:23
    #27 0x55e0bed50d6c in parquet2::read::page::reader::build_page::h96ed3f97ee03913b parquet2/src/read/page/reader.rs:194:23
    #28 0x55e0bec6e9bc in parquet2::read::page::reader::next_page::h6e69bffb04b1f299 parquet2/src/read/page/reader.rs:187:5
    #29 0x55e0bec6e9bc in _$LT$parquet2..read..page..reader..PageReader$LT$R$GT$$u20$as$u20$core..iter..traits..iterator..Iterator$GT$::next::hbdc3275901bc08e3 parquet2/src/read/page/reader.rs:150:32
    #30 0x55e0bed59697 in parse_parquet::fuzz::h8a7951a5c6906cd8 parquet2/fuzz/fuzz_targets/parse_parquet.rs:100:25
    #31 0x55e0bed5c49b in rust_fuzzer_test_input

I'm not sure if this is a side-effect of fuzzing with mocked I/O (e.g. std::io::Cursor or my FuzzInput) and if a real std::fs::File implements read and read_exact in a way that allows parquet2 to terminate here.

Initial tests with the following read_exact produce infinite EOF! printed to the screen:

impl Read for FuzzInput<'_> {
    fn read_exact(&mut self, buf: &mut [u8]) -> std::io::Result<()> {
        if self.remaining_slice().len() < buf.len() {
            println!("EOF!");
            return Err(std::io::Error::new(
                std::io::ErrorKind::UnexpectedEof,
                "EOF",
            ));
        }

        self.read(buf).map(unit)
    }
}

Test case attached (not really a txt file, use with cargo-fuzz on my fork): minimized-from-b7067ad7c9c1eccf07c13b9c4bc7e2632a81b52f.txt

jorgecarleitao commented 2 years ago

Looking good!

I think that the issue is these lines on the fuzzer:

for page in get_page_iterator(column_meta, &mut reader, None, Vec::new(), 16 * 1024)? {
    let mut decompress_buffer = Vec::new();
    if let Ok(page) = page {
        let _page = decompress(page, &mut decompress_buffer);
    }
}

I think we want the following:

for maybe_page in get_page_iterator(column_meta, &mut reader, None, Vec::new(), 16 * 1024)? {
    let mut decompress_buffer = Vec::new();
    let page = maybe_page?;
    let _page = decompress(page, &mut decompress_buffer)?;
}

so that we stop the iteration on an error (in either getting the page or decompressing the page (thus the two ?)

evanrichter commented 2 years ago

that was the issue, false alarm :sweat_smile: