paolobarbolini / bzip2-rs

Pure Rust bzip2 decoder
Apache License 2.0
44 stars 6 forks source link

Add support for seek-bzip style indexing and decompressing individual blocks #6

Open hippietrail opened 3 years ago

hippietrail commented 3 years ago

Bzip2 files can be randomly accessed if you can index which compressed bit offsets map to which uncompressed byte offsets.

It seems that the original seek-bzip was on BitBucket and was taken down upon the author's untimely death last year. But there is a fork on GitHub: https://github.com/cscott/seek-bzip and also a JavaScript implementation: https://github.com/galaxyproject/seek-bzip2

I've tried looking at the code but I'm too new to rust. I guess it can already decompress a single block or a range of blocks pretty trivially. I was looking at the test code in block/mod.rs

For supporting bzip-table, we need to do most of the decompression of each block the output the a mapping from the bit offset where each block begins in the compressed file, to either the byte offset of the length in bytes of that block in the raw file. Since we're throwing away the decompressed blocks as we're only interested in their lengths, there's probably a shortcut possible where we don't have to fully decompress the block.

paolobarbolini commented 3 years ago

Thanks for the detailed explaination. I didn't realize seeking bzip2 decompression was possible, and would indeed be very interesting for this crate to implement for a future release.

I have some parts of the block decoder I'd like to refactor, making it easier to implement features like this one, so we could get back to this after I get a chance to do that.

hippietrail commented 3 years ago

No worries. It's not that widely known but there are quite a few projects that use this functionality that I found last time I looked around. I might put a feature request in on the C FFI bzip2 module as well but yours is faster - thanks for making it. Random access is possible in gzip too but in a quite different way, but I haven't looked at the Rust gzip crates yet.

hippietrail commented 3 years ago

Here's my quick attempt at adding bzip-table as a test in src/decoder/block/mod.rs

I'm not good enough yet at either Rust or the codebase to integrate it better or to shortcut getting just the length of the output block without fully decompressing it, assuming that is possible. It tests the first few entries of the current latest English Wiktionary dump file (non-multistream) here: https://dumps.wikimedia.org/enwiktionary/20210701/

It's quite slow as a debug build but as a release build I think it's already faster than existing bzip-table implementations! (Actually I had to edit a couple of times. Seems I broke it between testing it and posting it here. Oops!)

    use std::fs::File;
    use std::io::Read;
    use std::io::BufReader;

    #[test]

    fn bziptable() {
        let file = File::open("../Downloads/enwiktionary-20210701-pages-articles.xml.bz2").unwrap();
        let mut br = BufReader::new(file);
        let mut compressed = vec![0; 1024 * 1024 * 818];
        br.read_exact(&mut compressed).unwrap();

        let header = Header::parse(compressed[..4].try_into().unwrap()).unwrap();
        println!("SEEK BZIP BLOCK SIZE: {}", header.raw_blocksize());

        let compressed = &compressed[4..];

        let mut bits = BitReader::new(&compressed);
        let mut reader = Block::new(header);

        let mut out = vec![0u8; 1024 * 1024 * 818];

        let mut table: Vec<(usize, usize)> = vec!();

        let mut i = 0;

        loop {
            //if i == 1000 { break }

            let bitpos_before = bits.position();
            reader.set_ready_for_read();
            if let Ok(bytelen) = reader.read(&mut bits, &mut out) {
                println!("SEEK BZIP TABLE: {}\t{}", 32 + bitpos_before, bytelen);
                table.push((32 + bitpos_before, bytelen));
            } else {
                break;
            }
            i += 1;
        }

        let expected: [(usize, usize); 10] = [
            (32, 899219),
            (1885718, 899712),
            (3785876, 901449),
            (5313257, 900445),
            (7093243, 900820),
            (8407991, 897722),
            (10199906, 899613),
            (12034482, 898887),
            (13990588, 899440),
            (15907858, 899643),
        ];
        assert_eq!(&table[..10], expected);
    }