gyscos / zstd-rs

A Rust binding for the zstd compression library.
MIT License

Creating self-contained compressed sequences. #218

Open Zackaryia opened 1 year ago

Zackaryia commented 1 year ago

In zstd, to my knowledge, a frame is completely independently decompressible. This leads me to the question: is there a way to easily set where a frame starts / ends manually, and to control what data goes in which frame? For example (pseudocode):

    file.compress_write("Foo") // Frame 0 data
    file.new_frame()
    file.compress_write("Bar") // Frame 1 data

This way you could do:

    file.read_frame(0) // "Foo"
    file.frame_offsets(0) // (123, 456)

This would be useful for retrieving data from a file without needing to decompress the whole file. Having the byte offsets of an individual frame, so that just that one frame can be read and decompressed at a later point in time, would be extremely useful for pulling data out of a large file.

    file_at_a_later_point_in_time.read(from_byte: 123, to_byte: 456).decompress() // "Lots of data!!"
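Something close to this seems achievable already by giving each record its own frame and keeping an index of byte ranges by hand. A rough, untested sketch, assuming the zstd::bulk one-shot helpers (the names and sizes below are just illustrative):

    fn main() -> std::io::Result<()> {
        let records = ["Foo", "Bar"];
        let mut file = Vec::new();        // stands in for the output file
        let mut offsets = Vec::new();     // (start, end) byte range of each frame

        for record in records {
            // Each call produces one complete, self-contained zstd frame.
            let frame = zstd::bulk::compress(record.as_bytes(), 3)?;
            let start = file.len();
            file.extend_from_slice(&frame);
            offsets.push((start, file.len()));
        }

        // Later: decompress only frame 1 ("Bar") from its recorded byte range.
        let (start, end) = offsets[1];
        let decoded = zstd::bulk::decompress(&file[start..end], 1024)?;
        assert_eq!(decoded, b"Bar".to_vec());
        Ok(())
    }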
gyscos commented 1 year ago

Note that most zst-compressed files are going to store a single frame (for optimal compression ratio). If you want to concatenate several frames and need random access to them, you may need something more similar to an archive format (like tar, or even zip, where some implementations include zstd compression).
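For illustration, a minimal sketch of that archive route using the tar crate (a separate dependency), with each entry's contents compressed into its own zstd frame; the entry names here are made up and the sketch is untested:

    fn main() -> std::io::Result<()> {
        let mut builder = tar::Builder::new(Vec::new());

        for (name, payload) in [("records/0.zst", "Foo"), ("records/1.zst", "Bar")] {
            // One self-contained zstd frame per archive entry.
            let compressed = zstd::bulk::compress(payload.as_bytes(), 9)?;
            let mut header = tar::Header::new_gnu();
            header.set_size(compressed.len() as u64);
            header.set_cksum();
            builder.append_data(&mut header, name, &compressed[..])?;
        }

        let archive = builder.into_inner()?;
        // Any tar reader can later locate "records/1.zst" and decompress only that entry.
        println!("archive size: {} bytes", archive.len());
        Ok(())
    }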

Smotrov commented 1 year ago

I thought that flush() on Encoder would force a new frame... but it does not. Here is my failed attempt.

    // `s` holds the input text and `splitted_time` is a running total,
    // both defined outside this snippet.
    let start = std::time::Instant::now();
    // A vector of tuples to store the start and end offsets of each line
    let mut index = Vec::new();

    // A cursor to write the compressed data to
    let mut cursor = Cursor::new(Vec::new());
    let mut zstd_encoder = zstd::stream::write::Encoder::new(&mut cursor, 9)?;
    zstd_encoder.multithread(10)?;

    for line in s.lines() {
        let offset_start = zstd_encoder.get_ref().position();

        zstd_encoder.write_all(line.as_bytes())?;
        zstd_encoder.flush()?;

        let offset_end = zstd_encoder.get_ref().position();

        index.push((offset_start, offset_end));
        // Print the start and end offsets of each line
        println!("start: {}, end: {}", offset_start, offset_end);
    }
    zstd_encoder.finish().unwrap();

    // Get the compressed data from the cursor
    let compressed_bundle = cursor.into_inner();
    println!(
        "The length of the compressed data is: {}",
        compressed_bundle.len()
    );

    let duration = start.elapsed();
    splitted_time += duration.as_millis();

    println!(
        "Result of splitted compression {} lines: {}MB in {:?}",
        index.len(),
        compressed_bundle.len() as f64 / 1024.0 / 1024.0,
        splitted_time as f64 / 1000.0
    );

    // Now let's decompress line number 0
    let line = 0;
    let offset_start = index[line].0 as usize;
    let offset_end = index[line].1 as usize;
    println!(
        "from {}, to {} of {}",
        offset_start,
        offset_end,
        compressed_bundle.len()
    );
    let mut file_chunk = &compressed_bundle[offset_start..offset_end];
    // Print the length of the file chunk
    println!("Length of file chunk: {}", file_chunk.len());
    // Create a decoder over the sliced compressed data
    let decoder = zstd::stream::Decoder::new(&mut file_chunk)?;
    // Read the decompressed data line by line
    let reader = BufReader::new(decoder);
    for line in reader.lines() {
        let line = line.unwrap();
        // Print the decompressed data
        println!("Decompressed data: {}", line);
        // Exit after the first line
        break;
    }
thread 'main' panicked at 'called Result::unwrap() on an Err value: Custom { kind: UnexpectedEof, error: "incomplete frame" }'

So it seems like it is impossible...

However, there are discussions.
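What does seem to work is creating a fresh Encoder per line and calling finish() on it, so every line ends up in its own frame and the cursor positions mark real frame boundaries. A rough, untested sketch:

    use std::io::Write;

    fn compress_lines(s: &str) -> std::io::Result<(Vec<u8>, Vec<(u64, u64)>)> {
        let mut cursor = std::io::Cursor::new(Vec::new());
        let mut index = Vec::new();

        for line in s.lines() {
            let offset_start = cursor.position();
            // A fresh encoder per line: finish() ends the frame and flushes it,
            // so the recorded positions cover exactly one complete frame.
            let mut encoder = zstd::stream::write::Encoder::new(&mut cursor, 9)?;
            encoder.write_all(line.as_bytes())?;
            encoder.finish()?;
            index.push((offset_start, cursor.position()));
        }

        Ok((cursor.into_inner(), index))
    }

    // Later, reading a single line only needs its recorded byte range:
    // let (start, end) = index[0];
    // let bytes = zstd::stream::decode_all(&compressed[start as usize..end as usize])?;

Each frame carries its own header, so wrapping very small records like this costs some compression ratio, and no redundancy is shared across frames.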

jadamcrain commented 7 months ago

I'm having the same issue. In other compression libraries that implement Write on their Encoder, flush completes a frame, so you know that what's written so far can be incrementally decoded, e.g. in flate2:

https://docs.rs/flate2/1.0.28/src/flate2/gz/write.rs.html#148

I'd love to be able to use zstd for its improved efficiency, but I really need to be able to ensure that what I have written at certain checkpoints can be fully decoded.
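In the meantime, since zstd's flush() only flushes the current block and leaves the frame open, the checkpoint pattern can apparently be emulated by finishing the frame and opening a new Encoder over the returned writer. A small untested helper along these lines (the 'static parameter matches recent zstd-rs versions where Encoder carries a dictionary lifetime):

    use std::io::{self, Write};
    use zstd::stream::write::Encoder;

    // Ends the current frame and starts a new one on the same underlying writer,
    // so everything written before the checkpoint can be fully decoded.
    fn checkpoint<W: Write>(encoder: Encoder<'static, W>, level: i32)
        -> io::Result<Encoder<'static, W>>
    {
        let writer = encoder.finish()?;   // closes the current frame
        Encoder::new(writer, level)       // opens a fresh frame
    }

    fn main() -> io::Result<()> {
        let mut encoder = Encoder::new(Vec::new(), 3)?;
        encoder.write_all(b"first batch")?;
        encoder = checkpoint(encoder, 3)?;   // "first batch" is now decodable on its own
        encoder.write_all(b"second batch")?;
        let bytes = encoder.finish()?;
        // The concatenated frames should still decode back as one stream:
        let all = zstd::stream::decode_all(&bytes[..])?;
        assert_eq!(all, b"first batchsecond batch".to_vec());
        Ok(())
    }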

unitythemaker commented 3 months ago

Update: I got around this by switching to facebook/zstd + C + Emscripten. I am now using it in my own project and am able to compress/decompress without needing to send all chunks.


I really need this. I thought I was doing something wrong; it seems that after much debugging I wasn't. I am building a peer-to-peer file transfer app where I wanted to integrate zstd compression. I compress while streaming and decompress those chunks on the other side, but I always got an "incomplete frame" error no matter what I tried. I thought about compressing each chunk individually, but I don't have enough knowledge here; I guess this would result in a very poor compression ratio, right?

Is this a bad approach with zstd? Should I use a different algorithm? Can someone enlighten me on this?
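Compressing each chunk as its own frame does cost some ratio, since a frame cannot reference data from earlier chunks, but a shared dictionary trained on representative data usually recovers much of it. A rough, untested sketch assuming the zstd::bulk dictionary constructors; the dictionary bytes below are just a stand-in for one trained with zstd::dict::from_samples or `zstd --train`:

    fn main() -> std::io::Result<()> {
        // Stand-in for a real dictionary: in practice, train it on samples
        // that resemble the chunks being transferred.
        let dict = b"shared protocol headers and other bytes common to every chunk ".repeat(64);

        // Each chunk becomes its own frame, so the receiver can decode chunks
        // as they arrive; the shared dictionary claws back much of the ratio.
        let mut compressor = zstd::bulk::Compressor::with_dictionary(3, &dict)?;
        let mut decompressor = zstd::bulk::Decompressor::with_dictionary(&dict)?;

        for chunk in ["first chunk of the file", "second chunk of the file"] {
            let frame = compressor.compress(chunk.as_bytes())?;
            let back = decompressor.decompress(&frame, 1 << 20)?;
            assert_eq!(back, chunk.as_bytes().to_vec());
        }
        Ok(())
    }

Each chunk then decodes on its own as soon as it arrives, so there is no incomplete-frame error from cutting one long frame into pieces.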