Open Zackaryia opened 1 year ago
Note that most zst-compressed files are going to store a single frame (for optimal compression ratio). If you want to concatenate several frames and need random access to them, you may need something more similar to an archive format (like tar, or even zip, where some implementations include zstd compression).
I thought that `flush()` on `Encoder` would force a new frame... but it does not.
Here is my failed attempt.
```rust
// `s` is the input string; `splitted_time` is a pre-existing u128 accumulator.
let start = std::time::Instant::now();
// A vector of tuples to store the start and end offsets of each line
let mut index = Vec::new();
// A cursor to write the compressed data to
let mut cursor = Cursor::new(Vec::new());
let mut zstd_encoder = zstd::stream::write::Encoder::new(&mut cursor, 9)?;
// Enable multithreaded compression once, before the loop
zstd_encoder.multithread(10)?;
for line in s.lines() {
    let offset_start = zstd_encoder.get_ref().position();
    zstd_encoder.write_all(line.as_bytes())?;
    zstd_encoder.flush()?;
    let offset_end = zstd_encoder.get_ref().position();
    index.push((offset_start, offset_end));
    // Print the offsets of each line
    println!("offset: {}, end: {}", offset_start, offset_end);
}
zstd_encoder.finish()?;
// Get the compressed data from the cursor
let compressed_bundle = cursor.into_inner();
println!(
    "The length of the compressed data is: {}",
    compressed_bundle.len()
);
let duration = start.elapsed();
splitted_time += duration.as_millis();
println!(
    "Result of splitted compression {} lines: {}MB in {:?}",
    index.len(),
    compressed_bundle.len() as f64 / 1024.0 / 1024.0,
    splitted_time as f64 / 1000.0
);

// Now let's decompress line number 0
let line = 0;
let offset_start = index[line].0 as usize;
let offset_end = index[line].1 as usize;
println!(
    "from {}, to {} of {}",
    offset_start,
    offset_end,
    compressed_bundle.len()
);
let mut file_chunk = &compressed_bundle[offset_start..offset_end];
// Print the length of the file chunk
println!("Length of file chunk: {}", file_chunk.len());
// Create a decoder over just this slice of the compressed data
let decoder = zstd::stream::Decoder::new(&mut file_chunk)?;
// Read the decompressed data line by line
let reader = BufReader::new(decoder);
for line in reader.lines() {
    let line = line.unwrap();
    println!("Decompressed data: {}", line);
    // Exit after the first line
    break;
}
```
```
thread 'main' panicked at 'called `Result::unwrap()` on an `Err` value: Custom { kind: UnexpectedEof, error: "incomplete frame" }'
```
So it seems like it is impossible...
However, there are related discussions.
I have the same issue. In other compression libraries that implement `Write` on their `Encoder`, `flush` completes a frame, so that you know that what's written can be incrementally decoded, e.g. in flate2:
https://docs.rs/flate2/1.0.28/src/flate2/gz/write.rs.html#148
I'd love to be able to use zstd for its improved efficiency, but I really need to ensure that what I have written at certain checkpoints can be fully decoded.
Update: I got around this by switching to facebook/zstd + C + Emscripten. I am now using it in my own project and able to compress/decompress without needing to send all chunks.
I really need this. I thought I was doing something wrong; it seems that after so much debugging, I wasn't. I am building a peer-to-peer file transfer app where I wanted to integrate zstd compression. I compress while streaming, and on the other side I decompress those chunks. However, I always got an "incomplete frame" error no matter what I tried. I thought about compressing each chunk individually, but I don't have enough knowledge; I guess this would result in a very poor compression ratio, right?
Is this a bad approach with zstd? Should I use a different algorithm, can someone enlighten me on this?
In zstd, to my knowledge, a frame is completely individually decompressible. This leads me to the question: is there a way to easily and manually set where a frame starts/ends, and also set what data goes in which frame? For example (pseudocode):
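The kind of API being asked for might look like the following hypothetical sketch. None of these methods exist in the crate today; `end_frame` is an imaginary name used purely to illustrate the idea:

```
// Hypothetical: explicitly delimit frames while streaming
let mut encoder = Encoder::new(writer, 9)?;
encoder.write_all(b"record 1")?;
let (start, end) = encoder.end_frame()?; // imaginary: close the current
                                         // frame and report its byte offsets
encoder.write_all(b"record 2")?;
encoder.end_frame()?;
encoder.finish()?;
```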
This way you could choose exactly which frame each piece of data ends up in.
This would be useful for retrieving data from a file without decompressing the whole file. Having the byte offsets of an individual frame, you could read and decompress just that one frame at a later point in time, which would be extremely useful when pulling data out of a large file.