kowallus / mbgc

Multiple Bacteria Genome Compressor (MBGC)
GNU General Public License v3.0

Request for info: Are compression ratios expected to be the same for "all genomes compressed at once" vs. "genomes added one by one"? #12

Open karel-brinda opened 10 months ago

kowallus commented 10 months ago

No, they aren't. It might be the case in max mode, but there is no such guarantee.

karel-brinda commented 10 months ago

OK, understood. I am asking mainly to understand, for max mode, what the penalty is for future extensions of archived datasets, or whether it is better to recompress them from scratch (e.g., when merging two batches in phylogenetic compression).

kowallus commented 10 months ago

Appending might result in a slightly better ratio.

karel-brinda commented 10 months ago

Interesting! My intuition was the complete opposite. Thanks for the info!

(Btw, just out of interest: is it due to some speed-up heuristics based on large buffers that would be applied when not going genome by genome?)

kowallus commented 10 months ago

Append also repacks all mbgc streams, but new genomes are compressed in the full context of the original archive. Repacking builds the context online from scratch. Compression in max mode is fully sequential, so the context for each genome contains information from all preceding genomes; appending should therefore not influence the ratio much.
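
A toy illustration of the point about sequential context, offered only as an analogy: it uses DEFLATE with a preset dictionary from Python's `zlib`, not MBGC or its algorithm. The synthetic genomes, the mutation rate, and the helper names (`random_genome`, `mutate`, `deflate_size`) are all assumptions made up for this sketch. It compares compressing three related sequences in one pass against compressing two and then adding the third with the first two as context.

```python
import random
import zlib

ALPHABET = b"ACGT"

def random_genome(length, seed):
    """Hypothetical helper: a random DNA string of the given length."""
    rng = random.Random(seed)
    return bytes(rng.choice(ALPHABET) for _ in range(length))

def mutate(genome, rate, seed):
    """Hypothetical helper: redraw each base with probability `rate`."""
    rng = random.Random(seed)
    return bytes(
        rng.choice(ALPHABET) if rng.random() < rate else base
        for base in genome
    )

def deflate_size(data, context=b""):
    """Compressed size of `data`, optionally primed with `context`
    as a preset dictionary (a stand-in for 'preceding genomes')."""
    kwargs = {"zdict": context} if context else {}
    comp = zlib.compressobj(level=9, **kwargs)
    return len(comp.compress(data) + comp.flush())

# Three closely related synthetic genomes. Sizes are kept small so that
# everything fits inside DEFLATE's 32 KiB window.
ancestor = random_genome(8000, seed=1)
g1, g2, g3 = (mutate(ancestor, 0.01, seed=s) for s in (2, 3, 4))

batch = deflate_size(g1 + g2 + g3)            # all genomes at once
archive = deflate_size(g1 + g2)               # original archive
appended = deflate_size(g3, context=g1 + g2)  # g3 added with full context
standalone = deflate_size(g3)                 # g3 with no context at all

print("all at once:          ", batch)
print("archive + appended:   ", archive + appended)
print("g3 without any context:", standalone)
```

In this toy setup the third sequence sees the same preceding data in both scenarios, so the two totals come out close, while compressing it without any context costs far more. How MBGC behaves in practice depends on its mode and internals, as discussed above.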