tskit-dev / tszip

Gzip-like compression for tskit tree sequences
https://tszip.readthedocs.io/
MIT License

Worse compression in recent versions? #87

Open hyanwong opened 2 months ago

hyanwong commented 2 months ago

I'm loading https://zenodo.org/records/5495535/files/hgdp_tgp_sgdp_chr20_p.dated.trees.tsz. When I decompress it and then compress it again with the latest tszip, I get a file that is about 25% larger than the original (35M vs 44M):

-rw-r--r--@ 1 yan  staff    35M 13 Jul 19:24 hgdp_tgp_sgdp_chr20_p.dated.trees.tsz
-rw-r--r--@ 1 yan  staff    44M 13 Jul 19:25 hgdp_tgp_sgdp_chr20_p.dated.trees_recompressed.tsz

I seem to remember some explanation for why compression got worse at some point, but I can't find it.

jeromekelleher commented 2 months ago

That's curious. I don't remember any particular reason for this, and I would need to look at the Zarr arrays in both files to get any insight.
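
One rough way to look inside the two files without decompressing them is sketched below. It assumes that a .tsz file is a zip archive wrapping a Zarr store, with one directory of chunk files per array; the function names `stored_bytes_by_array` and `compare` are hypothetical helpers made up for this example, not part of tszip. It uses only the Python standard library and sums the stored bytes per array, so the arrays that grew the most sort to the top.

```python
import zipfile
from collections import defaultdict


def stored_bytes_by_array(tsz_path):
    """Sum the stored size of every zip member, grouped by its parent path.

    Assumption: each Zarr array lives in its own directory inside the
    archive, so the parent path of a chunk file identifies the array.
    """
    totals = defaultdict(int)
    with zipfile.ZipFile(tsz_path) as zf:
        for info in zf.infolist():
            if info.is_dir():
                continue
            # "edges/left/0" -> "edges/left"; top-level members keep their name.
            group, _, _ = info.filename.rpartition("/")
            totals[group or info.filename] += info.compress_size
    return dict(totals)


def compare(old_path, new_path):
    """Return (name, old_bytes, new_bytes) rows, largest growth first."""
    old = stored_bytes_by_array(old_path)
    new = stored_bytes_by_array(new_path)
    rows = [
        (name, old.get(name, 0), new.get(name, 0))
        for name in sorted(set(old) | set(new))
    ]
    rows.sort(key=lambda row: row[2] - row[1], reverse=True)
    return rows
```

Calling `compare` on the original file and the recompressed one, then printing the first few rows, should show whether the extra 9M is spread evenly or concentrated in a handful of arrays.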

benjeffery commented 2 months ago

Does the compression ratio change if you revert to an old zarr version?
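
To make a version comparison like this reproducible, it may help to record which package versions produced each file. A minimal sketch using only the standard library follows; the helper name `installed_versions` is hypothetical, and the default package list is a guess at the dependencies that matter here.

```python
from importlib import metadata


def installed_versions(packages=("tszip", "zarr", "numcodecs", "tskit")):
    """Map each package name to its installed version, or None if absent."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


if __name__ == "__main__":
    for name, version in installed_versions().items():
        print(f"{name}: {version or 'not installed'}")
```

Running this in each environment before recompressing gives a record to post alongside the resulting file sizes.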