refresh-bio / agc

Assembled Genomes Compressor
MIT License

Thoughts about compressing unitigs? #3

Open rchikhi opened 1 year ago

rchikhi commented 1 year ago

Hi Sebastian, Agnieszka, Heng,

AGC looks great. I wanted to see if it'd work also on badly-assembled sequences, e.g. unitigs, and didn't get good compression ratios. Would you say the approach fundamentally wouldn't work for unitigs, or did I miss some parameter tweaks?

I tried to compress unitigs from two human samples (NA06986 and NA06991) using CHM13v2 as the reference. The resulting AGC file is 3.6 GB, which is larger than the two raw gzipped unitig files combined (2 x 1.7 GB). Command line: \time ~/tools/agc/agc create -t 10 chm13v2.0.oneline.fa NA06986.unitigs.fa.gz NA06991.unitigs.fa.gz > NA06986_NA06991.agc. Testing with the parameter -s 200 didn't substantially change the results.

thanks in advance for any feedback, Rayan

sebastiandeorowicz commented 1 year ago

Hi Rayan, AGC was designed for high-quality assemblies. Nevertheless, I'm a bit surprised that you report such poor ratios, so we have to take a look at this case. We should definitely be better than gzip. :-) I'll let you know when we have any news. Best, Sebastian

arcadeo commented 1 year ago

AGC does look great! And perhaps I misunderstood, but I think the size difference is due to the AGC file including three genomes (i.e., ref + 2 unitig assemblies), not just two. So AGC would still effectively be smaller at 3.6GB than concatenating the three assemblies.
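
The size comparison in this comment can be checked with quick arithmetic. The following is a minimal sketch using only the numbers reported in this thread (a 3.6 GB archive and two gzipped unitig files of 1.7 GB each). The gzipped size of the reference is not stated anywhere in the thread, so it is left as a parameter; the function names are hypothetical and chosen only for illustration.

```python
def fair_baseline_gb(unitig_gz_sizes_gb, reference_gz_gb):
    """Total size if every input, reference included, is kept as its own gzip file."""
    return sum(unitig_gz_sizes_gb) + reference_gz_gb


def break_even_reference_gb(archive_gb, unitig_gz_sizes_gb):
    """Gzipped reference size above which the archive is smaller than the gzip baseline."""
    return archive_gb - sum(unitig_gz_sizes_gb)


# Numbers reported in the thread.
unitig_gz_sizes_gb = [1.7, 1.7]
agc_archive_gb = 3.6

threshold = break_even_reference_gb(agc_archive_gb, unitig_gz_sizes_gb)
print(f"AGC wins if the gzipped reference is larger than about {threshold:.1f} GB")
```

Under these assumptions, the archive only needs the gzipped reference to exceed roughly 0.2 GB to come out ahead of storing all three files separately. That says nothing about whether the ratio is good in absolute terms, only that the two-file comparison leaves out one input.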