jtamames / SqueezeMeta

A complete pipeline for metagenomic analysis
GNU General Public License v3.0

Using dedupe.sh to identify and remove contigs contained in others (merged mode) #701

Open eperezv opened 1 year ago

eperezv commented 1 year ago

Hi,

I'm trying to run SqueezeMeta in merged mode because my data set is too big to allow coassembly. When running cd-hit, I realized it was also too slow, so I thought of running dedupe.sh with a minimum identity of 99% instead.

I was wondering whether this alternative, which is faster than cd-hit and, I think, does the same job, would be suitable for running SqueezeMeta in merged mode.

Cheers

fpusan commented 1 year ago

It seems like it could be a valid alternative, and I am actually quite interested in seeing how it works for you. Please keep us posted!

eperezv commented 1 year ago

I was able to run dedupe.sh, which finished in around 10 min, whereas cd-hit was taking days (and didn't finish). It identified and removed contigs that were identical to or contained in others, which eliminated about 30% of the sequences. I am now running minimus2, but that will probably take a long time.
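
For anyone following along, here is a minimal Python sketch of what "identical or contained in others" means for a set of contigs. It is only a toy illustration under stated assumptions: it uses exact substring matching on both strands, it does not reproduce the 99% identity threshold, and it is not how dedupe.sh or cd-hit are implemented. The function names and the example contigs are hypothetical.

```python
def reverse_complement(seq: str) -> str:
    """Return the reverse complement of an uppercase DNA sequence."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def remove_contained(contigs: dict[str, str]) -> dict[str, str]:
    """Drop contigs that are identical to, or exactly contained in, a kept contig.

    Toy version: exact matches only, quadratic in the number of contigs.
    Both the forward strand and the reverse complement are checked.
    """
    # Process longest contigs first, so a contig can only be absorbed
    # by one that is at least as long. sorted() is stable, so among exact
    # duplicates the first one in input order is the one that survives.
    ordered = sorted(contigs.items(), key=lambda kv: len(kv[1]), reverse=True)

    kept: dict[str, str] = {}
    kept_upper: list[str] = []
    for name, seq in ordered:
        upper = seq.upper()
        rc = reverse_complement(upper)
        if any(upper in k or rc in k for k in kept_upper):
            continue  # identical to or contained in an already kept contig
        kept[name] = seq
        kept_upper.append(upper)
    return kept


if __name__ == "__main__":
    # Hypothetical example data, not taken from any real assembly.
    example = {
        "c1": "ACGTACGTTT",
        "c2": "GTACG",       # contained in c1 (forward strand)
        "c3": "AAACGT",      # contained in c1 (reverse complement)
        "c4": "ACGTACGTTT",  # exact duplicate of c1
        "c5": "GGGCCC",      # unrelated, kept
    }
    print(list(remove_contained(example)))  # ['c1', 'c5']
```

The all-against-all comparison here would not scale to a real metagenome; it is only meant to show which contigs count as redundant before the remaining ones are passed on to the overlap-merging step.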