steineggerlab / conterminator

Detection of incorrectly labeled sequences across kingdoms
GNU General Public License v3.0
Removing sequences based on conterminator output #14

Closed: chassenr closed this issue 3 years ago

chassenr commented 3 years ago

Hi @martin-steinegger, sorry to bother you with this. I am wondering how best to remove the sequences that were flagged as contaminated in a set of genome assemblies. Would you remove the whole contig, or just the section between the alignment start and end positions in the {RESULT_PREFIX}_conterm_prediction file, thereby splitting the sequence into multiple sections? How do you suggest dealing with contamination in scaffolds, e.g. extracting and removing the contaminated contig, thereby splitting the scaffold into multiple sections?

Thanks!

martin-steinegger commented 3 years ago

Hi @chassenr

Sorry again for this really late answer. I think most flagged contigs will be short, so you can just remove them entirely. If the contamination was scaffolded into a genome, it should be surrounded by Ns (Ns normally indicate scaffolding boundaries). In this case I would just remove everything between the Ns.

Contaminated:
 ATA....TGANTAGTRAGTARNT...GCTA
 region1    conterm    region2

Clean:
ATA....TGANT...GCTA
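
For reference, the "remove everything between the Ns" strategy could be sketched in Python roughly as follows. This is only an illustration and not part of conterminator: the function name `remove_flagged_contig` is hypothetical, and it assumes the flagged interval has already been converted to 0-based, half-open coordinates on the scaffold and lies within a single contig. It keeps one flanking run of Ns as the gap, as in the diagram above.

```python
def remove_flagged_contig(seq: str, start: int, end: int) -> str:
    """Drop the contig that contains seq[start:end] from a scaffold.

    Assumptions (not taken from conterminator itself):
    - start/end are 0-based, half-open, and fall inside one contig;
    - contigs within the scaffold are separated by runs of N.
    """
    upper = seq.upper()

    # Walk left from the flagged start to the closest N (or the scaffold start).
    left = start
    while left > 0 and upper[left - 1] != "N":
        left -= 1

    # Walk right from the flagged end to the closest N (or the scaffold end).
    right = end
    while right < len(seq) and upper[right] != "N":
        right += 1

    if right == len(seq):
        # The flagged contig was the last one: also drop the gap before it.
        return seq[:left].rstrip("Nn")

    # Skip the N run after the contig so only the gap before it remains.
    while right < len(seq) and upper[right] == "N":
        right += 1
    return seq[:left] + seq[right:]


# Toy scaffold shaped like the diagram: region1 N conterm N region2
scaffold = "ATATGANTAGTRAGTARNTGCTA"
cleaned = remove_flagged_contig(scaffold, 9, 14)
print(cleaned)
```

A scaffold with no Ns at all is treated as a single contig and comes back empty, which corresponds to simply removing a short flagged contig.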

chassenr commented 3 years ago

Thanks @martin-steinegger for confirming my gut feeling on this. This is also the strategy that I implemented in the meantime.