chassenr closed this issue 3 years ago
Hi @chassenr
Sorry again for this really late answer. I think most flagged contigs will be short, so just remove them.
If the contamination was scaffolded into a genome, it should be surrounded by Ns (Ns normally indicate scaffolding boundaries). In this case I would just remove everything in between the Ns.
Contaminated:
ATA....TGANTAGTRAGTARNT...GCTA
region1    conterm    region2
Clean:
ATA....TGANT...GCTA
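For anyone scripting this, here is a minimal Python sketch of the scheme above. It is an illustration, not part of the tool: the function name `remove_contaminated_block` is hypothetical, and it assumes the flagged alignment has already been converted to 0-based, end-exclusive coordinates and lies within a single N-delimited block. It removes the flagged block together with the N run that follows it, so one gap remains between the flanking regions, as in the "Clean" example. If there are no Ns on either side, the whole contig is dropped.

```python
def remove_contaminated_block(seq: str, start: int, end: int) -> str:
    """Remove the N-delimited block that contains seq[start:end].

    Assumption: start/end are 0-based, end-exclusive, and the flagged
    alignment sits inside one block between scaffolding gaps (runs of N).
    """
    if not 0 <= start < end <= len(seq):
        raise ValueError("interval must lie within the sequence")

    upper = seq.upper()

    # Left cut: just after the nearest N before the hit (0 if there is none).
    left = upper.rfind("N", 0, start) + 1

    # Right cut: the nearest N at or after the end of the hit ...
    right = upper.find("N", end)
    if right == -1:
        # ... or the end of the sequence if no gap follows the hit.
        right = len(seq)
    else:
        # Also skip the whole N run, so only the left gap is kept.
        while right < len(seq) and upper[right] == "N":
            right += 1

    cleaned = seq[:left] + seq[right:]
    # Drop gap characters left dangling at either end of the sequence.
    return cleaned.strip("Nn")


if __name__ == "__main__":
    # Same layout as the example above, with the elided parts left out:
    # region1 = ATATGA, contamination = TAGTRAGTAR, region2 = TGCTA
    scaffold = "ATATGANTAGTRAGTARNTGCTA"
    print(remove_contaminated_block(scaffold, 9, 14))
```

A flagged contig without any Ns comes back as an empty string, which corresponds to "just remove them" for short unscaffolded contigs.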
Thanks @martin-steinegger for confirming my gut feeling on this. This is also the strategy that I implemented in the meantime.
Hi @martin-steinegger, sorry to bother you with this. I am wondering how best to remove the sequences that were flagged as contaminated in a set of genome assemblies. Would you remove the whole contig, or just the section between the alignment start and end positions in the {RESULT_PREFIX}_conterm_prediction file, thereby splitting the sequence into multiple sections? How do you suggest dealing with contamination in scaffolds, e.g. extracting and removing the contaminated contig, thereby splitting the scaffold into multiple sections?
Thanks!