steineggerlab / conterminator

Detection of incorrectly labeled sequences across kingdoms
GNU General Public License v3.0

Input format #4

ghost closed this issue 4 years ago

ghost commented 4 years ago

Hi, I'm a little confused about what data you want in the input .fasta for running DNA contamination estimation. I have a draft genome assembly with thousands of contigs that I want to check for contamination. What should I put in the .fasta and .mapping files?

martin-steinegger commented 4 years ago

@jaydonhansen conterminator is built to do an all-against-all comparison of databases (e.g. GTDB, GenBank, RefSeq or custom sets). It was not built to scan just one genome at a time.

zagorGit commented 4 years ago

So how can one prepare a mapping file for this custom set?

ghost commented 4 years ago

> So how can one prepare a mapping file for this custom set?

As per the example mapping file, it appears that you need to make a tab-delimited table that pairs the ID in each FASTA header with its corresponding taxid. In the example they had:

Chromosome	562
Chromosome	562
Human-real1	9606
Human-real2	9606
Virus	2202649

where 'Chromosome', 'Chromosome', 'Human-real1', 'Human-real2' and 'Virus' were the sequence IDs in the FASTA headers, and the taxids were 562 (E. coli), 9606 (human) and 2202649 (a viral protein), respectively. This needs to be done for each sequence in the input file.
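For a custom set, one way to produce such a table is to walk over the FASTA headers and pair each ID with a taxid. The following is only a rough sketch, not something taken from the conterminator repository: the helper name `fasta_to_mapping`, the demo file names and the demo sequences are made up, and it assumes that every sequence in a given FASTA file belongs to the same taxon.

```shell
#!/bin/sh
# Hypothetical helper (not part of conterminator): print "<fasta id><TAB><taxid>"
# for every header line of a FASTA file.
# Assumption: all sequences in the file share the same NCBI taxonomy ID.
fasta_to_mapping() {
    fasta="$1"   # path to the FASTA file
    taxid="$2"   # taxonomy ID to assign to every sequence in that file
    awk -v taxid="$taxid" '
        /^>/ {
            id = substr($1, 2)    # header text up to the first whitespace, without ">"
            print id "\t" taxid
        }' "$fasta"
}

# Made-up demo input: two small FASTA files, one per taxon.
demo_dir=$(mktemp -d)
printf '>contig_1 length=8\nACGTACGT\n>contig_2 length=4\nGGCC\n' > "$demo_dir/ecoli.fna"
printf '>Human-real1\nTTTT\n' > "$demo_dir/human.fna"

# Build one combined mapping file and one combined FASTA file.
{
    fasta_to_mapping "$demo_dir/ecoli.fna" 562
    fasta_to_mapping "$demo_dir/human.fna" 9606
} > "$demo_dir/custom.mapping"
cat "$demo_dir/ecoli.fna" "$demo_dir/human.fna" > "$demo_dir/custom.fna"

cat "$demo_dir/custom.mapping"
```

For the demo input this prints three tab-delimited lines (contig_1 and contig_2 with 562, Human-real1 with 9606). If the taxids differ per sequence rather than per file, the same idea applies, but the taxid would have to come from a lookup table instead of a single argument.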

martin-steinegger commented 4 years ago

Thank you @zagorGit and @jaydonhansen. The documentation still needs some work but I improved it. I have added the following description for the mapping file to the README. Does this answer your question?

Conterminator needs a mapping file, which assigns each FASTA identifier to a taxonomic identifier. The mapping file consists of two tab-delimited columns: (1) the FASTA identifier and (2) the [NCBI taxonomy identifier (taxonomy ID)](https://www.ncbi.nlm.nih.gov/taxonomy). By default, Conterminator takes the text up to the first blank space as the FASTA identifier. However, for GenBank, TrEMBL and Swiss-Prot headers, Conterminator extracts only the unique identifier that is mapped to the taxonomy ID.
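Before starting a run, it may help to check that every FASTA identifier has a line in the mapping file. The sketch below is an illustration only: `missing_ids` is a hypothetical helper, not a conterminator command, and it applies just the default rule described above (text up to the first blank), not the special handling of GenBank, TrEMBL or Swiss-Prot headers.

```shell
#!/bin/sh
# Hypothetical check (not part of conterminator): print every FASTA identifier
# that has no entry in the first column of the mapping file.
missing_ids() {
    fasta="$1"     # path to the FASTA file
    mapping="$2"   # path to the two-column, tab-delimited mapping file
    awk -F '\t' '
        FILENAME == ARGV[1] { seen[$1] = 1; next }   # mapping file: remember column 1
        /^>/ {
            # FASTA header: identifier is the text after ">" up to the first blank
            split(substr($0, 2), parts, /[ \t]/)
            if (!(parts[1] in seen)) print parts[1]
        }' "$mapping" "$fasta"
}

# Made-up demo input in which seqB has no mapping line.
demo=$(mktemp -d)
printf '>seqA some description\nACGT\n>seqB\nGGCC\n>seqC\nTTAA\n' > "$demo/set.fna"
printf 'seqA\t562\nseqC\t9606\n' > "$demo/set.mapping"

missing_ids "$demo/set.fna" "$demo/set.mapping"
```

An empty result means that each header found a match under that default rule. In the demo, only seqB is reported.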

Example for detecting contamination in the NT database:

blastdbcmd -db nt -entry all > nt.fna
blastdbcmd -db nt -entry all -outfmt "%a %T" > nt.fna.taxidmapping
conterminator dna nt.fna nt.fna.taxidmapping nt.result tmp

zagorGit commented 4 years ago

Yes, thank you.