steineggerlab / conterminator

Detection of incorrectly labeled sequences across kingdoms
GNU General Public License v3.0

fail to predict inserted contamination #11

Open felipevzps opened 4 years ago

felipevzps commented 4 years ago

Hello!

I built a synthetic genome to check conterminator's output, and it failed to predict the inserted contaminants.

Info: Version: 1.c74b5. Organisms in this synthetic genome: Saccharum hybrid cultivar SP80-3280, Klebsiella pneumoniae, and Acinetobacter baumannii.

History: I inserted the complete A. baumannii and K. pneumoniae genomes into the sugarcane genome and created a Kraken mapping file. When I checked the mapping file, I could see the taxonomy IDs of the inserted sequences: A. baumannii = 470, K. pneumoniae = 573, and SP80-3280 = 193079.

Then I ran conterminator with the following command: conterminator dna synthetic_genome.fasta kraken_mapping_file.txt synthetic_genome_conterminator tmp

Results: The synthetic_genome_conterminator_conterm_prediction file is empty. The synthetic_genome_conterminator_all file has no information about the inserted contaminants.

Data: synthetic_genome_conterminator_all.txt, kraken_mapping_file.txt. The genome file is too big, and the conterm_prediction file is empty.

Problem: My objective is to detect contamination in the sugarcane genome. Am I using conterminator incorrectly, or is conterminator failing to predict the contamination?

martin-steinegger commented 4 years ago

We currently predict contamination only for short sequences of length < 20 kb. The 20 kb can be in scaffolds or just single sequences. I assume you have just one long sequence?

donovan-h-parks commented 2 years ago

@martin-steinegger Is there a way to indicate that contamination should be reported for longer sequences? I'm trying to reproduce the example between C. elegans and E. coli in your ms.

martin-steinegger commented 2 years ago

The _all report should contain all the local alignments with cross-kingdom hits (--kingdom). This could be used to filter for longer sequences. Can you find C. elegans and E. coli in it? The format is as follows:

1.) Numeric identifier
2.) Sequence identifier
3.) Alignment start
4.) Alignment end
5.) Corrected contig length (length between flanking Ns)
6.) Total sequence length
7.) Kingdom (default: 0: Bacteria&Archaea, 1: Fungi, 2: Metazoa, 3: Viridiplantae, 4: Other Eukaryotes)
8.) Species name

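As a rough illustration of the filtering suggested above, here is a minimal Python sketch that reads the _all report and keeps only hits on long contigs. It rests on several assumptions that are not confirmed in this thread: that the file is tab-delimited, that the columns appear in exactly the order listed above, and that the 20 kb cutoff is meant to apply to the corrected contig length (column 5) rather than the total sequence length. The names `Hit`, `parse_all_report`, and `long_hits` are hypothetical, and the two sample rows are invented for the demonstration.

```python
import csv
import io
from typing import Iterable, Iterator, List, NamedTuple


class Hit(NamedTuple):
    """One row of the _all report, in the column order listed above."""
    numeric_id: str
    seq_id: str
    aln_start: int
    aln_end: int
    contig_len: int  # corrected contig length (between flanking Ns)
    total_len: int
    kingdom: str     # kept as text, since --kingdom can redefine the groups
    species: str


def parse_all_report(handle: Iterable[str]) -> Iterator[Hit]:
    """Yield one Hit per row; assumes a tab-delimited file (not confirmed)."""
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        if len(row) < 8:
            continue  # skip blank or truncated lines
        yield Hit(row[0], row[1], int(row[2]), int(row[3]),
                  int(row[4]), int(row[5]), row[6], row[7])


def long_hits(hits: Iterable[Hit], min_len: int = 20_000) -> List[Hit]:
    """Keep hits whose corrected contig length is at least min_len bases."""
    return [h for h in hits if h.contig_len >= min_len]


# Invented example rows, for illustration only.
SAMPLE = (
    "0\tseqA\t100\t900\t5000\t5000\t0\tEscherichia coli\n"
    "0\tseqB\t2000\t2800\t150000\t400000\t2\tCaenorhabditis elegans\n"
)

hits = long_hits(parse_all_report(io.StringIO(SAMPLE)))
for h in hits:
    print(h.seq_id, h.aln_start, h.aln_end, h.contig_len, h.species)
```

For a real run, the `io.StringIO(SAMPLE)` handle would be replaced by an open file, for example `open("synthetic_genome_conterminator_all.txt")`. If the cutoff should apply to the total sequence length instead, compare against `h.total_len`.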
donovan-h-parks commented 2 years ago

There are indeed expected hits in the _all file. Is it possible to make the 20 kb filtering criterion an exposed parameter? This would also help document to users that such a criterion exists.

martin-steinegger commented 2 years ago

Yes, I agree. I had this on my todo list for quite some time. :( But currently I am quite flooded with work.