patrickwest / EukRep

Classification of Eukaryotic and Prokaryotic sequences from metagenomic datasets
MIT License

minimum contig length for confident classification result? #6

Open jzrapp opened 4 years ago

jzrapp commented 4 years ago

Hi @patrickwest,

Did you experiment with sequence length, and do you have a recommendation for a minimum contig length that still allows confident classification as either eukaryote or prokaryote? Thanks a lot!

Best, Josephine

jzrapp commented 4 years ago

And as a follow-up: it turns out that many of the contigs classified as "euk" contain structural phage genes, so these might be viral contigs? Would you recommend setting -m to "strict" and comparing? Have you ever dealt with viral signals in your data?

patrickwest commented 4 years ago

Hi Josephine,

It's hard to make a specific minimum contig length recommendation because it will depend on your assembly quality and your tolerance for false positives. That being said, a minimum contig length of 1 kb should still give reasonable results and may be necessary if your assembly is highly fragmented. 5 kb is currently the default because it often represents a good balance of accuracy without eliminating too much of the assembly.
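To get a feel for that trade-off on a given assembly, it can help to check how much sequence survives each cutoff before classifying. The following is only a minimal sketch in plain Python, not part of EukRep: the helper names `read_fasta`, `filter_by_length`, and `summarize_cutoffs` are hypothetical, and the 1 kb and 5 kb cutoffs are simply the two values mentioned above.

```python
from typing import Dict, Iterable, Iterator, List, Tuple

Record = Tuple[str, str]  # (header without ">", sequence)


def read_fasta(lines: Iterable[str]) -> Iterator[Record]:
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header = None
    chunks: List[str] = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header = line[1:]
            chunks = []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def filter_by_length(records: Iterable[Record], min_len: int = 5000) -> List[Record]:
    """Keep only contigs that are at least min_len bases long."""
    return [(h, s) for h, s in records if len(s) >= min_len]


def summarize_cutoffs(
    records: Iterable[Record], cutoffs: Tuple[int, ...] = (1000, 5000)
) -> Dict[int, Tuple[int, float]]:
    """For each cutoff, report (contigs kept, fraction of total bases kept)."""
    records = list(records)
    total_bp = sum(len(s) for _, s in records)
    summary = {}
    for cutoff in cutoffs:
        kept = filter_by_length(records, cutoff)
        kept_bp = sum(len(s) for _, s in kept)
        summary[cutoff] = (len(kept), kept_bp / total_bp if total_bp else 0.0)
    return summary
```

With a real assembly one would pass an open file handle to `read_fasta`, for example `summarize_cutoffs(read_fasta(open("contigs.fa")))`, where `contigs.fa` is a placeholder path. If the 5 kb cutoff discards most of the bases, that is a sign the assembly is fragmented enough that the lower cutoff may be worth its extra false positives.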

That's interesting; viruses have so far not been included in the training or test datasets. I would generally trust homology-based classifications over k-mer-based classifications (i.e. EukRep), so those do sound like likely phage contigs.
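One way to apply that preference in practice is to let a homology-based call override the k-mer-based call wherever both exist. This is a rough sketch under stated assumptions: the function names, the label strings ("euk", "prok", "phage"), and the dictionary layout are all hypothetical and do not correspond to EukRep output or to any particular annotation tool.

```python
from typing import Dict, List, Optional


def reconcile(kmer_label: str, homology_label: Optional[str]) -> str:
    """Prefer the homology-based label; fall back to the k-mer label if none exists."""
    return homology_label if homology_label else kmer_label


def reconcile_all(
    kmer_calls: Dict[str, str], homology_calls: Dict[str, str]
) -> Dict[str, str]:
    """Apply reconcile() to every contig that has a k-mer-based call."""
    return {
        contig: reconcile(label, homology_calls.get(contig))
        for contig, label in kmer_calls.items()
    }


def disagreements(
    kmer_calls: Dict[str, str], homology_calls: Dict[str, str]
) -> List[str]:
    """List contigs where both methods made a call and the calls differ."""
    return sorted(
        contig
        for contig, label in kmer_calls.items()
        if contig in homology_calls and homology_calls[contig] != label
    )
```

The `disagreements` list is the set worth inspecting by hand, since it would contain cases like the "euk" contigs carrying structural phage genes described above.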

jzrapp commented 4 years ago

Thanks a lot, Patrick! I realized that I hadn't read your manuscript thoroughly enough, and only after asking this question did I find the figure showing that 1 kb should be a good minimum cutoff. I'm actually not trying to bin eukaryotes, but to remove eukaryotic genes from my analysis - more like removing contaminants. I'm working on a dataset that includes two different sample types - one with heavy eukaryote contribution and one without. We prefiltered the samples, but based on scaffold taxonomy a large number of scaffolds still appear to be eukaryotic (and viral). Digging through the literature, I couldn't really find good examples of how to deal with this, but I'm worried that their presence might skew functional comparisons on a community level. Any thoughts on this from the Banfield lab? :)