miss binning of contigs

patrickwest / EukRep

Classification of Eukaryotic and Prokaryotic sequences from metagenomic datasets

MIT License


miss binning of contigs #11

Closed by AmaliT 4 years ago

AmaliT commented 4 years ago

I am trying to use the EukRep pipeline to separate eukaryotes and prokaryotes prior to binning. I ran EukRep with default settings and a minimum contig size cut-off of 1 kb. However, a quick similarity-based classification of the two bins (prokaryotes & eukaryote.fa) shows lots of eukaryotic contigs in the prokaryote FASTA file (similar to what is mentioned in #5). How could I improve these results? Would changing -m or -k help? Also, could you please explain the differences among the three models: strict, balanced (default), and lenient? I couldn't find much detail on these in the publication or documentation.

Thanks heaps in advance

Cheers Amali

patrickwest commented 4 years ago

Hi Amali,

Thanks for your interest. Increasing -k will always help, at the expense of longer runtimes.

For the -m option, it depends a bit on what you mean by "improve". There are three sets of trained models in which the class weights have been modified to bias classification either for or against eukaryotic classification. Lenient is more biased towards eukaryotic classification, while strict is biased against it. It's important to know that it's a trade-off, however: with the lenient model, for example, you will likely get more true positive eukaryotic scaffolds but also more false positive eukaryotic scaffolds. I hope that helps clarify.

Patrick
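
The trade-off described above can be illustrated with a small toy sketch. This is not EukRep code: EukRep's modes use models trained with different class weights, whereas this sketch swaps in a simple decision threshold on made-up scores as a stand-in for that bias. The scores, labels, and threshold values are all hypothetical and chosen only to show the pattern.

```python
# Toy illustration only. Scores, labels, and thresholds are invented for
# this example and are NOT EukRep output or EukRep parameters.
# A decision threshold stands in for the class-weight bias of each mode.

def confusion_counts(scores, labels, threshold):
    """Count true and false positives when calling 'euk' at score >= threshold."""
    tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == "euk")
    fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == "prok")
    return tp, fp

# Hypothetical per-contig "eukaryote" scores with their true origin.
scores = [0.95, 0.80, 0.65, 0.55, 0.45, 0.40, 0.30, 0.10]
labels = ["euk", "euk", "prok", "euk", "euk", "prok", "prok", "prok"]

# Hypothetical thresholds: a lower threshold is more biased towards 'euk'.
thresholds = {"strict": 0.7, "balanced": 0.5, "lenient": 0.35}

results = {name: confusion_counts(scores, labels, t) for name, t in thresholds.items()}
for name, (tp, fp) in results.items():
    print(f"{name}: true positives={tp}, false positives={fp}")
```

In this toy data, moving from strict to lenient recovers more of the true eukaryotic contigs, but each step also admits more prokaryotic contigs into the eukaryotic set.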

AmaliT commented 4 years ago

Hi @patrickwest

Thanks for the reply. I tried increasing k to 6 but didn't see much of a change in the results. Is the maximum for k 7?

Thanks for the explanation of the 3 different models.

patrickwest commented 4 years ago

EukRep struggles with some genomes more than others, especially if the genome is heavily fragmented. In my experience, binning is generally improved by combining multiple methods. I don't know what your use case is, but if you're seeing better results by combining EukRep with similarity-based classification, then you should probably go with that. (I also don't know which method you're using here, but I would expect some misbinned contigs from it as well.) Where you can, I would eventually try to check against phylogenetic signal, however.
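
One way to combine two classification methods, as suggested above, is to keep only the calls on which they agree and set the rest aside for a closer look (for example, a phylogenetic check). The sketch below is a minimal illustration under assumptions: the function name `combine_calls`, the contig IDs, and the `"euk"`/`"prok"` labels are hypothetical, and the per-contig calls would need to be parsed from each tool's actual output first.

```python
# Minimal sketch: combine_calls and all contig IDs/labels are hypothetical.
# Each input maps a contig ID to a call made by one method.

def combine_calls(eukrep_calls, similarity_calls):
    """Split contigs into agreed calls and conflicts needing manual review."""
    consensus = {}
    conflicts = []
    for contig in sorted(set(eukrep_calls) | set(similarity_calls)):
        a = eukrep_calls.get(contig)       # None if this method made no call
        b = similarity_calls.get(contig)
        if a == b:
            consensus[contig] = a          # both methods agree
        else:
            conflicts.append(contig)       # disagreement or missing call
    return consensus, conflicts

# Hypothetical example calls.
eukrep_calls = {"c1": "euk", "c2": "prok", "c3": "prok"}
similarity_calls = {"c1": "euk", "c2": "euk", "c3": "prok", "c4": "euk"}

consensus, conflicts = combine_calls(eukrep_calls, similarity_calls)
print("agreed:", consensus)
print("review:", conflicts)
```

Contigs that only one method classified are treated as conflicts here; whether to trust a single-method call instead is a design choice that depends on the dataset.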