grp-bork / gunc

Python package for detection of chimerism and contamination in prokaryotic genomes.
GNU General Public License v3.0

--use_species_level is not a better choice? #36

Closed nashanghenzan closed 1 year ago

nashanghenzan commented 1 year ago

By default, GUNC uses the phylum level as maxCSS. But I find that pass.GUNC is False at one taxonomic level while pass.GUNC is True at the phylum level for the same bin, so which result is more credible?

fullama commented 1 year ago

I'm not sure I understand the question. Could you post an example of what you mean?

nashanghenzan commented 1 year ago

Just like the following file:

| genome | n_genes_called | n_genes_mapped | n_contigs | taxonomic_level | proportion_genes_retained_in_major_clades | genes_retained_index | clade_separation_score | contamination_portion | n_effective_surplus_clades | mean_hit_identity | reference_representation_score | pass.GUNC |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| GHR_bin.3 | 1813 | 1745 | 384 | kingdom | 1.0 | 0.96 | 0.0 | 0.0 | 0.0 | 0.77 | 0.74 | True |
| GHR_bin.3 | 1813 | 1745 | 384 | phylum | 0.99 | 0.95 | 0.0 | 0.0 | 0.0 | 0.77 | 0.73 | True |
| GHR_bin.3 | 1813 | 1745 | 384 | class | 0.98 | 0.94 | 0.37 | 0.03 | 0.06 | 0.78 | 0.73 | True |
| GHR_bin.3 | 1813 | 1745 | 384 | order | 0.94 | 0.91 | 0.44 | 0.06 | 0.14 | 0.78 | 0.71 | True |
| GHR_bin.3 | 1813 | 1745 | 384 | family | 0.87 | 0.83 | 0.0 | 0.0 | 0.0 | 0.8 | 0.67 | True |
| GHR_bin.3 | 1813 | 1745 | 384 | genus | 0.85 | 0.82 | 0.48 | 0.14 | 0.32 | 0.8 | 0.66 | False |
| GHR_bin.3 | 1813 | 1745 | 384 | species | 0.66 | 0.64 | 0.07 | 0.56 | 2.33 | 0.83 | 0.53 | True |
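For readers who want to check a detailed output like this one programmatically, here is a minimal sketch that lists the taxonomic levels where pass.GUNC is False. It uses only the standard library and copies a subset of the columns from the rows posted above; the helper name `parse_rows` is hypothetical and not part of GUNC.

```python
# Subset of the detailed GUNC output posted above, whitespace-separated.
# Only the columns needed for this check are kept.
ROWS = """\
taxonomic_level clade_separation_score contamination_portion pass.GUNC
kingdom 0.0 0.0 True
phylum 0.0 0.0 True
class 0.37 0.03 True
order 0.44 0.06 True
family 0.0 0.0 True
genus 0.48 0.14 False
species 0.07 0.56 True
"""


def parse_rows(text):
    """Hypothetical helper: turn the whitespace-separated table into dicts."""
    lines = text.strip().splitlines()
    header = lines[0].split()
    records = []
    for line in lines[1:]:
        rec = dict(zip(header, line.split()))
        # Convert the numeric and boolean columns from strings.
        rec["clade_separation_score"] = float(rec["clade_separation_score"])
        rec["contamination_portion"] = float(rec["contamination_portion"])
        rec["pass.GUNC"] = rec["pass.GUNC"] == "True"
        records.append(rec)
    return records


records = parse_rows(ROWS)

# Levels at which this bin is flagged (pass.GUNC is False).
failing_levels = [r["taxonomic_level"] for r in records if not r["pass.GUNC"]]
print(failing_levels)
```

For this bin, only the genus row is flagged, and it is also the row with the highest clade_separation_score (0.48).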

fullama commented 1 year ago

So are you saying you get a different result when you use the --use_species_level option? Or are you wondering why phylum is picked in the maxCSS output file?

nashanghenzan commented 1 year ago

> So are you saying you get a different result when you use the --use_species_level option? Or are you wondering why phylum is picked in the maxCSS output file?

I want to know which taxonomic level of the GUNC results should commonly be used: species, phylum, or another one?

fullama commented 1 year ago

See the results section of the paper: https://genomebiology.biomedcentral.com/articles/10.1186/s13059-021-02393-0
"As expected, GUNC accuracy was consistently high with the only exception of species-level chimerism where it performed suboptimally at lower portions of contamination (Additional file 1: Figure S3)."

This is why the species level is off by default.

The normal output file is the one we would recommend using; see https://grp-bork.embl-community.io/gunc/output.html. The detailed output is there only if you want more information.
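To illustrate what "species level is off by default" could mean for level selection, here is a rough sketch. It assumes the normal output reports the taxonomic level with the highest clade_separation_score, and that the species row is skipped unless --use_species_level is given. The function `pick_max_css_level` and its tie-breaking rule are hypothetical and are not taken from the GUNC source code; the scores are copied from the detailed output posted earlier in this thread.

```python
# (taxonomic_level, clade_separation_score) pairs from the detailed output
# posted earlier in this thread, ordered from broadest to narrowest level.
LEVEL_SCORES = [
    ("kingdom", 0.0),
    ("phylum", 0.0),
    ("class", 0.37),
    ("order", 0.44),
    ("family", 0.0),
    ("genus", 0.48),
    ("species", 0.07),
]


def pick_max_css_level(level_scores, use_species_level=False):
    """Hypothetical helper: return the (level, score) pair with the highest CSS.

    Assumption: the species row is ignored unless use_species_level is True.
    On ties, max() keeps the first candidate, i.e. the broadest level.
    """
    candidates = [
        (level, score)
        for level, score in level_scores
        if use_species_level or level != "species"
    ]
    return max(candidates, key=lambda pair: pair[1])


default_choice = pick_max_css_level(LEVEL_SCORES)
with_species = pick_max_css_level(LEVEL_SCORES, use_species_level=True)
print(default_choice, with_species)
```

Under these assumptions, the selected level for this bin is genus with or without the species row, because the species-level score (0.07) is lower than the genus-level score (0.48).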