grp-bork / gunc

Python package for detection of chimerism and contamination in prokaryotic genomes.
GNU General Public License v3.0

Identify contaminant contigs #14

Closed: mw55309 closed this issue 3 years ago

mw55309 commented 3 years ago

Hello

I am confused by the output of gunc - I thought it would be able to identify those contigs which do not match with the rest of the genome - can gunc not do that?

Or at least I thought it would be able to tell me the taxonomic assignments of each contig so I could make the decision myself - does gunc not do this?

It certainly looks from the visualisation (https://grp-bork.embl-community.io/gunc/_images/GUNC_PLOT_example.png) that gunc is able to label contigs - can I get those labels out as a text file?

Thanks Mick

defleury commented 3 years ago

Hi Mick!

> I am confused by the output of gunc - I thought it would be able to identify those contigs which do not match with the rest of the genome - can gunc not do that?

We are currently working on this as a feature. Our aim is to have an automated 'chopping away' of problematic contigs, but we want to do it properly and benchmark it before letting loose a tool that wreaks havoc with your previous MAGs...

> Or at least I thought it would be able to tell me the taxonomic assignments of each contig so I could make the decision myself - does gunc not do this?
>
> It certainly looks from the visualisation (https://grp-bork.embl-community.io/gunc/_images/GUNC_PLOT_example.png) that gunc is able to label contigs - can I get those labels out as a text file?

The visualisation module uses a heuristic to limit the number of displayed contigs (to avoid cluttering). But we'll add the option to get flat files with per-contig tax labels – basically, all labels assigned to a given contig at a taxonomic level, plus frequency. This would still require some further parsing to do what you intend, but it's the least biased output we can provide. This will come as a new option to gunc run, called --contig_taxonomy_output. @fullama is working to release this feature asap :-)
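To illustrate the kind of "further parsing" mentioned above, here is a minimal sketch of a naive majority vote over per-contig label counts at a single taxonomic level. This is not GUNC's method and has not been benchmarked. The function names, the `(contig, label, gene_count)` tuple layout, and the example data are all hypothetical.

```python
from collections import Counter, defaultdict


def dominant_labels(rows):
    """Return the most frequent label per contig.

    rows: iterable of (contig, label, gene_count) tuples, all taken from
    ONE taxonomic level (hypothetical layout, not a GUNC data structure).
    """
    per_contig = defaultdict(Counter)
    for contig, label, count in rows:
        per_contig[contig][label] += count
    # Ties are broken by first appearance, which is arbitrary.
    return {c: counts.most_common(1)[0][0] for c, counts in per_contig.items()}


def flag_outlier_contigs(rows):
    """Flag contigs whose dominant label differs from the genome-wide one."""
    rows = list(rows)
    genome_counts = Counter()
    for _, label, count in rows:
        genome_counts[label] += count
    genome_label = genome_counts.most_common(1)[0][0]
    dominant = dominant_labels(rows)
    outliers = sorted(c for c, lab in dominant.items() if lab != genome_label)
    return genome_label, outliers


# Made-up example data, for illustration only.
rows = [
    ("contig_a", "Firmicutes", 4),
    ("contig_b", "Firmicutes", 3),
    ("contig_b", "Proteobacteria", 1),
    ("contig_c", "Proteobacteria", 2),
]
genome_label, outliers = flag_outlier_contigs(rows)
print(genome_label, outliers)
```

A simple vote like this ignores contig length, the depth of the taxonomic level, and contigs with very few assigned genes, so any flagged contig would still need manual inspection.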

mw55309 commented 3 years ago

Thank you!

I guess in the meantime I can parse the diamond file to add taxonomic labels per protein, and then use whatever summarisation algorithm I need to get per-contig taxonomies.

Cheers Mick

fullama commented 3 years ago

Hi, I just released GUNC v1.0.4 (available via pip now, and via conda in a little while once it goes through).

You can now run gunc with the --contig_taxonomy_output option, which hopefully gives you the kind of output you are looking for.

It will output a TSV of the form:

contig          tax_level   assignment              count_of_genes_assigned
k141_21019_1    kingdom     2 Bacteria              1
k141_21019_1    phylum      200795 Chloroflexi      1
k141_21019_1    family      475964 Caldilineaceae   1
k141_21019_1    genus       233191 Caldilinea       1

Any questions just let us know!
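A minimal sketch of reading such a file in Python, assuming the four columns are tab-separated and the `assignment` column holds a numeric taxid followed by a space and the taxon name (inferred from the example rows in this thread, not from GUNC documentation). The `ContigTaxRow` type and `parse_contig_taxonomy` function are hypothetical names.

```python
import csv
import io
from typing import NamedTuple, Optional


class ContigTaxRow(NamedTuple):
    contig: str
    tax_level: str
    taxid: Optional[str]  # None when the assignment carries no numeric id
    name: str
    gene_count: int


def parse_contig_taxonomy(handle):
    """Yield one ContigTaxRow per data line of a tab-separated file.

    Assumption: columns are contig, tax_level, assignment, count, and the
    assignment is either "<taxid> <name>" or a single label.
    """
    reader = csv.reader(handle, delimiter="\t")
    next(reader, None)  # skip the header line
    for fields in reader:
        if not fields:
            continue  # tolerate blank lines
        contig, tax_level, assignment, count = fields
        head, _, rest = assignment.partition(" ")
        if head.isdigit() and rest:
            taxid, name = head, rest
        else:
            taxid, name = None, assignment
        yield ContigTaxRow(contig, tax_level, taxid, name, int(count))


# Two rows copied from the example above, with tabs restored.
example = (
    "contig\ttax_level\tassignment\tcount_of_genes_assigned\n"
    "k141_21019_1\tkingdom\t2 Bacteria\t1\n"
    "k141_21019_1\tphylum\t200795 Chloroflexi\t1\n"
)
rows = list(parse_contig_taxonomy(io.StringIO(example)))
for row in rows:
    print(row)
```

For a file on disk, the same function can be passed a handle opened with `open(path, newline="")`.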

mw55309 commented 3 years ago

Edit: I see there are 4 columns, and the third (assignment) is space-separated!

OK!

Hello!

Thank you so much for doing this!

One small thing: in the output, some lines have 5 columns and others only have 4, e.g. species here:

contig  tax_level       assignment      count_of_genes_assigned
single_ERR2027929.96_k87_3848   kingdom 2 Bacteria      4
single_ERR2027929.96_k87_3848   phylum  1239 Firmicutes 4
single_ERR2027929.96_k87_3848   family  186803 Lachnospiraceae  3
single_ERR2027929.96_k87_3848   family  31979 Clostridiaceae    1
single_ERR2027929.96_k87_3848   genus   841 Roseburia   2
single_ERR2027929.96_k87_3848   genus   1855714 Anaerobium      1
single_ERR2027929.96_k87_3848   genus   1485 Clostridium        1
single_ERR2027929.96_k87_3848   species specI_v3_07704  1
single_ERR2027929.96_k87_3848   species specI_v3_08779  1
single_ERR2027929.96_k87_3848   species specI_v3_11370  1
single_ERR2027929.96_k87_3848   species specI_v3_10000  1

Can this be fixed somehow? :-D

Cheers Mick
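One possible workaround on the user side, sketched under the assumption that the file is tab-separated and that only some `assignment` values start with a numeric taxid (as in the rows above): split the assignment into separate `taxid` and `name` columns and fill a placeholder where no taxid is present, so every line has five fields. The function name, the output column names, and the `NA` placeholder are hypothetical choices, not part of GUNC.

```python
import csv
import io


def normalise_contig_taxonomy(src, dst, missing="NA"):
    """Rewrite a 4-column contig taxonomy TSV as a fixed 5-column TSV.

    Assumption: the assignment column is either "<taxid> <name>" or a
    single label without a numeric id (as in the species rows).
    """
    reader = csv.reader(src, delimiter="\t")
    writer = csv.writer(dst, delimiter="\t", lineterminator="\n")
    next(reader, None)  # drop the original 4-column header
    writer.writerow(
        ["contig", "tax_level", "taxid", "name", "count_of_genes_assigned"]
    )
    for fields in reader:
        if not fields:
            continue  # tolerate blank lines
        contig, tax_level, assignment, count = fields
        head, _, rest = assignment.partition(" ")
        if head.isdigit() and rest:
            taxid, name = head, rest
        else:
            taxid, name = missing, assignment
        writer.writerow([contig, tax_level, taxid, name, count])


# Two rows taken from the output above, with tabs restored.
raw = (
    "contig\ttax_level\tassignment\tcount_of_genes_assigned\n"
    "single_ERR2027929.96_k87_3848\tgenus\t841 Roseburia\t2\n"
    "single_ERR2027929.96_k87_3848\tspecies\tspecI_v3_07704\t1\n"
)
out = io.StringIO()
normalise_contig_taxonomy(io.StringIO(raw), out)
fixed_lines = out.getvalue().splitlines()
print("\n".join(fixed_lines))
```

Splitting on tabs first, and only then on the first space inside the assignment, also keeps taxon names that contain spaces in one piece.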