Closed by mw55309 3 years ago
Hi Mick!
I am confused by the output of gunc - I thought it would be able to identify those contigs which do not match with the rest of the genome - can gunc not do that?
We are currently working on this as a feature. Our aim is to have an automated 'chopping away' of problematic contigs, but we want to do it properly and benchmark before letting loose a tool that wreaks havoc with your previous MAGs...
Or at least I thought it would be able to tell me the taxonomic assignments of each contig so I could make the decision myself - does gunc not do this?
It certainly looks from the visualisation (https://grp-bork.embl-community.io/gunc/_images/GUNC_PLOT_example.png) that gunc is able to label contigs - can I get those labels out as a text file?
The visualisation module uses a heuristic to limit the number of displayed contigs (to avoid cluttering). But we'll add the option to get flat files with per-contig tax labels – basically, all labels assigned to a given contig at a taxonomic level, plus frequency. This would still require some further parsing to do what you intend, but it's the least biased output we can provide. This will come as a new option to `gunc run`, called `--contig_taxonomy_output`. @fullama is working to release this feature asap :-)
Thank you!
I guess in the meantime I can parse the diamond file to add taxonomic labels per protein and then use whatever summarisation algorithm I need to get per-contig taxonomies
Cheers Mick
Hi, I just released GUNC v1.0.4... (available via pip now and conda in a little while when it goes through)
You can now use gunc with the `--contig_taxonomy_output` option, which hopefully gives you the kind of output you are looking for.
It will output a TSV of the form:
contig | tax_level | assignment | count_of_genes_assigned |
---|---|---|---|
k141_21019_1 | kingdom | 2 Bacteria | 1 |
k141_21019_1 | phylum | 200795 Chloroflexi | 1 |
k141_21019_1 | family | 475964 Caldilineaceae | 1 |
k141_21019_1 | genus | 233191 Caldilinea | 1 |
Any questions just let us know!
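For the "further parsing" step, here is a minimal sketch of one way to summarise such a file, assuming it is tab-separated with exactly the header shown above. The function names `load_contig_taxonomy` and `majority_label` are hypothetical helpers, not part of GUNC, and taking the most frequent label is only one possible summarisation.

```python
import csv
import io
from collections import defaultdict


def load_contig_taxonomy(handle):
    """Return {contig: {tax_level: {assignment: gene_count}}} from a tab-separated handle."""
    table = defaultdict(lambda: defaultdict(dict))
    for row in csv.DictReader(handle, delimiter="\t"):
        count = int(row["count_of_genes_assigned"])
        table[row["contig"]][row["tax_level"]][row["assignment"]] = count
    return table


def majority_label(table, contig, tax_level):
    """Return (assignment, fraction of labelled genes) for the top label, or None."""
    counts = table.get(contig, {}).get(tax_level)
    if not counts:
        return None
    label, n = max(counts.items(), key=lambda kv: kv[1])
    return label, n / sum(counts.values())


# Rows copied from the example table above; tab delimiters are an assumption.
example = (
    "contig\ttax_level\tassignment\tcount_of_genes_assigned\n"
    "k141_21019_1\tkingdom\t2 Bacteria\t1\n"
    "k141_21019_1\tphylum\t200795 Chloroflexi\t1\n"
    "k141_21019_1\tfamily\t475964 Caldilineaceae\t1\n"
    "k141_21019_1\tgenus\t233191 Caldilinea\t1\n"
)

table = load_contig_taxonomy(io.StringIO(example))
print(majority_label(table, "k141_21019_1", "genus"))
```

With a real file, the same functions could be called on `open(path, newline="")` instead of the in-memory example. Contigs whose top label disagrees with the rest of the genome would then be candidates for a closer look.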
edit
I see there are 4 columns and the fourth is space separated!
OK!
Hello!
Thank you so much for doing this!
One small thing - in the output, some lines have 5 columns and others only have four..... e.g. species here:
contig tax_level assignment count_of_genes_assigned
single_ERR2027929.96_k87_3848 kingdom 2 Bacteria 4
single_ERR2027929.96_k87_3848 phylum 1239 Firmicutes 4
single_ERR2027929.96_k87_3848 family 186803 Lachnospiraceae 3
single_ERR2027929.96_k87_3848 family 31979 Clostridiaceae 1
single_ERR2027929.96_k87_3848 genus 841 Roseburia 2
single_ERR2027929.96_k87_3848 genus 1855714 Anaerobium 1
single_ERR2027929.96_k87_3848 genus 1485 Clostridium 1
single_ERR2027929.96_k87_3848 species specI_v3_07704 1
single_ERR2027929.96_k87_3848 species specI_v3_08779 1
single_ERR2027929.96_k87_3848 species specI_v3_11370 1
single_ERR2027929.96_k87_3848 species specI_v3_10000 1
Can this be fixed somehow? :-D
Cheers Mick
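The differing column counts seem to come from splitting on whitespace: the assignment field holds a taxid and a name separated by a space at most levels, but only a single identifier (what looks like a specI cluster ID) at species level. A minimal sketch, assuming the file is tab-separated as described, that splits on tabs and then separates the optional taxid from the name; `parse_line` is a hypothetical helper, not part of GUNC:

```python
def parse_line(line):
    """Split one data line on tabs and separate the optional taxid from the name."""
    contig, tax_level, assignment, count = line.rstrip("\n").split("\t")
    taxid, sep, name = assignment.partition(" ")
    if not sep:
        # No space in the field: the assignment is a bare identifier.
        taxid, name = None, assignment
    return contig, tax_level, taxid, name, int(count)


# Two rows from the output above, with tab delimiters assumed.
genus_line = "single_ERR2027929.96_k87_3848\tgenus\t841 Roseburia\t2\n"
species_line = "single_ERR2027929.96_k87_3848\tspecies\tspecI_v3_07704\t1\n"

# Whitespace splitting gives a varying number of fields; tab splitting does not.
print(len(genus_line.split()), len(species_line.split()))
print(len(genus_line.split("\t")), len(species_line.split("\t")))

print(parse_line(genus_line))
print(parse_line(species_line))
```

Parsed this way, every row yields the same five values, with the taxid left as `None` where the assignment carries no numeric ID.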