Weirdness in unknown taxonomy

limey-bean / Anacapa

Written by Emily Curd (eecurd@g.ucla.edu), Jesse Gomer (jessegomer@gmail.com), Gaurav Kandlikar (gkandlikar@ucla.edu), Zack Gold (zjgold@ucla.edu), Max Ogden (max@maxogden.com), Lenore Pipes (lpipes@berkeley.edu), and Baochen Shi (biosbc@gmail.com). Assistance was provided by Rachel Meyer (rsmeyer@ucla.edu).

MIT License


Weirdness in unknown taxonomy #33

Closed gauravsk closed 6 years ago

gauravsk commented 6 years ago

I'm confused by the behavior on unknown taxonomy:

What's the difference between "NA;NA;NA;NA;NA;NA" and just "" in sum.taxonomy?
I guess we are just at the mercy of NCBI's db/Entrez qiime here, but I find taxonomic calls like Arthropoda;Insecta;Anthoathecata;Hydractiniidae;NA;Podocoryna carnea hard to interpret as a user, and also when doing biom comparison stuff. Why isn't Podocoryna being listed as genus? We may find the R package taxize useful:

library(taxize)
classification("Podocoryna carnea", db = "ncbi")

Retrieving data for taxon 'Podocoryna carnea'

$`Podocoryna carnea`
                 name         rank     id
1  cellular organisms      no rank 131567
2           Eukaryota superkingdom   2759
3        Opisthokonta      no rank  33154
4             Metazoa      kingdom  33208
5           Eumetazoa      no rank   6072
6            Cnidaria       phylum   6073
7            Hydrozoa        class   6074
8        Hydroidolina     subclass  37516
9       Anthoathecata        order 406427
10           Filifera     suborder 406428
11     Hydractiniidae       family   6094
12         Podocoryna        genus   6095
13  Podocoryna carnea      species   6096
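
Since every node in the lineage above carries a rank label, a fixed-rank taxonomy string can be built by selecting rows by rank rather than by position. Here is a minimal Python sketch of that idea, assuming the lineage is available as (name, rank) pairs copied from the taxize output. `RANKS`, `LINEAGE` and `to_rank_string` are illustrative names, not part of Anacapa, CRUX or taxize.

```python
# Ranks to keep, in output order (an assumption based on this thread).
RANKS = ["superkingdom", "phylum", "class", "order", "family", "genus", "species"]

# (name, rank) pairs copied from the taxize output above.
LINEAGE = [
    ("cellular organisms", "no rank"),
    ("Eukaryota", "superkingdom"),
    ("Opisthokonta", "no rank"),
    ("Metazoa", "kingdom"),
    ("Eumetazoa", "no rank"),
    ("Cnidaria", "phylum"),
    ("Hydrozoa", "class"),
    ("Hydroidolina", "subclass"),
    ("Anthoathecata", "order"),
    ("Filifera", "suborder"),
    ("Hydractiniidae", "family"),
    ("Podocoryna", "genus"),
    ("Podocoryna carnea", "species"),
]


def to_rank_string(lineage, ranks=RANKS, missing="NA"):
    """Select names by rank label; ranks absent from the lineage become `missing`."""
    by_rank = {rank: name for name, rank in lineage if rank in ranks}
    return ";".join(by_rank.get(rank, missing) for rank in ranks)


print(to_rank_string(LINEAGE))
```

For this taxon the sketch prints Eukaryota;Cnidaria;Hydrozoa;Anthoathecata;Hydractiniidae;Podocoryna;Podocoryna carnea, so the genus slot is filled and the phylum and class come out as Cnidaria and Hydrozoa.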

@jessegomer @limey-bean

limey-bean commented 6 years ago

That is a very good question @gauravsk. I agree that the missing genus is a huge problem, particularly when the genus is right there in the "genus species" name. We can add taxize into CRUX, but it would take some work. Because we are filtering reads by accession version number, we could just grab "genus species" with entrez_qiime.py and then run that file through taxize (unless taxize accepts accession version numbers). We would then just need to grab superkingdom, phylum, class, order, family, genus and species. We don't have any R in the CRUX scripts but it could be fun!
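
As a stopgap while the database side is sorted out, the "genus is NA but the species is a binomial" case can at least be detected and patched in existing taxonomy strings. The Python sketch below is only a heuristic, not what Anacapa or CRUX does. It assumes the six-field phylum;class;order;family;genus;species layout seen in the example call in this thread, and `fill_genus_from_species` is a hypothetical helper name.

```python
def fill_genus_from_species(taxonomy, sep=";", missing="NA"):
    """Fill a missing genus from the first word of a two-word species name.

    Assumes six fields: phylum;class;order;family;genus;species.
    Strings with any other number of fields (including "") are returned unchanged.
    """
    fields = taxonomy.split(sep)
    if len(fields) != 6:
        return taxonomy
    genus, species = fields[4], fields[5]
    parts = species.split()
    # Only act on a two-word species whose first word starts with a capital.
    if genus == missing and len(parts) == 2 and parts[0][:1].isupper():
        fields[4] = parts[0]
    return sep.join(fields)


call = "Arthropoda;Insecta;Anthoathecata;Hydractiniidae;NA;Podocoryna carnea"
print(fill_genus_from_species(call))
```

This only repairs the genus slot. It does not touch the higher ranks, so a wrong phylum or class in the input stays wrong.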

gauravsk commented 6 years ago

Agreed that doing this in R rather than CRUX might be the way to go. It looks like taxize might work with accession numbers, but I'm not sure. https://gist.github.com/sckott/a78e11dc624dd4342173#pass-the-uid-along-to-other-functions

limey-bean commented 6 years ago

That looks super promising. It certainly works with accession numbers; we can check if it works with accession version numbers. We could do this in place of entrez_qiime.py in CRUX. Is there an easy way to read in the fasta file, strip the accession (or accession version number), run it through taxize and pull out kingdom, phylum, class, order, family, genus and species, and then make a txt file that matches the current taxonomy file output?
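
The file-handling half of that question can be sketched in Python without committing to a lookup backend. The sketch assumes the accession (with version) is the first token of each FASTA header and that the taxonomy file is one accession<TAB>taxonomy line per record. Both are assumptions about the CRUX layout, not confirmed here. `lookup` is a hypothetical callable standing in for whatever resolves an accession to a rank string (taxize or otherwise), and the accessions in the demo are made-up placeholders.

```python
import io


def read_accessions(fasta_handle):
    """Yield the first whitespace-delimited token of each FASTA header line."""
    for line in fasta_handle:
        if line.startswith(">"):
            tokens = line[1:].split()
            if tokens:
                yield tokens[0]


def strip_version(accession):
    """Drop the version suffix, e.g. 'ACC00001.1' becomes 'ACC00001'."""
    return accession.split(".", 1)[0]


def write_taxonomy(fasta_handle, lookup, out_handle, missing="NA"):
    """Write one 'accession<TAB>taxonomy' line per FASTA record."""
    for acc in read_accessions(fasta_handle):
        # `lookup` is hypothetical: it returns a rank string, or None if unknown.
        lineage = lookup(strip_version(acc))
        out_handle.write(f"{acc}\t{lineage or missing}\n")


# Demo with placeholder accessions and an in-memory lookup table.
fasta = io.StringIO(">ACC00001.1 placeholder record\nACGT\n>ACC00002.1\nTTGA\n")
table = {
    "ACC00001": "Eukaryota;Cnidaria;Hydrozoa;Anthoathecata;"
                "Hydractiniidae;Podocoryna;Podocoryna carnea",
}
out = io.StringIO()
write_taxonomy(fasta, table.get, out)
print(out.getvalue(), end="")
```

Passing the lookup in as a function keeps the parsing and writing testable offline, and leaves open whether the lookup keys on the bare accession or the accession version number.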

gauravsk commented 6 years ago

Yeah, that should be doable. Tbh, I'm not sure what the best place to integrate it is. I'm not as familiar with the post-dada2 steps of Anacapa as I should be; maybe there's a way to integrate it over there. Let's talk about it.

limey-bean commented 6 years ago

Well, it is a CRUX database problem for sure... See line 90 of the third CRUX script. If you had a pretty R script, we could drop it in there... I am around if you wanna chat. https://github.com/limey-bean/CRUX_Creating-Reference-libraries-Using-eXisting-tools/blob/master/crux_release_V1_db/crux_part3.sh

limey-bean commented 6 years ago

Ok, this is not a CRUX problem. @jessegomer, we have some BLCA stuff to check out...