ropensci / taxize

A taxonomic toolbelt for R
https://docs.ropensci.org/taxize

Missing children of Bacteria #660

Open arendsee opened 6 years ago

arendsee commented 6 years ago
Session Info

```r
Session info ------------------------------------------------------------------
 setting  value
 version  R version 3.4.3 (2017-11-30)
 system   x86_64, linux-gnu
 ui       X11
 language (EN)
 collate  en_US.UTF-8
 tz       America/Chicago
 date     2018-02-03

Packages ----------------------------------------------------------------------
 package     * version    date       source
 ape           5.0        2017-10-30 cran (@5.0)
 assertthat    0.2.0      2017-04-11 CRAN (R 3.4.1)
 base        * 3.4.3      2017-11-30 local
 bindr         0.1        2016-11-13 CRAN (R 3.4.1)
 bindrcpp    * 0.2        2017-06-17 CRAN (R 3.4.1)
 bit           1.1-12     2014-04-09 CRAN (R 3.4.1)
 bit64         0.9-7      2017-05-08 CRAN (R 3.4.1)
 blob          1.1.0      2017-06-17 CRAN (R 3.4.1)
 bold          0.5.0      2017-07-21 CRAN (R 3.4.2)
 cli           1.0.0      2017-11-05 CRAN (R 3.4.3)
 codetools     0.2-15     2016-10-05 CRAN (R 3.4.1)
 colorout    * 1.1-2      2017-09-23 Github (jalvesaq/colorout@020a14d)
 commonmark    1.4        2017-09-01 CRAN (R 3.4.1)
 compiler      3.4.3      2017-11-30 local
 crayon        1.3.4      2017-09-16 CRAN (R 3.4.1)
 crul          0.5.0      2018-01-22 cran (@0.5.0)
 curl          3.1        2017-12-12 cran (@3.1)
 data.table    1.10.4-3   2017-10-27 cran (@1.10.4-)
 datasets    * 3.4.3      2017-11-30 local
 DBI           0.7        2017-06-18 CRAN (R 3.4.1)
 dbplyr        1.2.0      2018-01-03 cran (@1.2.0)
 devtools    * 1.13.4     2017-11-09 CRAN (R 3.4.2)
 digest        0.6.13     2017-12-14 CRAN (R 3.4.3)
 dplyr       * 0.7.4      2017-09-28 cran (@0.7.4)
 foreach       1.4.4      2017-12-12 CRAN (R 3.4.3)
 glue          1.2.0      2017-10-29 cran (@1.2.0)
 graphics    * 3.4.3      2017-11-30 local
 grDevices   * 3.4.3      2017-11-30 local
 grid          3.4.3      2017-11-30 local
 hms           0.4.0      2017-11-23 CRAN (R 3.4.2)
 hoardr        0.2.0      2017-05-10 CRAN (R 3.4.2)
 httr          1.3.1      2017-08-20 CRAN (R 3.4.1)
 iterators     1.0.9      2017-12-12 CRAN (R 3.4.3)
 jsonlite      1.5        2017-06-01 CRAN (R 3.4.1)
 lattice       0.20-35    2017-03-25 CRAN (R 3.4.3)
 magrittr    * 1.5        2014-11-22 CRAN (R 3.4.1)
 memoise       1.1.0      2017-04-21 CRAN (R 3.4.1)
 methods     * 3.4.3      2017-11-30 local
 nlme          3.1-131    2017-02-06 CRAN (R 3.4.3)
 parallel      3.4.3      2017-11-30 local
 pillar        1.1.0      2018-01-14 cran (@1.1.0)
 pkgconfig     2.0.1      2017-03-21 CRAN (R 3.4.1)
 plyr          1.8.4      2016-06-08 CRAN (R 3.4.1)
 pryr        * 0.1.3      2017-10-30 cran (@0.1.3)
 purrr         0.2.4      2017-10-18 CRAN (R 3.4.2)
 R6            2.2.2      2017-06-17 CRAN (R 3.4.1)
 rappdirs      0.3.1      2016-03-28 CRAN (R 3.4.2)
 Rcpp          0.12.15    2018-01-20 cran (@0.12.15)
 readr         1.1.1      2017-05-16 CRAN (R 3.4.1)
 reshape       0.8.7      2017-08-06 CRAN (R 3.4.2)
 reshape2      1.4.3      2017-12-11 cran (@1.4.3)
 rlang         0.1.6      2017-12-21 cran (@0.1.6)
 RMySQL        0.10.13    2017-08-14 CRAN (R 3.4.2)
 roxygen2      6.0.1      2017-02-06 CRAN (R 3.4.2)
 RPostgreSQL   0.6-2      2017-06-24 CRAN (R 3.4.2)
 RSQLite       2.0        2017-06-19 CRAN (R 3.4.2)
 stats       * 3.4.3      2017-11-30 local
 stringi       1.1.6      2017-11-17 CRAN (R 3.4.2)
 stringr       1.2.0      2017-02-18 CRAN (R 3.4.1)
 taxize      * 0.9.1.9321 2018-02-03 Github (ropensci/taxize@319e03d)
 taxizedb    * 0.1.6                 local
 testthat    * 2.0.0      2017-12-13 CRAN (R 3.4.3)
 tibble        1.4.2      2018-01-22 cran (@1.4.2)
 tidyr         0.7.2      2017-10-16 cran (@0.7.2)
 tools         3.4.3      2017-11-30 local
 triebeard     0.3.0      2016-08-04 CRAN (R 3.4.2)
 urltools      1.7.0      2018-01-20 cran (@1.7.0)
 utils       * 3.4.3      2017-11-30 local
 withr         2.1.1      2017-12-19 CRAN (R 3.4.3)
 xml2          1.2.0      2018-01-24 cran (@1.2.0)
 zoo           1.8-1      2018-01-08 CRAN (R 3.4.3)
```

The dev version of taxize produces the following:

> taxize::children(2, db='ncbi', ambiguous=FALSE)[[1]]
   childtaxa_id        childtaxa_name childtaxa_rank
1        508458         Synergistetes         phylum
2        203691          Spirochaetes         phylum
3        200940 Thermodesulfobacteria         phylum
4        200938        Chrysiogenetes         phylum
5        200930       Deferribacteres         phylum
6        200918           Thermotogae         phylum
7        200783             Aquificae         phylum
8         74152         Elusimicrobia         phylum
9         68297           Dictyoglomi         phylum
10        67814           Caldiserica         phylum
11        57723         Acidobacteria         phylum
12        40117           Nitrospirae         phylum
13        32066          Fusobacteria         phylum
14         1224        Proteobacteria         phylum

This is missing several taxa retrieved from taxizedb:

> taxizedb::children(2, db='ncbi', ambiguous=FALSE)[[1]]
   childtaxa_id                  childtaxa_name childtaxa_rank
1       1936987                    Balneolaeota         phylum
2       1930617                 Calditrichaeota         phylum
3       1853220                 Rhodothermaeota         phylum
4       1802340 Nitrospinae/Tectomicrobia group        no rank
5       1783272             Terrabacteria group        no rank
6       1783270                       FCB group        no rank
7       1783257                       PVC group        no rank
8        508458                   Synergistetes         phylum
9        203691                    Spirochaetes         phylum
10       200940           Thermodesulfobacteria         phylum
11       200938                  Chrysiogenetes         phylum
12       200930                 Deferribacteres         phylum
13       200918                     Thermotogae         phylum
14       200783                       Aquificae         phylum
15        74152                   Elusimicrobia         phylum
16        68297                     Dictyoglomi         phylum
17        67814                     Caldiserica         phylum
18        57723                   Acidobacteria         phylum
19        40117                     Nitrospirae         phylum
20        32066                    Fusobacteria         phylum
21         1224                  Proteobacteria         phylum

The taxizedb result also matches the taxa listed on the NCBI Taxonomy site.
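For reference, the gap between the two result sets can be computed directly from the IDs printed above. This is only a sketch with the IDs copied by hand from the two outputs; it makes no call to taxize, taxizedb, or NCBI.

```r
# Child taxon IDs copied from the taxize::children() output above
taxize_ids <- c(508458, 203691, 200940, 200938, 200930, 200918, 200783,
                74152, 68297, 67814, 57723, 40117, 32066, 1224)

# Child taxon IDs copied from the taxizedb::children() output above
taxizedb_ids <- c(1936987, 1930617, 1853220, 1802340, 1783272, 1783270,
                  1783257, 508458, 203691, 200940, 200938, 200930, 200918,
                  200783, 74152, 68297, 67814, 57723, 40117, 32066, 1224)

# IDs that taxizedb returns but taxize does not
missing_ids <- setdiff(taxizedb_ids, taxize_ids)

# IDs that taxize returns but taxizedb does not (expected to be empty)
extra_ids <- setdiff(taxize_ids, taxizedb_ids)

print(missing_ids)
print(length(extra_ids))
```

The seven missing IDs are the three phyla Balneolaeota, Calditrichaeota and Rhodothermaeota plus the four "no rank" groups; everything taxize returns is also in the taxizedb result.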

sckott commented 6 years ago

i get the same thing, will look

sckott commented 6 years ago

@arendsee

so this is the http request made

https://eutils.ncbi.nlm.nih.gov/entrez/eutils/elink.fcgi?dbfrom=taxonomy&db=taxonomy&id=2&term=Bacteria%5BNext%20Level%5D&RetMax=1000&RetStart=0&api_key=useyourkey

wonder if there's anything in that request that strikes you as off, something we could change that would bring it in line with what taxizedb gives
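To make the request easier to inspect, the query string can be split into its decoded parameters. This is a minimal base-R sketch that only parses the URL above; it does not contact NCBI, and the variable names are just for illustration.

```r
# The request URL quoted above (api_key is a placeholder)
url <- paste0(
  "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/elink.fcgi",
  "?dbfrom=taxonomy&db=taxonomy&id=2",
  "&term=Bacteria%5BNext%20Level%5D&RetMax=1000&RetStart=0",
  "&api_key=useyourkey"
)

# Drop everything up to and including the "?", then split into key=value pairs
query <- sub("^[^?]*\\?", "", url)
pairs <- strsplit(strsplit(query, "&", fixed = TRUE)[[1]], "=", fixed = TRUE)

# Percent-decode each value and name it by its key
params <- vapply(pairs, function(p) utils::URLdecode(p[2]), character(1))
names(params) <- vapply(pairs, function(p) p[1], character(1))

print(params)
```

Decoded, the request combines `id=2` with `term=Bacteria[Next Level]`, so the children are selected by the name "Bacteria" as well as by the taxon ID.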

sckott commented 6 years ago

also @zachary-foster or @dwinter maybe you have a sense for why results are different from ENTREZ API vs. a dump of their database?

one thing I wonder about is whether the version of the database that ENTREZ is using could differ from what a user has on disk when using taxizedb - something to note in the docs somewhere, at least

dwinter commented 6 years ago

Hmm... not something I know much about but I don't think it's an issue of versions. The browser is 'live', the FTP dumps are updated hourly and the eUtils database is updated daily.

I would guess there is some trick in exactly what to send to elink. Using esearch instead of elink, there is the special term NXLV for immediate descendants. This gets most of the ones missing from taxize:

library(rentrez)
one_down <- entrez_search(db="taxonomy", term="Bacteria[NXLV]", use_history=TRUE)
summs <- entrez_summary(db="taxonomy", web_history=one_down$web_history)
t(extract_from_esummary(summs, c("scientificname", "rank", "taxid")))
        scientificname                    rank      taxid  
1936987 "Balneolaeota"                    "phylum"  1936987
1930617 "Calditrichaeota"                 "phylum"  1930617
1853220 "Rhodothermaeota"                 "phylum"  1853220
1802340 "Nitrospinae/Tectomicrobia group" ""        1802340
1783272 "Terrabacteria group"             ""        1783272
1783270 "FCB group"                       ""        1783270
1783257 "PVC group"                       ""        1783257
629425  "Bacteria ferula"                 "species" 629425 
629405  "Bacteria bahiensis"              "species" 629405 
629404  "Bacteria baculus"                "species" 629404 
629403  "Bacteria apolinari"              "species" 629403 
629401  "Bacteria ambigua"                "species" 629401 
629398  "Bacteria acuminatocercata"       "species" 629398 
629397  "Bacteria aborigena"              "species" 629397 
629396  "Bacteria abnormis"               "species" 629396 
508458  "Synergistetes"                   "phylum"  508458 
203691  "Spirochaetes"                    "phylum"  203691 
200940  "Thermodesulfobacteria"           "phylum"  200940 
200938  "Chrysiogenetes"                  "phylum"  200938 
200930  "Deferribacteres"                 "phylum"  200930 
200918  "Thermotogae"                     "phylum"  200918 
200783  "Aquificae"                       "phylum"  200783 
74152   "Elusimicrobia"                   "phylum"  74152  
68297   "Dictyoglomi"                     "phylum"  68297  
67814   "Caldiserica"                     "phylum"  67814  
57723   "Acidobacteria"                   "phylum"  57723  
48479   "environmental samples"           ""        48479  
40117   "Nitrospirae"                     "phylum"  40117  
32066   "Fusobacteria"                    "phylum"  32066  
2323    "unclassified Bacteria"           ""        2323   
1224    "Proteobacteria"                  "phylum"  1224

Not sure how helpful this is for the specific question, but it at least shows these taxa are accessible via eUtils.... :confused:

arendsee commented 6 years ago

@sckott Hmm, nothing about the request seems off to me. Some of the missing phyla are fairly new, see https://www.ncbi.nlm.nih.gov/pubmed/27287844. I wonder if there is something screwy on the Entrez side? Stale cached values for children ("Next Level"), perhaps?

zachary-foster commented 6 years ago

I am not sure either. Perhaps the term=Bacteria[Next Level] is filtering out some things that are associated with taxon ID 2, but not with "Bacteria" for some reason. Ideally, the term argument would not be needed, since we just want the child IDs for ID 2, regardless of the "term", but we were never able to get ENTREZ to do that.

zachary-foster commented 6 years ago

By the way, the title of this issue sounds like an interesting science fiction novel.

sckott commented 6 years ago

thanks @dwinter @arendsee @zachary-foster

@dwinter your approach might work, though i'm not sure how we'd programmatically filter the results to get only the direct children. i guess we can consult our internal data.frame of ranks and their order and only pick the direct descendant rank from the one queried? thoughts, folks?
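A rough sketch of that rank-based filter, under stated assumptions: `rank_order` is a hypothetical stand-in for the internal rank table (the real one may differ), `filter_direct_children()` is an illustrative helper and not an existing taxize function, and `candidates` is a hand-copied subset of the esearch results shown earlier in the thread. The heuristic keeps unranked rows plus the rows at the highest rank found below the queried taxon's rank.

```r
# Hypothetical rank ordering, highest first (assumption, not taxize's table)
rank_order <- c("superkingdom", "kingdom", "phylum", "class", "order",
                "family", "genus", "species")

# Subset of the esearch results above; "" means the taxon has no rank
candidates <- data.frame(
  taxid = c(1936987, 1783272, 629425, 629405, 508458, 48479, 2323, 1224),
  name  = c("Balneolaeota", "Terrabacteria group", "Bacteria ferula",
            "Bacteria bahiensis", "Synergistetes", "environmental samples",
            "unclassified Bacteria", "Proteobacteria"),
  rank  = c("phylum", "", "species", "species", "phylum", "", "", "phylum"),
  stringsAsFactors = FALSE
)

# Keep unranked rows and rows at the highest rank present below parent_rank
filter_direct_children <- function(df, parent_rank, ranks = rank_order) {
  parent_pos <- match(parent_rank, ranks)
  stopifnot(!is.na(parent_pos))          # parent rank must be a known rank
  pos <- match(df$rank, ranks)           # NA for unranked rows
  below <- !is.na(pos) & pos > parent_pos
  if (!any(below)) {
    return(df[is.na(pos), , drop = FALSE])
  }
  top <- min(pos[below])                 # highest rank among the candidates
  df[is.na(pos) | pos == top, , drop = FALSE]
}

kept <- filter_direct_children(candidates, "superkingdom")
print(kept$name)
```

On this subset the two "Bacteria ..." species rows are dropped and the phyla and unranked groups are kept. The heuristic has obvious gaps: it would drop a direct child that sits at a lower rank than its siblings, it cannot tell whether an unranked row is really a direct child, and it still keeps "environmental samples" and "unclassified Bacteria", which `ambiguous=FALSE` would presumably need to remove separately.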