ropensci / taxize

A taxonomic toolbelt for R
https://docs.ropensci.org/taxize

bug: NA returned by get_uid when using division_filter #909

Open davised opened 1 year ago

davised commented 1 year ago

I've run into a strange bug. I'm querying various bacterial taxa, and a handful of them have genus names that conflict with other entries in the database, e.g. Rhodococcus.

I tried to resolve this with the division_filter option, but I still got NA back.

I also tried the rows = option, but I don't think I can depend on a particular row being the right one. Here, the second row is the one I want.

The odd thing is that with division_filter = "bacteria" the function reports the taxon as found, yet the returned value is NA.

```r
> taxize::get_uid("Rhodococcus", rank_query = "Genus", rows = 2)
══  1 queries  ═══════════════

Retrieving data for taxon 'Rhodococcus'

✔  Found:  Rhodococcus
══  Results  ═════════════════

• Total: 1
• Found: 1
• Not Found: 0
[1] "1827"
attr(,"class")
[1] "uid"
attr(,"match")
[1] "found"
attr(,"multiple_matches")
[1] TRUE
attr(,"pattern_match")
[1] TRUE
attr(,"uri")
[1] "https://www.ncbi.nlm.nih.gov/taxonomy/1827"

> taxize::get_uid("Rhodococcus", rank_query = "Genus", division_filter = "bacteria")
══  1 queries  ═══════════════

Retrieving data for taxon 'Rhodococcus'

✔  Found:  Rhodococcus
══  Results  ═════════════════

• Total: 1
• Found: 1
• Not Found: 0
[1] NA
attr(,"class")
[1] "uid"
attr(,"match")
[1] "found"
attr(,"multiple_matches")
[1] TRUE
attr(,"pattern_match")
[1] FALSE
```
Session Info

```r
> sessioninfo::session_info()
─ Session info ───────────────────────────────────────────
 setting  value
 version  R version 4.1.3 (2022-03-10)
 os       Fedora Linux 36 (Workstation Edition)
 system   x86_64, linux-gnu
 ui       X11
 language (EN)
 collate  en_US.UTF-8
 ctype    en_US.UTF-8
 tz       America/Los_Angeles
 date     2023-03-06
 pandoc   2.14.0.3 @ /usr/bin/pandoc

─ Packages ───────────────────────────────────────────────
 package     * version date (UTC) lib source
 ape           5.6-2   2022-03-02 [1] CRAN (R 4.1.2)
 bold          1.2.0   2021-05-11 [1] CRAN (R 4.1.3)
 cli           3.4.1   2022-09-23 [1] CRAN (R 4.1.3)
 codetools     0.2-18  2020-11-04 [2] CRAN (R 4.1.3)
 conditionz    0.1.0   2019-04-24 [1] CRAN (R 4.1.3)
 crayon        1.5.2   2022-09-29 [1] CRAN (R 4.1.3)
 crul          1.3     2022-09-03 [1] CRAN (R 4.1.3)
 curl          4.3.2   2021-06-23 [1] CRAN (R 4.1.2)
 data.table    1.14.2  2021-09-27 [1] CRAN (R 4.1.2)
 foreach       1.5.2   2022-02-02 [1] CRAN (R 4.1.2)
 httpcode      0.3.0   2020-04-10 [1] CRAN (R 4.1.3)
 iterators     1.0.14  2022-02-05 [1] CRAN (R 4.1.2)
 jsonlite      1.8.2   2022-10-02 [1] CRAN (R 4.1.3)
 lattice       0.20-45 2021-09-22 [2] CRAN (R 4.1.3)
 magrittr      2.0.3   2022-03-30 [1] CRAN (R 4.1.3)
 nlme          3.1-155 2022-01-16 [2] CRAN (R 4.1.3)
 plyr          1.8.7   2022-03-24 [1] CRAN (R 4.1.3)
 R6            2.5.1   2021-08-19 [1] CRAN (R 4.1.2)
 Rcpp          1.0.9   2022-07-08 [1] CRAN (R 4.1.3)
 reshape       0.8.9   2022-04-12 [1] CRAN (R 4.1.3)
 sessioninfo   1.2.2   2021-12-06 [1] CRAN (R 4.1.3)
 stringi       1.7.8   2022-07-11 [1] CRAN (R 4.1.3)
 stringr       1.4.1   2022-08-20 [1] CRAN (R 4.1.3)
 taxize        0.9.100 2022-04-22 [1] CRAN (R 4.1.3)
 triebeard     0.3.0   2016-08-04 [1] CRAN (R 4.1.3)
 urltools      1.7.3   2019-04-14 [1] CRAN (R 4.1.3)
 uuid          1.1-0   2022-04-19 [1] CRAN (R 4.1.3)
 xml2          1.3.3   2021-11-30 [1] CRAN (R 4.1.2)
 zoo           1.8-10  2022-04-15 [1] CRAN (R 4.1.3)

 [1] /home/davised/R/x86_64-redhat-linux-gnu-library/4.1
 [2] /usr/lib64/R/library
 [3] /usr/share/R/library

──────────────────────────────────────────────────────────
```

davised commented 1 year ago

For completeness, I tested a second genus, Paracoccus, and got the same bug.

It looks like the second row is the right one there too. Maybe I can use rows = 2 and see if that's sufficient for now.

zachary-foster commented 1 year ago

Hi Ed!

So the issue is that the division is technically called "high G+C Gram-positive bacteria" for whatever reason. You can see that when you run it without a division filter:

```r
> taxize::get_uid("Rhodococcus", rank_query = "Genus")
══  1 queries  ═══════════════

Retrieving data for taxon 'Rhodococcus'

More than one UID found for taxon 'Rhodococcus'!

            Enter rownumber of taxon (other inputs will return 'NA'):

  status  rank                        division scientificname commonname     uid genus species subsp modificationdate
1 active genus                   scale insects    Rhodococcus            1661425                     2015/09/16 00:00
2 active genus high G+C Gram-positive bacteria    Rhodococcus               1827                     2022/09/18 00:00
```

Note that the input to division_filter is a regex. "bacteria" on its own does not match because taxize automatically adds ^ and $ to the division_filter argument, so the pattern has to match the whole division name. I think I will remove that so partial matches are possible. Here is what you would need to use to make it work:

```r
> taxize::get_uid("Rhodococcus", rank_query = "Genus", division_filter = "high G\\+C Gram-positive bacteria")
══  1 queries  ═══════════════

Retrieving data for taxon 'Rhodococcus'

✔  Found:  Rhodococcus
══  Results  ═════════════════

• Total: 1
• Found: 1
• Not Found: 0
[1] "1827"
attr(,"class")
[1] "uid"
attr(,"match")
[1] "found"
attr(,"multiple_matches")
[1] TRUE
attr(,"pattern_match")
[1] TRUE
attr(,"uri")
[1] "https://www.ncbi.nlm.nih.gov/taxonomy/1827"
```
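
To see why the bare "bacteria" pattern misses, here is a minimal base R sketch of the anchoring behaviour. It assumes the filter works roughly like wrapping the pattern in ^ and $ and passing it to `grepl()`; the helper name `anchored` is made up for illustration, and taxize's internal code may differ.

```r
# Division strings copied from the candidate table in this thread.
divisions <- c("scale insects", "high G+C Gram-positive bacteria")

# Assumption: the filter wraps the user's pattern in ^ and $ before matching.
anchored <- function(pattern) paste0("^", pattern, "$")

# Anchored "bacteria" must match the whole string, so neither row matches.
bare_match <- grepl(anchored("bacteria"), divisions)

# The full division name (with the plus sign escaped) matches the second row only.
full_match <- grepl(anchored("high G\\+C Gram-positive bacteria"), divisions)

# Without anchors, the partial pattern already picks out the second row.
partial_match <- grepl("bacteria", divisions)

print(bare_match)
print(full_match)
print(partial_match)
```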

Note that + is a regex metacharacter, so it needs to be escaped with \\ in the R string. After the change I just made to the version on GitHub, just "bacteria" will work in this instance as well.
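
The effect of the escape can be checked in base R without touching NCBI. This is only a sketch using `grepl()` directly; the bracket form `[+]` and `fixed = TRUE` are general base R alternatives for matching a literal plus sign, and this thread only confirms the `\\+` form for division_filter.

```r
division <- "high G+C Gram-positive bacteria"

# Unescaped: "G+" means "one or more G", so the regex expects a C right after
# the run of Gs and never matches the literal plus sign in the name.
unescaped <- grepl("^high G+C Gram-positive bacteria$", division)

# Escaped: "\\+" in an R string is the regex \+, i.e. a literal plus sign.
escaped <- grepl("^high G\\+C Gram-positive bacteria$", division)

# A one-character bracket expression is another way to write a literal plus.
bracketed <- grepl("^high G[+]C Gram-positive bacteria$", division)

# fixed = TRUE switches regex interpretation off entirely.
literal <- grepl("G+C", division, fixed = TRUE)

print(c(unescaped = unescaped, escaped = escaped,
        bracketed = bracketed, literal = literal))
```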

davised commented 1 year ago

Ah, I figured it was a regex, but I didn't realize it matched the whole string like grep -x, which is why I thought it was a bug.

I think a flag that defaults to whole-string matching (e.g. full=TRUE) could be fine; then I can set full=FALSE for my usage.

That way folks can choose how it works. It might also make sense to tell folks up front that it's a regex (maybe that's documented and I didn't see it).
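
A rough sketch of what such a flag could look like, written against a plain data frame rather than taxize itself. The function `filter_division` and its `full` argument are hypothetical and are not part of taxize; the two candidate rows are copied from the table earlier in this thread.

```r
# Hypothetical helper, NOT part of taxize: filter candidate rows by division.
filter_division <- function(candidates, pattern, full = TRUE) {
  # full = TRUE requires a whole-string match (like grep -x);
  # full = FALSE lets the pattern match anywhere in the division name.
  if (full) pattern <- paste0("^", pattern, "$")
  candidates[grepl(pattern, candidates$division), , drop = FALSE]
}

# Candidate rows for 'Rhodococcus' as shown in the thread.
candidates <- data.frame(
  division = c("scale insects", "high G+C Gram-positive bacteria"),
  uid = c("1661425", "1827"),
  stringsAsFactors = FALSE
)

strict <- filter_division(candidates, "bacteria")                # whole-string match
loose  <- filter_division(candidates, "bacteria", full = FALSE)  # partial match

print(nrow(strict))
print(loose$uid)
```

With the default, "bacteria" selects no rows, which mirrors the NA in the original report; with full = FALSE it selects only the 1827 row.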

Cheers