ropensci / taxize

A taxonomic toolbelt for R
https://docs.ropensci.org/taxize

EOL search needs to use "content" as opposed to title from JSON output #624

Closed glaroc closed 7 years ago

glaroc commented 7 years ago

For example, sci2comm('Acer saccharum') returns no results. This seems to be because eol_search(terms = 'Acer saccharum') does not return an exact match, since authorities are appended to the species name. From my understanding, species listed in the JSON output of the API call often include the author in the "title" field, while the plain species names are listed in the "content" field. In this case, page 582247 is the correct one.

sckott commented 7 years ago

thanks @glaroc !

Two issues:

  1. I'm not sure we want to go with the content field in the returned data. Going with your example taxon, here are the first five results from the content and title fields:
> vapply(res$results[1:5], "[[", "", "content")
[1] "Acer saccharum var. floridanum (Chapm.) Small & A. Heller; Acer saccharum var. floridanum Small & A. Heller"
[2] "Acer barbatum Michx.; Acer barbatum; Acer floridanum; Acer saccharum floridanum; Acer floridanum (Chapm.) Pax; Acer saccharum subsp. floridanum; Acer floridanum var. longii Fernald; Acer floridanum Pax"
[3] "Acer barbatum Michx.; Acer floridanum (Chapman) Pax; Saccharodendron barbatum (Michx.) Nieuwl.; Saccharodendron floridanum (Chapman) Nieuwl.; Acer saccharinum var. floridanum Chapman; Acer barbatum var. longii (Fern.) Fern.; Acer barbatum var. villipes (Rehd.) Ashe; Acer floridanum var. longii Fern.; Acer floridanum var. villipes Rehd.; Acer nigrum var. floridanum (Chapman) Fosberg; Acer saccharum var. floridanum (Chapman) Small & Heller; Acer barbatum; Acer saccharum subsp. floridanum (Chapm.) Desmarais; Acer saccharinum var. floridanum Chapm.; Saccharodendron floridanum (Chapm.) Nieuwl.; Acer saccharum subsp. floridanum (Chapman) Desmarais; Acer floridanum var. longii Fernald; Acer barbatum var. longii (Fernald) Fernald; Acer saccharum ssp. floridanum (Chapm.) Desmarais; Acer barbatum var. villipes (Rehder) Ashe; Acer floridanum var. villipes Rehder; Acer nigrum var. floridanum (Chapm.) Fosberg"
[4] "Acer nigrum Michx. f.; Acer nigrum; Acer saccharum nigrum; Acer nigrum F. Michx.; Acer saccharum subsp. nigrum; Acer nigrum F.Michx. (1812); Acer saccharum var. nigrum Britton"
[5] "Saccharodendron nigrum (Michx. f.) Small; Acer saccharum var. viride (Schmidt) E. Murr.; Acer nigrum var. palmeri Sarg.; Acer saccharum var. nigrum (Michx. f.) Britt.; Acer nigrum; Acer saccharum subsp. nigrum (F. Michx.) Desmarais; Acer saccharum subsp. nigrum (Michx. f.) Desmarais; Acer nigrum Michx.; Acer saccharum subsp. nigrum (Michx.) Desmarais; Saccharodendron nigrum (F. Michx.) Small; Acer saccharum ssp. nigrum (F. Michx.) Desmarais; Acer saccharum var. viride (Schmidt) E. Murray; Acer saccharum var. nigrum (F. Michx.) Britton"
> vapply(res$results[1:5], "[[", "", "title")
[1] "Acer floridanum (Chapm.) Pax" "Acer floridanum (Chapm.) Pax" "Acer floridanum (Chapm.) Pax" "Acer nigrum F. Michx."        "Acer nigrum F. Michx."

I'm not sure what the content field contains exactly, but it's much more text than the title field.


  2. There was actually a bug that I fixed for your other issue #625 that applies here: we fetch data for all the page IDs that match the taxon you search for, and some of those page ID requests result in HTTP requests that error out, so I'm using tryCatch to handle that.
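
For readers following along, the tryCatch idea can be sketched roughly as below. This is only an illustration, not the actual taxize code: `fetch_page` and `safe_fetch` are hypothetical names, and the failing request is simulated rather than sent to EOL.

```r
# Hypothetical stand-in for a per-page request; it is NOT part of taxize.
# It fails for one page ID to imitate an HTTP request that errors out.
fetch_page <- function(id) {
  if (id == 596825) stop("HTTP request failed for page ", id)
  list(pageid = id)
}

# Wrap each request so that a failure yields NULL instead of aborting the loop.
safe_fetch <- function(id) {
  tryCatch(fetch_page(id), error = function(e) NULL)
}

ids <- c(583023, 596825, 583022)
pages <- lapply(ids, safe_fetch)

# Drop the failed requests and keep the rest.
pages <- Filter(Negate(is.null), pages)
ok_ids <- vapply(pages, function(p) p$pageid, numeric(1))
print(ok_ids)
```

With this pattern one bad page ID no longer takes down the whole lookup; the remaining pages are still processed.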

glaroc commented 7 years ago

I was thinking that the content field is more appropriate because the only entries that contain the actual species name "Acer saccharum" in that field are the ones that reference the proper page ID (582247). This may not apply to all situations, however.

sckott commented 7 years ago

i'm not sure i follow. can you clarify?

glaroc commented 7 years ago

In the JSON output (http://eol.org/api/search/1.0.json?q=acer+saccharum&page=1&exact=false&filter_by_taxon_concept_id=&filter_by_hierarchy_entry_id=&filter_by_string=&cache_ttl=false), the entries with "id":582247 are the correct ones, and they also contain the correct species name with no attribution in the content field. Doing a search for Acer saccharum on the EOL website also returns page 582247.
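
To make the idea concrete, here is a rough sketch of matching against the individual names inside the content field instead of comparing against title. The helper names `split_content` and `has_exact_name` are made up for illustration, and the strings are copied (content shortened) from the output earlier in this thread, so it shows the mechanism on "Acer floridanum" rather than reproducing the 582247 entry.

```r
# Split a semicolon-separated `content` string into individual names.
split_content <- function(content) {
  trimws(strsplit(content, ";", fixed = TRUE)[[1]])
}

# TRUE when the query equals one of the names exactly (ignoring case).
has_exact_name <- function(content, query) {
  tolower(query) %in% tolower(split_content(content))
}

# Values taken from the output earlier in this thread (content shortened).
title   <- "Acer floridanum (Chapm.) Pax"
content <- "Acer barbatum Michx.; Acer barbatum; Acer floridanum; Acer saccharum floridanum"

title_match   <- identical(title, "Acer floridanum")         # title carries the authority
content_match <- has_exact_name(content, "Acer floridanum")  # bare name is listed in content
partial_only  <- has_exact_name(content, "Acer saccharum")   # only appears inside a longer name
print(c(title_match, content_match, partial_only))
```

Whether the bare name is always present in content is an open question, so this would need checking against more taxa.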

sckott commented 7 years ago

the sci2comm('Acer saccharum') example should be fixed now, reinstall and try again

glaroc commented 7 years ago

Yes, that works!

sckott commented 7 years ago

we should probably just return the content field as well as the link field, so eol_search would give, e.g.,

x <- eol_search('Acer saccharum')
str(x)
#> 'data.frame':    23 obs. of  4 variables:
#>  $ pageid : int  583023 583023 583023 596825 596825 583022 583021 583021 1245035 1249734 ...
#>  $ name   : chr  "Acer floridanum (Chapm.) Pax" "Acer floridanum (Chapm.) Pax" "Acer floridanum (Chapm.) Pax" "Acer nigrum F. Michx." ...
#>  $ link   : chr  "http://eol.org/583023?action=overview&controller=taxa" "http://eol.org/583023?action=overview&controller=taxa" "http://eol.org/583023?action=overview&controller=taxa" "http://eol.org/596825?action=overview&controller=taxa" ...
#>  $ content: chr  "Acer saccharum var. floridanum (Chapm.) Small & A. Heller; Acer saccharum var. floridanum Small & A. Heller" "Acer barbatum Michx.; Acer barbatum; Acer floridanum; Acer saccharum floridanum; Acer floridanum (Chapm.) Pax; "| __truncated__ "Acer barbatum Michx.; Acer floridanum (Chapman) Pax; Saccharodendron barbatum (Michx.) Nieuwl.; Saccharodendron"| __truncated__ "Acer nigrum Michx. f.; Acer nigrum; Acer saccharum nigrum; Acer nigrum F. Michx.; Acer saccharum subsp. nigrum;"| __truncated__ ...

i'm not sure what else can be done though, since the content field is pretty variable in what it contains: sometimes the first name has no authority, sometimes it does; sometimes it matches the entry in title, sometimes it doesn't. There's also no metadata describing what the different semicolon-separated entries within content represent, and I don't see any EOL docs that explain it
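
As a rough sketch of what returning the extra columns could look like (not the actual eol_search internals), the four columns can be pulled out of a parsed results list with vapply. The list structure below is assumed for illustration, built from values shown earlier in this thread, with the second content string shortened.

```r
# Two entries shaped like parsed search results; the structure
# (id, title, link, content) is an assumption for illustration.
results <- list(
  list(id = 583023L, title = "Acer floridanum (Chapm.) Pax",
       link = "http://eol.org/583023?action=overview&controller=taxa",
       content = "Acer saccharum var. floridanum (Chapm.) Small & A. Heller; Acer saccharum var. floridanum Small & A. Heller"),
  list(id = 596825L, title = "Acer nigrum F. Michx.",
       link = "http://eol.org/596825?action=overview&controller=taxa",
       content = "Acer nigrum Michx. f.; Acer nigrum; Acer saccharum nigrum")
)

# Extract one character field from every entry.
pluck_chr <- function(x, field) vapply(x, "[[", "", field)

df <- data.frame(
  pageid  = vapply(results, "[[", 1L, "id"),
  name    = pluck_chr(results, "title"),
  link    = pluck_chr(results, "link"),
  content = pluck_chr(results, "content"),
  stringsAsFactors = FALSE
)
str(df)
```

Keeping content as a raw string leaves the interpretation of the semicolon-separated names to the user, which seems safest given how variable the field is.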

glaroc commented 7 years ago

Well, if the original bug with sci2comm('Acer saccharum') is not related to the use of the title vs content field, I'm not sure anything needs to be done with the content field.

sckott commented 7 years ago

Okay, i might still return the other fields, link and content