ropensci / taxize

A taxonomic toolbelt for R
https://docs.ropensci.org/taxize

Australian Plant Names Index (APNI), Australian Plant Census (APC) and Australian National Species List (NSL) #813

Open dfalster opened 4 years ago

dfalster commented 4 years ago

Thanks for the great package. Wondering about the possibility of connecting taxize to these Australian taxonomic resources, which are all accessible via good APIs, as part of the Australian National Species List (NSL) infrastructure.

There are two main services, available via https://biodiversity.org.au/nsl/services/,

As described at the above link "this section of the National Species List infrastructure delivers names and taxonomies for flowering plants, ferns, gymnosperms, hornworts, and liverworts. The data comprise names, bibliographic information, and taxonomic concepts for plants that are either native to or naturalised in Australia. ..... The taxonomy and nomenclature adopted for the APC are endorsed by the Council of Heads of Australasian Herbaria (CHAH)." There is also a tree available at https://biodiversity.org.au/nsl/services/rest/tree/apni/51209179

The API is described at https://biodiversity.org.au/nsl/docs/main.html

Having a programmatic interface in R to these resources would be a big deal for Australian research. If it's possible to add to taxize, this seems preferable to developing a separate package.

Can you let us know whether you think this would be possible @sckott ?

sckott commented 4 years ago

thanks @dfalster !

I'll have a look into the docs. At first glance at the docs I think it will work, but i'll get back to you soon with further thoughts

Are there equivalent data sources for Australian animals?

sckott commented 4 years ago

Actually, I played with the API a little bit but I don't see any real search capability. For example, you can search on the website for APNI names here https://biodiversity.org.au/nsl/services/APNI but with the API I don't see any way to do the same thing. There's this https://biodiversity.org.au/nsl/docs/main.html#taxon-search API route, but it only appears to get one name

curl -L -H "Accept: application/json" 'https://biodiversity.org.au/nsl/services/api/name/taxon-search?q=Acacia' | jq .
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  1695    0  1695    0     0    652      0 --:--:--  0:00:02 --:--:--   652
{
  "records": {
    "synonyms": [
      {
        "taxonID": "https://id.biodiversity.org.au/taxon/apni/51311124",
        "nameType": "scientific",
        "acceptedNameUsageID": "https://id.biodiversity.org.au/taxon/apni/51311124",
        "acceptedNameUsage": "Acacia Mill.",
        "nomenclaturalStatus": null,
        "taxonomicStatus": "accepted",
        "proParte": "false",
        "scientificName": "Acacia Mill.",
        "scientificNameID": "https://id.biodiversity.org.au/name/apni/56859",
        "canonicalName": "Acacia",
        "scientificNameAuthorship": "Mill.",
        "parentNameUsageID": "https://id.biodiversity.org.au/taxon/apni/51351217",
        "taxonRank": "Genus",
        "taxonRankSortOrder": "120",
        "kingdom": "Plantae",
        "class": "Equisetopsida",
        "subclass": "Magnoliidae",
        "family": "Fabaceae",
        "created": "2009-12-15 11:08:09.0",
        "modified": "2009-12-15 11:08:09.0",
        "datasetName": "APC",
        "taxonConceptID": "https://id.biodiversity.org.au/instance/apni/603762",
        "nameAccordingTo": "CHAH (2006), Australian Plant Census",
        "nameAccordingToID": "https://id.biodiversity.org.au/reference/apni/42942",
        "taxonRemarks": null,
        "taxonDistribution": "WA (native and naturalised), NT (native and naturalised), SA (native and naturalised), Qld (native and naturalised), NSW (native and naturalised), NI (naturalised), ACT (native and naturalised), Vic (native and naturalised)",
        "higherClassification": "Plantae|Charophyta|Equisetopsida|Magnoliidae|Rosanae|Fabales|Fabaceae|Acacia",
        "firstHybridParentName": null,
        "firstHybridParentNameID": null,
        "secondHybridParentName": null,
        "secondHybridParentNameID": null,
        "nomenclaturalCode": "ICN",
        "license": "http://creativecommons.org/licenses/by/3.0/",
        "ccAttributionIRI": "https://id.biodiversity.org.au/taxon/apni/51311124"
      }
    ],
    "acceptedNames": {}
  },
  "status": {
    "enumType": "org.springframework.http.HttpStatus",
    "name": "OK"
  }
}

whereas on the website you get many names in the results. I think we really need that fuzzy search capability to be able to make a get_apni() or get_apc() function - which then forms the basis for incorporating these data sources into other useful functions in taxize.

Another key thing I'd like to see is the ability to get children and parent taxa. From the output above it looks like we have parent information which is good, but not seeing a way to get taxonomic children. Do you see that in the docs?
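To illustrate the parent information in the response above: the `higherClassification` field looks like a pipe-delimited lineage, so the ancestors can be read off on the R side without another request. A minimal sketch in base R, assuming the field always uses `|` as the separator and runs from kingdom down to the matched name; `parse_lineage` is a hypothetical helper, not a taxize function.

```r
# Hypothetical helper: split a pipe-delimited higherClassification
# string (as in the taxon-search output above) into a character vector.
parse_lineage <- function(x) {
  strsplit(x, "|", fixed = TRUE)[[1]]
}

lineage <- parse_lineage(
  "Plantae|Charophyta|Equisetopsida|Magnoliidae|Rosanae|Fabales|Fabaceae|Acacia"
)

# The last element is the matched name; the one before it is its parent.
matched <- lineage[length(lineage)]
parent  <- lineage[length(lineage) - 1]
```

This only covers ancestors; it says nothing about taxonomic children.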

dfalster commented 4 years ago

Hi @sckott

Thanks so much for taking a look so promptly. Really appreciate it. We're looking for a tool to query the APC and APNI programmatically. I suspect if we got this going it would be used widely within Australia, not only by my group :).

You're right, I can't see how to search a list of taxa in the docs. (But note, I don't really know what I'm looking for either, as APIs are not something I'm good at. What would such a query look like?)

As for the children, I can see that if you select a species, you can get the parent, and if you click on the parent you can get the children. From your example above, if you follow the link that is returned for Acacia, https://id.biodiversity.org.au/taxon/apni/51311124, you get a page that lists all the children.

So I wonder if it's a matter of first locating the id, then fetching the children?

If you can outline the interface needed, I can enquire whether it is possible with relevant people.

sckott commented 4 years ago

thanks, having a look

sckott commented 4 years ago

been working on some utility functions, remotes::install_github("ropensci/taxize@australian"), then see ?apni - Making progress.

dfalster commented 4 years ago

Fantastic. Just tried it out, looking useful already!

So if I understand right, the two main features missing from the API are

  1. Ability to search a list of names, e.g.
> apni_search(q = c("Acacia", "Eucalyptus"))
Error: Bad Request (HTTP 400)

Can we solve this one by vectorising on the taxize side?

  2. Ability to access children? If we take Acacia as an example, the id is 51311124, so https://biodiversity.org.au/nsl/services/rest/taxon/apni/51311124 is the web page with children. Adding .json to the end of this gives the data, which seems to give children?
x <- jsonlite::read_json("https://biodiversity.org.au/nsl/services/rest/taxon/apni/51311124.json")
x$treeElement$children[[1]]

Also, couldn't see how to extract the apni_id for given taxa after retrieving search results.

Again, I don't know APIs so the above may be off track.

sckott commented 4 years ago

yes, can vectorize the fxns, just hadn't gotten to that yet.
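A rough sketch of what vectorising on the R side could look like, assuming the underlying route accepts one name per request: loop over the names and collect one result per name. `search_each` and `fake_search` are hypothetical names used for illustration only; they are not what the taxize branch implements.

```r
# Hypothetical wrapper: call a single-name search function once per query
# and collect the results in a list named by query. `search_one` stands in
# for whatever function performs one request.
search_each <- function(queries, search_one, ...) {
  stats::setNames(lapply(queries, search_one, ...), queries)
}

# Offline stand-in for a real search, used only to show the result shape.
fake_search <- function(q) list(query = q, n_chars = nchar(q))

res <- search_each(c("Acacia", "Eucalyptus"), fake_search)
names(res)
```

Keeping one request per name also means a failure for one name does not have to abort the rest, if `search_one` is wrapped in `tryCatch()`.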

nice, I think that children solution should work, will try that

dfalster commented 4 years ago

Another question, how does taxize handle searches when there are spelling mistakes? I notice the APNI just returns "no results".

This is a very common issue with taxonomic name searches. In the past I have used Taxonstand, which included an argument max.distance: A number indicating the maximum distance allowed for a match in agrep when performing corrections of spelling errors in specific epithets. Guessing you need this on server side, so if not possible in the web interface, won't be possible via taxize.

sckott commented 4 years ago

taxize doesn't do anything automatically regarding spelling mistakes on the R side. i consider that a separate step for sure, distinct from searching one of the data sources. data sources vary widely in how they handle spelling. some do fuzzy search in which they account for possible spelling mistakes and return the closest matches, while some data sources do not fuzzy match and so return nothing or similar on no results found. when no results are found we typically give back NA or similar

there are specific functions in taxize to "resolve" names, e.g., gnr_resolve() and tnrs(). i'd suggest running names through a resolver function first if there's concern there may be spelling mistakes. i wish there was a better solution.

dfalster commented 4 years ago

Thanks for explaining. Makes sense. I can sort something for some fuzzy matching locally.
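For the record, a minimal sketch of what local fuzzy matching could look like, assuming a reference vector of accepted names is already on hand. It uses `utils::adist()` (generalised Levenshtein distance) rather than `agrep()`, so the threshold is a plain edit count; `closest_name` and the reference list are hypothetical.

```r
# Hypothetical helper: return the closest reference name for each query,
# or NA when nothing is within `max_distance` edits.
closest_name <- function(queries, reference, max_distance = 2) {
  d <- utils::adist(queries, reference, ignore.case = TRUE)
  best <- apply(d, 1, which.min)
  best_dist <- d[cbind(seq_along(queries), best)]
  ifelse(best_dist <= max_distance, reference[best], NA_character_)
}

# Illustrative reference list only, not real APC output.
reference <- c("Eucalyptus regnans", "Acacia abbatiana")
matches <- closest_name(c("Eucalyptus regnant", "Banksia"), reference)
```

The corrected names could then be passed on to the search functions, keeping spelling correction a separate step as suggested above.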

I now have contact details for the folks behind the APC/APNI service, so can put you in contact or deliver questions there, as needed

sckott commented 4 years ago

Good news about getting contacts. I'll work on this soon and see if there's any questions I have

dfalster commented 4 years ago

When asked whether we should link to APC, APNI or both, Anna Monro provided this description (pasted here with her permission):

it depends on what you're trying to achieve (sorry, isn't that always the answer?).

Anna can answer questions about overall diagnosis and usage. For more technical questions on the API, Anna has directed us to Anne Fuschs and her team.

I can contact them both as needed.

pmcneil commented 4 years ago

For fuzzy search you can use the suggestions API on APNI and APC. It is meant for suggestions as you type and is case insensitive. It does not help spelling mistakes. https://biodiversity.org.au/nsl/docs/main.html#suggestions-api-v1-0

We are here on github https://github.com/bio-org-au BTW so you can contact us there too and see what the API code actually does. :-)

sckott commented 4 years ago

@pmcneil regarding the results of this request https://biodiversity.org.au/nsl/services/rest/taxon/apni/51311124.json - I'm interested in pulling out data for each child of the target taxon, here's the first one as a list in R:

$displayHtml
[1] "<data><scientific><name data-id='165295'><scientific><name data-id='56859'><element>Acacia</element></name></scientific> <element>abbatiana</element> <authors><author data-id='1524' title='Pedley, L.'>Pedley</author></authors></name></scientific><name-status class=\"legitimate\">, legitimate</name-status> <citation><ref data-id='42942'><ref-section><author>CHAH</author> <year>(2006)</year>, <par-title><i>Australian Plant Census</i></par-title></ref-section></ref></citation></data>"

$elementLink
[1] "https://id.biodiversity.org.au/tree/51352295/51223378"

$nameLink
[1] "https://id.biodiversity.org.au/name/apni/165295"

$instanceLink
[1] "https://id.biodiversity.org.au/instance/apni/603763"

$excluded
[1] FALSE

$depth
[1] 9

$synonymsHtml
[1] "<synonyms><nom><scientific><name data-id='190638'><scientific><name data-id='103551'><element>Racosperma</element></name></scientific> <element>abbatianum</element> <authors>(<base data-id='1524' title='Pedley, L.'>Pedley</base>) <author data-id='1524' title='Pedley, L.'>Pedley</author></authors></name></scientific><name-status class=\"legitimate\">, legitimate</name-status> <year>(2003)</year> <type>nomenclatural synonym</type></nom><tax><scientific><name data-id='168777'><scientific><name data-id='56859'><element>Acacia</element></name></scientific> <element>sp. Mt Abbot (A.R.Bean 4873)</element></name></scientific><name-status class=\"[n/a]\">, [n/a]</name-status> <year>(1997)</year> <type>taxonomic synonym</type></tax></synonyms>"

Seems that I'd need to further parse those html strings to get names and other data out. Is there a different route or content type that I can request that has that data parsed already? The html in displayHtml doesn't seem to be organized in a way that I can figure out how to parse with xpath. If we look at the displayHtml from above

<html>
<body>
  <data>
    <scientific>
      <name data-id="165295">
        <scientific>
          <name data-id="56859">
            <element>Acacia</element>
          </name>
        </scientific>
        <element>abbatiana</element> 
        <authors>
          <author data-id="1524" title="Pedley, L.">Pedley</author>
        </authors>
      </name>
    </scientific>
    <name-status class="legitimate">, legitimate</name-status>
    <citation>
      <ref data-id="42942"><ref-section><author>CHAH</author><year>(2006)</year>, <par-title><i>Australian Plant Census</i></par-title></ref-section>
      </ref>
    </citation>
  </data>
</body>
</html>

The first <element> is nested within its own <scientific> tag, but then the 2nd <element> is not nested within its own <scientific> tag. Maybe I'm missing something here?

pmcneil commented 4 years ago

This is linked data, so follow the links. The https://id.biodiversity.org.au/name/apni/165295 link will get you name data. If you ask for it in XML, JSON or just HTML you'll get that as a result. See https://biodiversity.org.au/nsl/docs/main.html#name for example. The display HTML is there to provide a quick way of displaying quite complex results. The embedded name HTML is marked up to a) make it parsable and b) make the display of the name in HTML configurable using CSS. Note the data-id attributes are ONLY for linking up name parts in Javascript etc. in browser, not to be stored as a reference to the object. ALWAYS use the ID (https://id....) as the reference. Once again, this is linked data.

On linked data, above you are using this 'https://biodiversity.org.au/nsl/services/rest/taxon/apni/51311124.json' which is fine for a question, but it is not the reference to the tree element that should be quoted or passed around, the element link is https://id.biodiversity.org.au/tree/51352295/51342774

Re the nesting of <scientific> or actually <name> elements: You're almost there, it is nested. ie. there are two name parts in the name, the Acacia is a scientific name in itself, the Acacia abbatiana is a name with two parts. Using xpath name.element = abbatiana, name.scientific.name.element = Acacia - looks counterintuitive in xpath, but the name in question here is abbatiana, and it has a parent part Acacia. (hope that makes sense :-) )
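If parsing the display markup is still wanted (following the name link is the recommended route), the nesting described above can be sketched with the xml2 package. This assumes the `displayHtml` string is well-formed XML, which holds for the fragment quoted earlier; the paths are written out in full to show which `<element>` belongs to which `<name>`.

```r
library(xml2)

# Name markup for the first child, trimmed from the displayHtml shown above.
markup <- paste0(
  "<data><scientific><name data-id='165295'>",
  "<scientific><name data-id='56859'><element>Acacia</element></name></scientific> ",
  "<element>abbatiana</element> ",
  "<authors><author data-id='1524' title='Pedley, L.'>Pedley</author></authors>",
  "</name></scientific></data>"
)
doc <- read_xml(markup)

# The outer <name> holds the epithet directly ...
epithet <- xml_text(xml_find_first(doc, "/data/scientific/name/element"))
# ... and the nested <scientific>/<name> holds the parent part (the genus).
genus <- xml_text(
  xml_find_first(doc, "/data/scientific/name/scientific/name/element")
)
author <- xml_text(xml_find_first(doc, "/data/scientific/name/authors/author"))

full_name <- paste(genus, epithet, author)
```

As noted above, the `data-id` attributes are only for in-browser linking, so the sketch deliberately ignores them.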

sckott commented 4 years ago

Thanks @pmcneil for the explanations. I still don't quite grok all the different identifiers. Is there any documentation on the identifiers?

chrisbitmead commented 4 years ago

All the existing documentation is at https://biodiversity.org.au/nsl/docs/main.html

There's probably not much to know about identifiers. Just remember, everything that starts with https://id.biodiversity.org.au is an identifier, identifiers are a "black box" that identifies something or other, and you can find what it identifies by going to that URL. (and if you add .json to the URL, or pass application/json as the contentType on the request, you'll find out what is behind the identifier in json format).
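In R, the ".json" convention described above could be wrapped roughly like this. `as_json_url` is a hypothetical helper, and the sketch assumes every identifier URL accepts the `.json` suffix.

```r
# Hypothetical helper: turn an identifier URL into its JSON form by
# appending ".json", leaving URLs that already end in ".json" alone.
as_json_url <- function(id_url) {
  ifelse(grepl("\\.json$", id_url), id_url, paste0(id_url, ".json"))
}

url <- as_json_url("https://id.biodiversity.org.au/name/apni/165295")

# Fetching needs network access and the jsonlite package, so it is left
# commented out here:
# x <- jsonlite::read_json(url)
```

The identifier itself is still treated as opaque: the helper only changes how it is requested, not what is stored as the reference.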

pmcneil commented 4 years ago

@chrisbitmead The identifiers refer to specific objects/things e.g. name, reference, author, "instance" etc.

The instance is in many ways a taxon, though the taxon link is to where an accepted instance sits in the accepted classification (APC). While we do have documentation it may not be adequate. I believe Anne is going to respond to your queries too, but I hope the above makes some sense?

sckott commented 4 years ago

Thanks @chrisbitmead and @pmcneil for the clarifications

I've not been able to work on this for a few weeks, i'll get back to this soon

sckott commented 3 years ago

to do"

sckott commented 3 years ago

@dfalster get_apni() working now and all apni utility fxns vectorized.

children and classification not working yet.

problem with children i'm hitting is I don't see a way to get to the id needed for the page that has children from a name id, e.g., above you shared the link https://biodiversity.org.au/nsl/services/rest/taxon/apni/51311124 that has taxonomic children for Acacia. however, the name id for Acacia is https://biodiversity.org.au/nsl/services/rest/name/apni/56859/api/apni-format and I don't see how to get that 51311124 id programmatically to be able to get children. Any ideas?

chrisbitmead commented 3 years ago

@sckott Does this help?...

https://biodiversity.org.au/nsl/services/rest/name/apni/56859/api/apc.json

sckott commented 3 years ago

@chrisbitmead ah thanks, that will probably do it

dfalster commented 3 years ago

Thanks for your continued work here @sckott !

sckott commented 3 years ago

@dfalster when you get a chance to try this out: children and classification are now done as well.

remotes::install_github("ropensci/taxize@australian")
library(taxize)
# see man files for each fxn
?`apni-search`
?apni_classification
?apni_children
?apni_family
?apni_id
?children
?classification

dfalster commented 3 years ago

Hi @sckott. Cool! seems to be mostly working. I can confirm that get_apni, apni_classification, apni_children, apni_search all work well.

The only issue I encountered is that apni_family does not return sensible results. E.g. The following is a search for Eucalyptus regnans, id = 101747, which should return "Myrtaceae":

> apni_family(id = 101747)
[[1]]
[[1]]$name
[1] "regnans"

[[1]]$link
[1] "https://id.biodiversity.org.au/name/apni/101747"

[[1]]$instances
# A tibble: 30 x 7
   type      link              pages   name                                      protologue citation                                   auth_year          
   <chr>     <chr>             <chr>   <chr>                                     <lgl>      <chr>                                      <chr>              
 1 secondar… https://id.biodi… 181     <scientific><name data-id='54484'><eleme… FALSE      Bailey, F.M. (1913), Comprehensive Catalo… F.M.Bailey, 1913   
 2 secondar… https://id.biodi… 54

The other BIG point to consider is that the Australian taxonomic system has two components: The Australian Plant Names Index (APNI) & the Australian Plant Census (APC). So far we have linked against the first, but ideally we would be able to query both. I'm not an expert on the distinction, but my understanding is that the APC contains all the information about currently accepted species, including whether a name is a synonym or not.

sckott commented 3 years ago

Thanks for having a look!

Okay, i'll

sckott commented 3 years ago

@dfalster hmm, do we need the family function? I don't think you asked for it as far as I can remember. I think I added it just cause the route is there, but you can easily get family with classification(101747, db = "apni"). Okay if I remove the apni_family function?

sckott commented 3 years ago

@dfalster For APC vs. APNI, it doesn't seem like a simple thing we can allow users to switch between. Let's look at the API routes used in the functions we have so far:

  • apni-search - search APNI on full name as per the apni name search service,
  • apc-search - search APC on full name as per the search service

whereas I'm using acceptableName (last part of route above). So looks like I could allow users to go between apni and apc for this fxn

sckott commented 3 years ago

@dfalster ☝🏽 any thoughts?