phyloref / klados

A curation tool to edit test cases for the Phyloref curation workflow
http://www.phyloref.org/klados/
MIT License

Attempt to parse specifiers directly from the clade definition #18

Open gaurav opened 6 years ago

gaurav commented 6 years ago

This will likely take some pretty sophisticated regular expressions, so we might want to wait until we have a better idea of what clade definitions (particularly those in RegNum) look like.

hlapp commented 6 years ago

If the specifier is a taxon name, we should use gnparser, so the task reduces to determining where a taxon name string starts and likely ends. (@dimus it seems gnparser will not strip off extraneous leading and trailing text, and instead expects the name string to start at position 0, right?)

If the specifier is a specimen reference, I suspect iDigBio has some code we can borrow from?

gaurav commented 6 years ago

This might be easier to do with abbreviated PhyloCode definitions.

dimus commented 6 years ago

@hlapp gnparser can break down scientific names that have already been extracted, but it cannot find them in running text. For name extraction I am working on the gnfinder project (https://github.com/gnames/gnfinder). Can you give an example of the specifiers, so I can understand the problem better?

gaurav commented 6 years ago

Hi @dimus, and thanks for your interest! Here is an example clade definition from Fisher et al., 2007:

Syrrhopodon

nomen cladi conversum, Syrrhopodon gardneri (Hook.) Schwägr., Sp. Musc. Frond. Suppl. 2(1): 110, tab. 131, figs. 1–13. (1824)

Stem-based definition:

  • internal specifier: Type: Syrrhopodon croceus Mitt., J. Proc. Linn. Soc., Bot. Suppl. 1: 41. (1859)
  • internal specifier: Type: Leucophanes octoblepharoides Brid., Bryol. Univ. 1: 763. (1826)
  • external specifier: Type: Syrrhopodon mauritianus Müll. Hal. ex Ångstr., Öfv. Förh. Kongl. Svenska Vet.-Akad. 33(4): 54. (1876)

Our curators can copy the entire specifier (e.g. "Type: Leucophanes octoblepharoides Brid., Bryol. Univ. 1: 763. (1826)") into a verbatim specifier field. Once we implement this issue, we hope to send this string to gnparser and have it identify the genus name, specific epithet and authority. Would gnparser be confused by the "Type: " at the start of the scientific name? If so, we could use a simple regular expression to try to find the start of the binomial name, and then feed the rest of the string to gnparser for splitting.
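To make the "simple regular expression" idea concrete, here is a minimal sketch in Python. It assumes the binomial always looks like a capitalized genus followed by a lowercase ASCII epithet, which will not hold for every specifier. The pattern `BINOMIAL_START` and the function `strip_leading_text` are hypothetical names for illustration, and the call to gnparser itself is left out.

```python
import re

# Hypothetical pattern: a capitalized word (genus) followed by a space and a
# lowercase word of at least two letters (specific epithet). ASCII only.
BINOMIAL_START = re.compile(r"\b[A-Z][a-z]+ [a-z]{2,}\b")


def strip_leading_text(verbatim: str) -> str:
    """Drop any text before the first thing that looks like a binomial.

    If nothing looks like a binomial, return the string unchanged so that a
    curator can still fix it by hand.
    """
    match = BINOMIAL_START.search(verbatim)
    return verbatim[match.start():] if match else verbatim


if __name__ == "__main__":
    verbatim = "Type: Leucophanes octoblepharoides Brid., Bryol. Univ. 1: 763. (1826)"
    # The returned string is what we would then hand to gnparser.
    print(strip_leading_text(verbatim))
```

"Type:" is skipped because the capitalized word is followed by a colon rather than a space and a lowercase word, so the match begins at the genus name.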

The next step would be to see if we could parse the entire clade definition and identify all the specifiers. One way of doing that would be to use gnfinder to look for the scientific names, and then try to determine if each scientific name should be treated as an internal or external specifier (by looking for the word "internal" right before it, say). Since all of this work is for our curation tool, we could take a guess and then allow our curators to fix any incorrect guesses.
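The guessing step could look something like the sketch below. It does not call gnfinder: as a stand-in, it assumes each specifier sits on its own line and is labelled "internal specifier:" or "external specifier:", as in the Syrrhopodon example above. `GuessedSpecifier`, `SPECIFIER_LINE` and `guess_specifiers` are hypothetical names, and the output is only a first guess for curators to correct.

```python
import re
from typing import List, NamedTuple


class GuessedSpecifier(NamedTuple):
    kind: str      # "internal" or "external"
    verbatim: str  # the rest of the line, to be shown in the verbatim field


# Assumption: each specifier is on its own line and labelled
# "internal specifier:" or "external specifier:".
SPECIFIER_LINE = re.compile(
    r"\b(internal|external)\s+specifier\s*:\s*(.+)", re.IGNORECASE
)


def guess_specifiers(definition: str) -> List[GuessedSpecifier]:
    """Return a first guess at the specifiers in a clade definition."""
    guesses = []
    for line in definition.splitlines():
        match = SPECIFIER_LINE.search(line)
        if match:
            guesses.append(
                GuessedSpecifier(match.group(1).lower(), match.group(2).strip())
            )
    return guesses


if __name__ == "__main__":
    definition = """Stem-based definition:
  • internal specifier: Type: Syrrhopodon croceus Mitt., J. Proc. Linn. Soc., Bot. Suppl. 1: 41. (1859)
  • external specifier: Type: Syrrhopodon mauritianus Müll. Hal. ex Ångstr., Öfv. Förh. Kongl. Svenska Vet.-Akad. 33(4): 54. (1876)
"""
    for guess in guess_specifiers(definition):
        print(guess.kind, "->", guess.verbatim)
```

Definitions written as running prose would not match this pattern, which is where a name finder such as gnfinder would be needed instead.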

Thanks again!