matildabrown / rWCVP

Generating Summaries, Reports and Plots from the World Checklist of Vascular Plants
https://matildabrown.github.io/rWCVP/
GNU General Public License v3.0

Separating author information #47

Open alrichardbollans opened 1 year ago

alrichardbollans commented 1 year ago

I notice that the matching process assumes author information is not given in the name_col: it is either in a separate column or absent altogether. However, many datasets give all of this information in the same column, e.g. Condylocarpus Hoffm. The matching still seems to work for some of these cases (by editing out the author names), but I wonder how you handle these situations with the package, and whether you have a method for separating author information into its own column prior to matching?

matildabrown commented 1 year ago

My method so far has been to split the name and author on spaces, e.g. with str_split. This only really works (for species-level matching) when the first two words are 'Genus' and 'species'; the rest doesn't matter, because the name can be matched without the author string. However, it gets messy very quickly once infraspecifics and hybrids are involved: the number of words before the author string varies, the first two words are not always the genus and specific epithet, and author strings can even be embedded in the name portion (e.g. Genus species Auth1 subsp. subspecies Auth2). There are algorithmic workarounds if the dataset is consistent, but we don't have a general solution, I'm afraid.
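
For illustration, here is a minimal sketch of the whitespace-splitting idea described above. It is not code from the package. It uses base R's strsplit in place of str_split, and the function name split_name_author and the output column names are hypothetical. It assumes the first two words are the genus and specific epithet, and it flags anything else (genus-only names, rank markers, hybrid signs) for manual review rather than trying to split it.

```r
# Hypothetical helper, not part of rWCVP: split "Genus species Author" strings.
split_name_author <- function(x) {
  # Collapse repeated whitespace and trim the ends.
  x <- trimws(gsub("\\s+", " ", x))

  # Rank or hybrid markers mean the two-word rule is not safe.
  # (The marker list is an assumption; extend it for your own data.)
  has_marker <- grepl("(^| )(subsp\\.|var\\.|f\\.|x|\u00d7)( |$)", x)

  # Capitalised genus followed by a lowercase epithet.
  is_binomial <- grepl("^[A-Z][a-z-]+ [a-z][a-z-]+( |$)", x)
  ok <- is_binomial & !has_marker

  # First two words form the name; everything after is the author string.
  words <- strsplit(x, " ", fixed = TRUE)
  taxon <- vapply(words, function(w) paste(head(w, 2), collapse = " "),
                  character(1))
  author <- vapply(words, function(w) paste(tail(w, -2), collapse = " "),
                   character(1))

  data.frame(
    original = x,
    taxon_name = ifelse(ok, taxon, NA_character_),
    taxon_authors = ifelse(ok & nzchar(author), author, NA_character_),
    needs_review = !ok,
    stringsAsFactors = FALSE
  )
}

# Example input: one simple binomial plus the two cases from this thread.
res <- split_name_author(c(
  "Poa annua L.",
  "Condylocarpus Hoffm.",
  "Genus species Auth1 subsp. subspecies Auth2"
))
print(res)
```

Only the first input is split. The genus-plus-author and infraspecific examples are returned with needs_review set to TRUE, which reflects the limitation noted above: without a consistent dataset there is no general rule for locating the author string.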