pablopains / parseGBIF

The parseGBIF package is designed to convert [Global Biodiversity Information Facility - GBIF](https://www.gbif.org/) plant specimen occurrence data into a more comprehensible format for further analysis, e.g. spatial analysis.
GNU General Public License v2.0

Issue with collector_get_name #1

Open LPDagallier opened 11 months ago

LPDagallier commented 11 months ago

Hi,

Thank you very much for developing this great tool!

Some colleagues and I have been testing the package on a custom dataset, and we noticed some issues with correctly extracting the collector names for some samples. This is clearly due to the lack of a standard for the recordedBy field on GBIF, and you did a great job of working around this problem! In some cases, however, the collectors_get_name() function returns an incorrect collector name. For example, collectors_get_name("A. J. PEREZ;G. BUITRÓN;W. SANTILLÁN") correctly returns PEREZ, but collectors_get_name("ALVARO J. PEREZ;G. BUITRÓN;W. SANTILLÁN") incorrectly returns ALVARO. This seems to occur because collectors_get_name() selects the longer of the two strings "ALVARO" and "PEREZ" on line 49 (vll = which(nchar(xx) == max(nchar(xx)))). I guess there is a specific reason why this line selects the longest string, but in this case it returns the first name instead of the last name.
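To make the behaviour described above concrete, here is a minimal sketch of a longest-token heuristic. It is not parseGBIF's actual code: the function name longest_token is hypothetical, and it assumes that ";" separates collectors, that name parts are separated by whitespace, and that ties go to the first token. Only the selection rule itself is taken from the line quoted above.

```r
# Hypothetical re-creation of the longest-token rule; NOT the package's code.
longest_token <- function(recorded_by) {
  # Assumption: collectors are separated by ";" and the first one is the main collector.
  main <- strsplit(recorded_by, ";", fixed = TRUE)[[1]][1]
  # Split the main collector's name into whitespace-separated tokens.
  xx <- strsplit(trimws(main), "\\s+")[[1]]
  # Selection rule quoted in the issue: indices of the tokens with maximal length.
  vll <- which(nchar(xx) == max(nchar(xx)))
  # Assumption: on a tie, keep the first of the longest tokens.
  xx[vll[1]]
}

longest_token("A. J. PEREZ;G. BUITRÓN;W. SANTILLÁN")      # "PEREZ"
longest_token("ALVARO J. PEREZ;G. BUITRÓN;W. SANTILLÁN")  # "ALVARO"
```

With initials, "PEREZ" (5 characters) is the longest token; once the first name is spelled out, "ALVARO" (6 characters) wins, which matches the two results reported here.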

Unfortunately, I don't have a solution to propose for fixing this problem in collectors_get_name(), and it seems extremely tricky to accommodate all the different cases that can occur in the recordedBy GBIF field.

Reproducible examples:

texts <- c(
  "ALVARO J. PEREZ;G. BUITRÓN;W. SANTILLÁN",
  "A. J. PEREZ;G. BUITRÓN;W. SANTILLÁN",
  "ROGER J. PEREZ;G. BUITRÓN;W. SANTILLÁN",
  "ROGER C. ANDERSON;SCOTT A. MORI"
)
# Call the function on each example in turn, so that every case is tested.
for (text in texts) print(collectors_get_name(text))

pablopains commented 11 months ago

Dear Léo-Paul,

Thank you very much for testing the package and pointing out this problem with great precision.

I am curating the world database of Rubiaceae collectors, which has 250 thousand lines, and I have also felt the need for some improvements.

Yes, the function returns the longest string in the main collector's name. I'm working on a parameter that lets the user choose between selecting the longest string and selecting the last name. The challenge with the second option is to avoid picking up abbreviations.

The advantage of the first option is that it works for the majority of Spanish names, in which the paternal surname is the penultimate one.
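The trade-off between the two options can be sketched as follows. This is only an illustration under stated assumptions, not the planned implementation: the function name pick_surname and its method argument are hypothetical, abbreviations are assumed to be single letters or tokens ending in ".", and "JOSE RODRIGUEZ DIAZ" is a made-up name used to show the Spanish-name case.

```r
# Hypothetical helper; pick_surname and `method` are illustrative, not parseGBIF API.
pick_surname <- function(recorded_by, method = c("longest", "last")) {
  method <- match.arg(method)
  # Assumption: ";" separates collectors and the first one is the main collector.
  main <- strsplit(recorded_by, ";", fixed = TRUE)[[1]][1]
  tokens <- strsplit(trimws(main), "\\s+")[[1]]
  # Assumption: single letters and tokens ending in "." are abbreviations; drop them.
  is_abbrev <- grepl("\\.$", tokens) | nchar(tokens) == 1
  words <- tokens[!is_abbrev]
  if (length(words) == 0) return(NA_character_)
  if (method == "longest") {
    words[which.max(nchar(words))]  # first of the longest remaining words
  } else {
    words[length(words)]            # last remaining word
  }
}

pick_surname("ALVARO J. PEREZ;G. BUITRÓN;W. SANTILLÁN", method = "last")     # "PEREZ"
pick_surname("ALVARO J. PEREZ;G. BUITRÓN;W. SANTILLÁN", method = "longest")  # "ALVARO"
pick_surname("JOSE RODRIGUEZ DIAZ", method = "longest")                      # "RODRIGUEZ"
pick_surname("JOSE RODRIGUEZ DIAZ", method = "last")                         # "DIAZ"
```

In this sketch neither rule is right for every name: "last" fixes the ALVARO J. PEREZ case but returns the maternal surname for the two-surname example, while "longest" does the opposite, which is why a user-selectable option seems reasonable.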

I will let you know as soon as I finish writing the option to force selection of the last name.

Thank you very much.

Best regards,
Pablo Melo