pablopains / parseGBIF

The parseGBIF package is designed to convert [Global Biodiversity Information Facility - GBIF](https://www.gbif.org/) plant specimen occurrence data to a more comprehensible format for further analysis, e.g. spatial analysis.
GNU General Public License v2.0

Alternative collection event key #2

Open LPDagallier opened 11 months ago

LPDagallier commented 11 months ago

Hi again,

In addition to the extreme trickiness of extracting the collectors' correct last names (see issue #1), I foresee another issue with using the collector's name in the collection event key: the unfortunate case where two collectors with the same last name (e.g. common names like "Rodriguez" or "Smith") each collect a plant from the same family under the same collection number. I know it is unlikely, but I guess it can happen in super diverse plant families.

What about using the collection date to identify a collection event as unique? E.g. a key composed of family + genus + species + collection date + collection number, or family + collector name + collection number + collection date.

Another option would be to allow the users to compose their own collection key according to their needs using fields in the dataset retrieved from cleancoords_parse_full(). I guess it can be done with slight modifications of the function generate_collection_event_key(), with the option to provide the key in the form of a character vector, e.g. generate_collection_event_key(key = c("Ctrl_family", "Ctrl_nameRecordedBy_Standard", "Ctrl_recordNumber")).
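To make the idea concrete, here is a minimal sketch of a user-composed key in base R. The helper `compose_event_key()` is hypothetical and is not part of parseGBIF; the column names are the ones quoted above, and the toy records are invented. It only assumes that each key field is a column of the data frame.

```r
# Hypothetical helper (NOT a parseGBIF function): builds a collection event
# key from any user-chosen set of columns.
compose_event_key <- function(occ, key) {
  missing_cols <- setdiff(key, names(occ))
  if (length(missing_cols) > 0) {
    stop("Columns not found: ", paste(missing_cols, collapse = ", "))
  }
  # Normalise each field: trim whitespace, upper-case, empty string for NA.
  parts <- lapply(occ[key], function(x) {
    was_na <- is.na(x)
    x <- toupper(trimws(as.character(x)))
    x[was_na] <- ""
    x
  })
  # Join the normalised fields with "_" to get one key per record.
  do.call(paste, c(unname(parts), sep = "_"))
}

# Toy records, invented for illustration only.
occ <- data.frame(
  Ctrl_family = c("Melastomataceae", "Melastomataceae", "Piperaceae"),
  Ctrl_nameRecordedBy_Standard = c("SMITH", "smith ", "RODRIGUEZ"),
  Ctrl_recordNumber = c("1234", "1234", "87"),
  stringsAsFactors = FALSE
)

occ$event_key <- compose_event_key(
  occ,
  key = c("Ctrl_family", "Ctrl_nameRecordedBy_Standard", "Ctrl_recordNumber")
)
print(occ$event_key)
```

With this toy input the first two records share one key and the third gets its own. Adding a date column to `key` would be a one-word change, which is the flexibility I have in mind.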

Let me know if that makes sense, and if I can help in any way. Again, thank you very much for developing such a great tool!

Léo-Paul

pablopains commented 11 months ago

Dear Léo-Paul,

Thank you very much for testing the package and for your comments and suggestions for improving it. It's a great idea to allow users to assemble their own keys for merging duplicates. I will implement it in the next version.

A little history of the collector's dictionary:

The concept of a collector's dictionary was developed during my doctorate (2015-2020), based on experiments by Alberto Vicentini (INPA, Manaus). Then, between 2020 and 2023, I implemented the tool for the CNCFlora-JBRJ (Brazilian flora red list assessment) workflow. Finally, we published the Peperomia case study workflow in Moura et al. (http://hdl.handle.net/11449/246208). We have now improved record selection in parseGBIF.

The key consisting of family, last name and collection number was proposed by Nicky Nicholson (Kew): https://www.gbif.org/news/4n8ZCfuK3zxseKAHRMcfA8/award-winner-uses-data-mining-and-machine-learning-to-identify-collectors-and-duplicated-herbarium-specimens

Composing the key with the collection date would certainly be the best option to avoid collisions. However, the collection date, even if reduced to the year of collection, is often not entered into the databases. At CNCFlora-JBRJ we tested a key that included the collection year, but because the date is so often missing, the result was less satisfactory than with the last name of the main collector.
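A small sketch of why the missing dates hurt, using invented records: three duplicates of one gathering, of which only one has the year databased. `Ctrl_year` and `last_name` are assumed column names used for illustration only.

```r
# Toy duplicates of a single gathering held by three herbaria (invented data).
# Ctrl_year and last_name are assumed column names; NA means the year was
# never entered into the database.
dups <- data.frame(
  Ctrl_family = rep("Piperaceae", 3),
  last_name = rep("SMITH", 3),
  Ctrl_recordNumber = rep("1234", 3),
  Ctrl_year = c(1998, NA, NA),
  stringsAsFactors = FALSE
)

# Key without the year: family + last name + collection number.
key_name <- paste(dups$Ctrl_family, dups$last_name, dups$Ctrl_recordNumber, sep = "_")

# Key with the year appended; paste() turns a missing year into the text "NA".
key_year <- paste(key_name, dups$Ctrl_year, sep = "_")

n_groups_name <- length(unique(key_name))  # 1: all duplicates grouped together
n_groups_year <- length(unique(key_year))  # 2: the dated record is split off
print(c(n_groups_name, n_groups_year))
```

The key without the year keeps the three duplicates together, while the key with the year separates the one dated record from the two undated ones. This is the kind of split we saw in our tests.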

Yes, it is hard work to check the extracted dictionary. In our tests, the extraction function matches more than 90% of surnames, and I am working to improve it.

We are processing the entire GBIF database of preserved plant specimens to generate a collector's dictionary that we will share in the package. This way we hope to drastically reduce the need for users to check the dictionary manually. I am fully dedicated to this.

I am available to work in partnership to improve the data package.

Thank you very much.

Best regards,
Pablo Melo

eliane-anunciacao commented 4 months ago

Hello! I'm wondering if it's possible to use the package for a list of species. I noticed the example uses a family, but I'm working with a list of species from various families.

pablopains commented 4 months ago

Hello Eliane, thanks for the message.

The set of data to be processed depends on the filters used in the search for occurrences carried out on the GBIF portal.

Yes, it is possible to work with a list of species (preferably with their respective synonyms). However, the GBIF taxonomic resolution would interfere, and the search would only reach identified specimens, leaving out unidentified exsiccatae as well as specimens with wrong identifications.

We download by botanical family to try to avoid this "interference" when obtaining data from GBIF. With this approach, the target species can be filtered afterwards.
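As a rough sketch of that later filtering step, assuming the family-level download has already been read into a data frame: the `species` column name and the records are invented for illustration, and the target list should in practice also contain the relevant synonyms.

```r
# Toy family-level download (invented records). The "species" column name is
# hypothetical; NA marks a specimen identified only to genus or family.
occ <- data.frame(
  Ctrl_family = c("Piperaceae", "Piperaceae", "Piperaceae"),
  species = c("Peperomia alata", "Peperomia blanda", NA),
  stringsAsFactors = FALSE
)

# Target names for the study; in practice include known synonyms as well.
target_species <- c("Peperomia alata", "Peperomia glabella")

# Records already identified as one of the target species.
targets <- occ[occ$species %in% target_species, ]

# Unidentified records that a species-level search would never have returned;
# they remain available for later determination.
to_review <- occ[is.na(occ$species), ]

print(nrow(targets))
print(nrow(to_review))
```

The point is only that the unidentified specimen is still in the dataset after a family-level download, so it can be reviewed instead of being lost at the search stage.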

If you want, I could help you find a way to obtain the data for your research. What is the objective of your study: taxonomy, distribution modeling, biogeography, conservation? I would be happy to try to help. My email for details is pablopains@yahoo.com.br.

pablopains commented 4 months ago

Dear Léo-Paul,

Following your recommendation, I implemented an improvement in the function parseGBIF::collectors_get_name().

With the last_name_selection_type parameter, it is possible to select whether the function returns last_name or large_string.

I used both methods in parallel to speed up manual verification: where the results of both methods coincide, errors in extracting the surname of the main collector are less likely.
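The comparison step can be sketched as below. The two result columns are filled with invented values and do not reproduce what collectors_get_name() actually returns for each option; the sketch only shows how agreement between two extractions can be used to narrow down manual checking.

```r
# Invented results of two surname extractions for the same recordedBy strings.
# These values are illustrative only, not real output of parseGBIF.
res <- data.frame(
  recordedBy = c("Smith, J.; Jones, A.", "J. A. Rodriguez", "Silva Neto, P."),
  by_last_name = c("SMITH", "RODRIGUEZ", "NETO"),
  by_large_string = c("SMITH", "RODRIGUEZ", "SILVA"),
  stringsAsFactors = FALSE
)

# A record counts as agreed only if both methods returned the same non-missing value.
res$agree <- !is.na(res$by_last_name) &
  !is.na(res$by_large_string) &
  res$by_last_name == res$by_large_string

# Only the disagreements go to manual verification.
needs_review <- res[!res$agree, ]
print(needs_review$recordedBy)
```

In this toy case only the third record would need to be checked by hand.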

Thanks