Here we have gathered tools that help match (biological) collector names to other resources, such as WikiData names and their related IDs. This approach is based primarily on Niels Klazenga’s work from the Virtual Herbarium of Australia (☞ https://github.com/nielsklazenga/avh-collectors/), thank you for that ;-)
What you need first:
Steps in general:
1. Getting data (source names)
2. Getting data (resource names)
3. Matching of names
4. Deciding on the associated linkages
As a visual summary, the processing goes like:
flowchart LR
get_data["getting data"] ---> match_data["matching data"] --> results["results and output"]
source["fa:fa-table collector names\n(source)"] -->|dwcagent\nparsing| prepareSource["prepare names/\nname lists"]
resource["fa:fa-table wikidata names\n(resource)"] --> prepareResource["prepare names"]
prepareSource --> matching{"fa:fa-cogs\nngram-language-analysis\nk-nearest neighbour distance/\ncosine similarity\n…"}
prepareResource --> matching
matching --> CSVoutput["fa:fa-table CSV output\naccording to \nDwC agent attribution"]
Two approaches to calculating name similarities and distances were pursued for this code. They are labelled with tags:

- vX.X-match-family-last: name matching uses “given + particle … family, suffix”; this is the newer calculation approach, e.g. v0.1-match-family-last (of 2023-11-21)
- vX.X-match-family-first: name matching uses “family, given + particle …”; this is the old calculation approach (will not be continued), e.g. v1.0-match-family-first (of 2023-11-16, commit 47178e…)

Get resource names from WikiData to compare the collector source names with:

- create_wikidata_datasets_botanists.ipynb: gets data of botanists from WikiData

Get or construct source names, i.e. collector name lists. See the following examples, where in most cases we use the GBIF occurrence data of the institutions themselves:
Institution | Remarks | Script(s) |
---|---|---|
BGBM | plain name data | create_bgbm_gbif-occurrence_collectors_dataset.ipynb |
BGBM | name data with collection date (`eventDate`) for lifetime comparison | create_bgbm_gbif-occurrence_collectors_eventDate_dataset.ipynb |
Meise | name data with collection date (`eventDate`) for lifetime comparison | create_meise_gbif-occurrence_collectors_eventDate_dataset.ipynb |
Naturalis | name data with collection date (`eventDate`) for lifetime comparison | create_naturalis_gbif-occurrence_collectors_eventDate_dataset.ipynb |
Plazi | Plazi’s Collection Statistics “Materials Citation Data” | create_plazi_collectors_dataset.ipynb, create_and_match_plazi_collectors_dataset.ipynb |
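
As a rough illustration of what constructing such a source name list involves, the following sketch groups occurrence rows by collector and records the first and last collecting year. It is not taken from the notebooks listed above. It assumes tab-separated occurrence data with the Darwin Core columns `recordedBy` and `eventDate`; the function name `collect_names_with_dates` and the sample rows are made up for this example.

```python
import csv
import io
from collections import defaultdict


def collect_names_with_dates(rows):
    """Map each collector name to the (first, last) year found in eventDate.

    `rows` is any iterable of dicts with the keys `recordedBy` and
    `eventDate` (assumed column names, following Darwin Core).
    """
    years_by_name = defaultdict(list)
    for row in rows:
        name = (row.get("recordedBy") or "").strip()
        if not name:
            continue  # skip records without a collector
        years = years_by_name[name]  # registers the name even without a date
        event_date = (row.get("eventDate") or "").strip()
        # Only the leading four-digit year of the date string is used here.
        if len(event_date) >= 4 and event_date[:4].isdigit():
            years.append(int(event_date[:4]))
    return {
        name: (min(ys), max(ys)) if ys else (None, None)
        for name, ys in years_by_name.items()
    }


# Illustrative rows only, not real occurrence data.
sample = io.StringIO(
    "recordedBy\teventDate\n"
    "Humboldt, A. von\t1801-03-15\n"
    "Humboldt, A. von\t1799-07-16\n"
    "Ule, E.\t\n"
)
names = collect_names_with_dates(csv.DictReader(sample, delimiter="\t"))
for name, (first_year, last_year) in sorted(names.items()):
    print(name, first_year, last_year)
```

Keeping the year range next to each name is what later allows a candidate match to be checked against a person's lifetime.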
See ☞ bin/README.md.
Institution | Remarks | Script(s) |
---|---|---|
BGBM | cosine similarity, with collection date (`eventDate`) for lifetime comparison | match_names_BGBM-dwcagent-parsed-eventDate_vs_WikiData_cosine-similarity.ipynb |
BGBM | k-nearest neighbour distance, with collection date (`eventDate`) for lifetime comparison | match_names_BGBM-dwcagent-parsed-eventDate_vs_WikiData_k-nearest.ipynb |
Meise | cosine similarity, with collection date (`eventDate`) for lifetime comparison | match_names_Meise-dwcagent-parsed-eventDate_vs_WikiData_cosine-similarity.ipynb |
Meise | k-nearest neighbour distance, with collection date (`eventDate`) for lifetime comparison | match_names_Meise-dwcagent-parsed-eventDate_vs_WikiData_k-nearest.ipynb |
Naturalis | cosine similarity, with collection date (`eventDate`) for lifetime comparison | match_names_Naturalis-dwcagent-parsed-eventDate_vs_WikiData_cosine-similarity.ipynb |
Naturalis | k-nearest neighbour distance, with collection date (`eventDate`) for lifetime comparison | match_names_Naturalis-dwcagent-parsed-eventDate_vs_WikiData_k-nearest.ipynb |
Plazi | k-nearest neighbour distance, with citation date for lifetime comparison | create_and_match_plazi_collectors_dataset.ipynb |
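
To give an idea of what cosine similarity and k-nearest neighbour distance mean for name strings, here is a minimal, dependency-free sketch. It is an assumption-laden illustration and not the code of the notebooks above: it uses plain character trigram counts without any weighting, and the helper names `char_ngrams`, `cosine_similarity` and `k_nearest` as well as the candidate names are chosen for this example only.

```python
import math
from collections import Counter


def char_ngrams(name, n=3):
    """Count overlapping character n-grams of a lower-cased, space-padded name."""
    padded = f" {name.lower().strip()} "
    return Counter(padded[i:i + n] for i in range(len(padded) - n + 1))


def cosine_similarity(a, b):
    """Cosine of the angle between two n-gram count vectors (0.0 if one is empty)."""
    dot = sum(count * b[gram] for gram, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def k_nearest(source_name, resource_names, k=3):
    """Return the k resource names with the smallest cosine distance."""
    source_vector = char_ngrams(source_name)
    scored = [
        (1.0 - cosine_similarity(source_vector, char_ngrams(candidate)), candidate)
        for candidate in resource_names
    ]
    return sorted(scored)[:k]


# Illustrative candidate list, standing in for the WikiData resource names.
candidates = ["Alexander von Humboldt", "Wilhelm von Humboldt", "Ernst Ule"]
nearest = k_nearest("A. von Humboldt", candidates, k=2)
for distance, candidate in nearest:
    print(f"{distance:.3f}  {candidate}")
```

An abbreviated given name can sit at a similar distance from several people who share a family name, which is one reason the notebooks also use the collection date for a lifetime comparison.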
BGBM examples without `eventDate` (sampling date): the result data were removed because this is effectively old code. It is better to have some kind of sampling date or `eventDate` reference so that the lifetime of a collector can be matched as well.
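
The lifetime comparison mentioned here can be pictured with a small sketch like the one below. It is a simplified assumption of how such a check might look, not the logic of the notebooks: the function name `plausible_lifetime`, the `min_age` threshold, and the candidate records with their years are all hypothetical.

```python
def plausible_lifetime(event_year, birth_year, death_year, min_age=10):
    """Return True if a collecting year is compatible with a person's life dates.

    Unknown values (None) are treated as "cannot be ruled out".
    `min_age` is an assumed minimum collecting age, not a value from the project.
    """
    if event_year is None:
        return True
    if birth_year is not None and event_year < birth_year + min_age:
        return False
    if death_year is not None and event_year > death_year:
        return False
    return True


# Hypothetical candidates that matched the same collector name.
matched_candidates = [
    {"name": "candidate A", "birth": 1769, "death": 1859},
    {"name": "candidate B", "birth": 1850, "death": 1915},
    {"name": "candidate C", "birth": None, "death": None},
]
event_year = 1801
kept = [
    c["name"]
    for c in matched_candidates
    if plausible_lifetime(event_year, c["birth"], c["death"])
]
print(kept)
```

Candidates without known life dates are kept in this sketch, since the date alone cannot exclude them.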
See TODO.md