Closed peterdesmet closed 5 years ago
To avoid clashes between column names, I would just rename the `kingdom` you get from GBIF to `gbif_kingdom`:
rename("gbif_kingdom" = "kingdom")
Getting rid of `NA` in the names already improves the matching dramatically:
matchType | n |
---|---|
EXACT | 3955 |
FUZZY | 7 |
NONE | 883 |
With UTF-8 fix beforehand:
matchType | n |
---|---|
EXACT | 4069 |
FUZZY | 9 |
NONE | 767 |
After the first name matching, I would run the `NONE` matches again, but this time passing just the `acceptedname` without the `scientificnameauthorship`:
matchType | n |
---|---|
EXACT | 239 |
FUZZY | 3 |
NONE | 641 |
I would keep this in two steps, first with author, then without. Providing the author name in the first step avoids `NONE` matches caused by the same name existing with two different authors. Matching without the author in the second step reduces `NONE` matches caused by deviating author spellings:
Neosartorya fischeri var. glabra (Fennell & Raper) Malloch & Cain
Neosartorya fischeri var. glabra Fennell & Raper, 1973
As some of the matches might be wrong, I would update the `matchType` for these to:
EXACT_WITHOUT_AUTHOR
FUZZY_WITHOUT_AUTHOR
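The two-step idea above can be sketched in base R as follows. This is only an illustration: `toy_match()` is a hypothetical stand-in for the real GBIF backbone lookup, `match_two_steps()` is an invented helper name, and which of the two author spellings the backbone actually holds is an assumption made for the example.

```r
# Hypothetical stand-in for a GBIF backbone lookup (NOT a real API):
# returns "EXACT" when the string is known to the toy backbone, else "NONE".
toy_match <- function(name) {
  backbone <- c(
    "Neosartorya fischeri var. glabra (Fennell & Raper) Malloch & Cain",
    "Neosartorya fischeri var. glabra"
  )
  ifelse(name %in% backbone, "EXACT", "NONE")
}

match_two_steps <- function(taxa, match_fun) {
  # Step 1: match on name plus authorship
  with_author <- paste(taxa$acceptedname, taxa$scientificnameauthorship)
  taxa$matchType <- match_fun(with_author)

  # Step 2: retry only the NONE rows, this time without authorship,
  # and flag any hit so it can be told apart from a step 1 match
  retry <- taxa$matchType == "NONE"
  if (any(retry)) {
    second <- match_fun(taxa$acceptedname[retry])
    hit <- second != "NONE"
    second[hit] <- paste0(second[hit], "_WITHOUT_AUTHOR")
    taxa$matchType[retry] <- second
  }
  taxa
}

taxa <- data.frame(
  acceptedname = c("Neosartorya fischeri var. glabra",
                   "Neosartorya fischeri var. glabra"),
  scientificnameauthorship = c("(Fennell & Raper) Malloch & Cain",
                               "Fennell & Raper, 1973"),
  stringsAsFactors = FALSE
)
result <- match_two_steps(taxa, toy_match)
print(result$matchType)
```

Rows matched in step 1 keep their plain `matchType`, so the flag only ever appears on matches that relied on dropping the author.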
Regarding matching without author: I would refrain from doing this until we have more information on how many unmatched taxa have occurrences. There might be valid reasons why a taxon with author A is different from a taxon with author B:
id | taxon with author A | taxon with author B |
---|---|---|
10997 | Agromyza riparia Van der Wulp, 1871 | Agromyza riparia Malloch, 1915 |
18817 | Apion rubiginosum Balfour-Browne , 1944 | Apion rubiginosum Grill, 1893 |
35592 | Aprostocetus Graham, 1987 | Aprostocetus Westwood, 1833 |
17929 | Bidessus unistriatus (Schrank, 1781) | Bidessus unistriatus (Goeze, 1777) |
48330 | Bolbolaimus denticulatus Gerlach, 1953 | Bolbolaimus denticulatus Cobb, 1920 |
18372 | Brachypterus glaber (Stephens, 1832) | Brachypterus glaber (Newman, 1834) |
17709 | Caenopsis fissirostris Gutfleisch, V ed. Bose, F.C. , 1859 | Caenopsis fissirostris (Walton, 1847) |
51245 | Chiron Villa, 1833 | Chiron Macleay, 1819 |
2347 | Cis nitidus (F., 1792) | Cis nitidus (Hatch, 1962) |
36645 | Clogmia Jezek, 1983 | Clogmia Enderlein, 1937 |
34341 | Coniopteryx Meinander, 1972 | Coniopteryx Curtis, 1834 |
48170 | Coregonus lavaretus Linnaeus, 1758 | Coregonus lavaretus Valenciennes, 1848 |
20700 | Crepis vesicaria Thuill. | Crepis vesicaria L. |
17643 | Curculio glandium Desbrochers, J. , 1868 | Curculio glandium Marsham, 1802 |
20187 | Cyrnus flavidus Mabille 1871 | Cyrnus flavidus McLachlan, 1864 |
`NA` values are no longer present, and UTF-8 encoding is applied from the start by reading the data from the database with the parameter `encoding = latin1`.
About: https://github.com/inbo/ibge-bim-species/issues/9#issuecomment-482603604: yes, first check how many taxa are linked to occurrences.
I added the number of occurrences to the taxa information and, somewhat to my surprise, only a small subset of taxa are linked to occurrences.
has_occs | n |
---|---|
FALSE | 43722 |
TRUE | 8111 |
In particular, for unmatched taxa (matching by `acceptedname` + `scientificnameauthorship`):
has_occs | n |
---|---|
FALSE | 939 |
TRUE | 266 |
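Counts like the ones above can be produced with a few lines of base R. The sketch below uses a tiny invented data frame; the column name `n_occs` and the values are hypothetical and only illustrate the shape of the calculation.

```r
# Toy data: one row per taxon, with its match result and a hypothetical
# occurrence count (n_occs is an assumed column name, values are invented)
taxa <- data.frame(
  matchType = c("EXACT", "NONE", "NONE", "FUZZY", "NONE"),
  n_occs = c(12, 0, 3, 0, 0),
  stringsAsFactors = FALSE
)

# A taxon "has occurrences" when at least one occurrence is linked to it
taxa$has_occs <- taxa$n_occs > 0

# Restrict to unmatched taxa and count them by has_occs
unmatched <- taxa[taxa$matchType == "NONE", ]
counts <- table(has_occs = unmatched$has_occs)
print(counts)
```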
@peterdesmet: I would still go further by trying to match by `acceptedname` only, as you suggested in https://github.com/inbo/ibge-bim-species/issues/9#issuecomment-482583012. It would be informative for the experts during manual checking, and it would not take much of my time to program. What do you think?
Agree to try to match those by acceptedName only. I would indicate with a flag (e.g. add to matchType) how they were mapped, so we can differentiate.
To improve the matching, I see that passing names without kingdom can reduce the number of `NONE` matches.
Example:
Xanthophyta is linked by parent id to kingdom Plantae
Xanthophyta is linked to kingdom Chromista
So, I added two match steps:
1. (`acceptedname` + `scientificnameauthorship`) without passing kingdom
2. `acceptedname` only, without passing kingdom

Flagged matchType: `EXACT_WITHOUT_KINGDOM` from step 1, `EXACT_WITHOUT_AUTHOR_WITHOUT_KINGDOM`, `FUZZY_WITHOUT_AUTHOR_WITHOUT_KINGDOM` from step 2.

So, in total 4 match attempts. See #10.
Good idea to match without kingdom, but I would reduce (and thus simplify) the number of match steps.
I think there will be no difference between EXACT and EXACT_WITHOUT_KINGDOM, so would only do EXACT.
There might be a difference between EXACT WITHOUT AUTHOR and EXACT WITHOUT AUTHOR/KINGDOM. These two could be kept.
As a result of the current 4-step match workflow, we have the following table:
matchType | n_taxa |
---|---|
EXACT | 50628 |
EXACT_WITHOUT_AUTHOR | 177 |
EXACT_WITHOUT_KINGDOM | 89 |
EXACT_WITHOUT_AUTHOR_WITHOUT_KINGDOM | 7 |
FUZZY_WITHOUT_AUTHOR_WITHOUT_KINGDOM | 398 |
NONE | 534 |
So, matching without kingdom helps to match 89 taxa. That's not a small number, is it?
I don't fully understand your last sentence:
> There might be a difference between EXACT WITHOUT AUTHOR and EXACT WITHOUT AUTHOR/KINGDOM. These two could be kept.
But those 89 EXACT_WITHOUT_KINGDOM will probably be bundled with the 7 EXACT_WITHOUT_AUTHOR_WITHOUT_KINGDOM
My last sentence isn't that important, I would just have:
EXACT
EXACT_WITHOUT_AUTHOR
EXACT_WITHOUT_AUTHOR_AND_KINGDOM
FUZZY_WITHOUT_AUTHOR_AND_KINGDOM
NONE
Thanks @peterdesmet: I now understand it perfectly. Yes, it makes sense. I will wait for the review by @LienReyserhove, and possibly her commits, before applying this change, to avoid conflicts.
Fix pasted NA
The first issue is that when pasting `acceptedName` and `scientificnameauthorship`, `NA` values are returned as `Narcine NA`.
I would add a single step to prepare the data for matching:
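A minimal sketch of what such a preparation step could look like in base R, assuming the goal is simply to leave out a missing authorship. `build_match_name()` is a hypothetical helper name, not something taken from the repository.

```r
# paste() converts NA to the literal string "NA", which is what produces
# names such as "Narcine NA"
naive <- paste("Narcine", NA)

# Hypothetical helper: append the authorship only when it is present
build_match_name <- function(name, authorship) {
  missing_author <- is.na(authorship) | authorship == ""
  ifelse(missing_author, name, paste(name, authorship))
}

prepared <- build_match_name(
  name = c("Narcine", "Crepis vesicaria"),
  authorship = c(NA, "L.")
)
print(naive)
print(prepared)
```

Doing this once, before any matching, keeps the later match steps free of special cases for missing authors.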
Fix UTF-8 reading of data
Some of the author names have odd characters:
@damianooldoni, that is because the `taxa_no_match_GBIF_backbone.tsv` file I got from you was encoded in `Windows-1252`. Converting that to `UTF8` fixed some of the issues.
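For illustration, the kind of conversion described here can be reproduced in base R with `iconv()`. The author name below is an invented example, and `"CP1252"` is the iconv name commonly used for Windows-1252; whether a given platform's iconv accepts it is an assumption.

```r
# Bytes of a hypothetical author name "Müller" as stored in a
# Windows-1252 file: 0xFC is the single-byte code for "ü"
raw_name <- rawToChar(as.raw(c(0x4d, 0xfc, 0x6c, 0x6c, 0x65, 0x72)))

# Re-encode from Windows-1252 (CP1252) to UTF-8
fixed <- iconv(raw_name, from = "CP1252", to = "UTF-8")

# Inspect the Unicode code points of the converted string
code_points <- utf8ToInt(fixed)
print(code_points)
```

When reading a whole file, the same conversion can usually be requested at read time, for example with the `fileEncoding` argument of `read.delim()`.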