Reduce number of NONE in name matching

peterdesmet commented 5 years ago

Fix pasted NA

The first issue is that pasting acceptedname and scientificnameauthorship turns a missing authorship into the literal string NA, so a name without author is sent as Narcine NA.
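A minimal base R illustration of the problem, using the genus from the example above: paste() coerces a missing value to the string "NA" rather than propagating it, so the guard has to be explicit.

```r
# paste() converts NA to the literal string "NA"
acceptedname <- "Narcine"
scientificnameauthorship <- NA_character_

pasted <- paste(acceptedname, scientificnameauthorship)
pasted
# "Narcine NA"

# Guarding on is.na() keeps the bare name instead
guarded <- if (is.na(scientificnameauthorship)) {
  acceptedname
} else {
  paste(acceptedname, scientificnameauthorship)
}
guarded
# "Narcine"
```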

I would add a single step to prepare the data for matching:

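library(dplyr)

taxa <-
  taxa %>%
  # Append the authorship only when it is present, so a missing author
  # does not end up as the literal string "NA" in the name
  mutate(name_for_gbif = if_else(
    !is.na(scientificnameauthorship),
    paste(acceptedname, scientificnameauthorship),
    acceptedname
  )) %>%
  # Map Latin rank labels to their English equivalents;
  # informal groups get no rank at all
  mutate(rank_for_gbif = recode(taxonranken,
    "forma" = "form",
    "subforma" = "subform",
    "informal group" = NA_character_
  ))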

Fix UTF-8 reading of data

Some of the author names have odd characters:

Cicadula frontalis (Herrich-Schffer, 1835)

@damianooldoni, that is because the taxa_no_match_GBIF_backbone.tsv file I got from you was encoded in Windows-1252. Converting it to UTF-8 fixed some of the issues.
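A small sketch of what the conversion does, using base R's iconv(). It assumes the character lost in the example above is ä (as in Schäffer); the byte values are the standard Windows-1252 and UTF-8 encodings of that letter.

```r
# "Schaffer" with a-umlaut, as stored in Windows-1252:
# the a-umlaut is the single byte 0xE4
bytes_cp1252 <- as.raw(c(0x53, 0x63, 0x68, 0xe4, 0x66, 0x66, 0x65, 0x72))
author_cp1252 <- rawToChar(bytes_cp1252)

# Re-encode to UTF-8, where a-umlaut becomes the two bytes 0xC3 0xA4
author_utf8 <- iconv(author_cp1252, from = "windows-1252", to = "UTF-8")
charToRaw(author_utf8)
```

When reading a whole file, the same declaration can usually be made at read time, for example through the fileEncoding argument of read.delim().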

peterdesmet commented 5 years ago

To avoid clashes between column names, I would just rename the kingdom column you get from GBIF to gbif_kingdom:

rename("gbif_kingdom" = "kingdom")
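To show why the rename matters, here is a toy sketch in base R (so it runs without dplyr). The two data frames and their values are made up for illustration; only the column names kingdom and gbif_kingdom come from the comment above.

```r
# Toy taxon from the database and a toy match result; values are invented
taxa <- data.frame(
  acceptedname = "Coregonus lavaretus",
  kingdom = "Animalia",
  stringsAsFactors = FALSE
)
gbif_match <- data.frame(
  matchType = "EXACT",
  kingdom = "Animalia",
  stringsAsFactors = FALSE
)

# Rename the GBIF column before combining, so both kingdoms stay distinguishable
names(gbif_match)[names(gbif_match) == "kingdom"] <- "gbif_kingdom"
combined <- cbind(taxa, gbif_match)
names(combined)
```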

peterdesmet commented 5 years ago

Getting rid of NA in the names already improves the matching dramatically:

matchType	n
EXACT	3955
FUZZY	7
NONE	883

With UTF-8 fix beforehand:

matchType	n
EXACT	4069
FUZZY	9
NONE	767

peterdesmet commented 5 years ago

Optionally

After the first name matching, I would run the NONE names again, this time passing only the acceptedname, without the scientificnameauthorship:

matchType	n
EXACT	239
FUZZY	3
NONE	641

I would keep this in two steps, first with author, then without. Providing the author in the first step avoids NONE matches caused by the same name existing with two different authors. Matching without the author in the second step then reduces the NONE matches caused by deviating author spellings:

in db: Neosartorya fischeri var. glabra (Fennell & Raper) Malloch & Cain
in gbif: Neosartorya fischeri var. glabra Fennell & Raper, 1973

As some of the matches might be wrong, I would update the matchType for these to:

EXACT_WITHOUT_AUTHOR
FUZZY_WITHOUT_AUTHOR
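The relabelling could look roughly like this. It is only a sketch: the taxon names and both sets of match results are invented, and flag_without_author is a hypothetical helper, not something from the repository.

```r
# Toy result of the first pass (with author); values are made up
first_pass <- data.frame(
  acceptedname = c("Taxon a", "Taxon b", "Taxon c"),
  matchType = c("EXACT", "NONE", "NONE"),
  stringsAsFactors = FALSE
)

# Hypothetical outcome of re-matching the two NONE rows without author
second_pass_type <- c("EXACT", "NONE")

# Append the flag to every successful second-pass match; NONE stays NONE
flag_without_author <- function(match_type) {
  ifelse(match_type == "NONE", match_type, paste0(match_type, "_WITHOUT_AUTHOR"))
}

result <- first_pass
is_none <- result$matchType == "NONE"
result$matchType[is_none] <- flag_without_author(second_pass_type)
result$matchType
```

Rows matched in the first pass keep their plain EXACT or FUZZY label, so the two passes stay distinguishable in the final table.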

peterdesmet commented 5 years ago

Regarding matching without author: I would refrain from doing this until we know how many unmatched taxa have occurrences. There might be valid reasons why a taxon with author A differs from a taxon with author B:

10997	Agromyza riparia Van der Wulp, 1871	Agromyza riparia Malloch, 1915
18817	Apion rubiginosum Balfour-Browne , 1944	Apion rubiginosum Grill, 1893
35592	Aprostocetus Graham, 1987	Aprostocetus Westwood, 1833
17929	Bidessus unistriatus (Schrank, 1781)	Bidessus unistriatus (Goeze, 1777)
48330	Bolbolaimus denticulatus Gerlach, 1953	Bolbolaimus denticulatus Cobb, 1920
18372	Brachypterus glaber (Stephens, 1832)	Brachypterus glaber (Newman, 1834)
17709	Caenopsis fissirostris Gutfleisch, V ed. Bose, F.C. , 1859	Caenopsis fissirostris (Walton, 1847)
51245	Chiron Villa, 1833	Chiron Macleay, 1819
2347	Cis nitidus (F., 1792)	Cis nitidus (Hatch, 1962)
36645	Clogmia Jezek, 1983	Clogmia Enderlein, 1937
34341	Coniopteryx Meinander, 1972	Coniopteryx Curtis, 1834
48170	Coregonus lavaretus Linnaeus, 1758	Coregonus lavaretus Valenciennes, 1848
20700	Crepis vesicaria Thuill.	Crepis vesicaria L.
17643	Curculio glandium Desbrochers, J. , 1868	Curculio glandium Marsham, 1802
20187	Cyrnus flavidus Mabille 1871	Cyrnus flavidus McLachlan, 1864

damianooldoni commented 5 years ago

NA values are no longer present, and UTF-8 encoding is applied from the start by reading the data from the database with the parameter encoding = latin1.

About: https://github.com/inbo/ibge-bim-species/issues/9#issuecomment-482603604: yes, first check how many taxa are linked to occurrences.

Taxa with/without occurrences

I added the number of occurrences to the taxa information and, somewhat to my surprise, only a small subset of taxa are linked to occurrences.

has_occs	n
FALSE	43722
TRUE	8111

Unmatched taxa with/without occurrences

In particular, for unmatched taxa (matching by acceptedname + scientificnameauthorship):

has_occs	n
FALSE	939
TRUE	266
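Both tables can be derived along these lines. This is a toy sketch in base R: the five rows and the column names taxon_id and n_occs are invented for illustration; only has_occs and matchType are names used in this thread.

```r
# Toy taxa with an occurrence count and a match result; values are made up
taxa <- data.frame(
  taxon_id = 1:5,
  n_occs = c(0, 12, 0, 0, 3),
  matchType = c("EXACT", "NONE", "NONE", "EXACT", "NONE"),
  stringsAsFactors = FALSE
)

# A taxon has occurrences when its count is above zero
taxa$has_occs <- taxa$n_occs > 0

# Counts over all taxa, then over the unmatched taxa only
counts_all <- table(has_occs = taxa$has_occs)
counts_unmatched <- table(has_occs = taxa$has_occs[taxa$matchType == "NONE"])
counts_all
counts_unmatched
```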

@peterdesmet : I would still go further by trying to match by acceptedname only, as you suggested in https://github.com/inbo/ibge-bim-species/issues/9#issuecomment-482583012. It would be informative for the experts during manual checking, and it would not take much of my time to program. What do you think?

peterdesmet commented 5 years ago

Agree to try to match those by acceptedName only. I would indicate with a flag (e.g. add to matchType) how they were mapped, so we can differentiate.

damianooldoni commented 5 years ago

To improve the matching, I see that passing names without kingdom can reduce the number of NONE.

Example:

in db: phylum Xanthophyta is linked by parent id to kingdom Plantae
in gbif backbone: phylum Xanthophyta is linked to kingdom Chromista

So, I added two match steps:

full name (acceptedname + scientificnameauthorship) without passing kingdom
acceptedname only without passing kingdom

Flagged matchType:

EXACT_WITHOUT_KINGDOM from step 1,
EXACT_WITHOUT_AUTHOR_WITHOUT_KINGDOM, FUZZY_WITHOUT_AUTHOR_WITHOUT_KINGDOM from step 2

So, in total 4 match attempts. See #10.
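The four attempts can be sketched as a cascade in which each step retries only the taxa that are still NONE. Everything below is illustrative: match_in_steps and toy_match are hypothetical names, the toy backbone is a stand-in for the real GBIF lookup (in practice match_fun might wrap a call such as rgbif's name_backbone(), which is an assumption, since the thread does not show the call), and the order of the steps is assumed from the order in which they were introduced here.

```r
# Toy backbone standing in for GBIF; not real backbone content
backbone <- data.frame(
  canonical = c("Coregonus lavaretus", "Neosartorya fischeri var. glabra", "Xanthophyta"),
  full = c(
    "Coregonus lavaretus Linnaeus, 1758",
    "Neosartorya fischeri var. glabra Fennell & Raper, 1973",
    "Xanthophyta"
  ),
  kingdom = c("Animalia", "Fungi", "Chromista"),
  stringsAsFactors = FALSE
)

# Hypothetical matcher: EXACT when the name is found (and the kingdom agrees, if given)
toy_match <- function(name, kingdom) {
  hit <- backbone$full == name | backbone$canonical == name
  if (!is.na(kingdom)) hit <- hit & backbone$kingdom == kingdom
  if (any(hit)) "EXACT" else "NONE"
}

match_in_steps <- function(taxa, match_fun) {
  steps <- list(
    list(use_author = TRUE,  use_kingdom = TRUE,  suffix = ""),
    list(use_author = FALSE, use_kingdom = TRUE,  suffix = "_WITHOUT_AUTHOR"),
    list(use_author = TRUE,  use_kingdom = FALSE, suffix = "_WITHOUT_KINGDOM"),
    list(use_author = FALSE, use_kingdom = FALSE, suffix = "_WITHOUT_AUTHOR_WITHOUT_KINGDOM")
  )
  taxa$matchType <- "NONE"
  for (step in steps) {
    # Only taxa that are still unmatched are retried in this step
    for (i in which(taxa$matchType == "NONE")) {
      author <- taxa$scientificnameauthorship[i]
      name <- if (step$use_author && !is.na(author)) {
        paste(taxa$acceptedname[i], author)
      } else {
        taxa$acceptedname[i]
      }
      kingdom <- if (step$use_kingdom) taxa$kingdom[i] else NA_character_
      result <- match_fun(name, kingdom)
      if (result != "NONE") taxa$matchType[i] <- paste0(result, step$suffix)
    }
  }
  taxa
}

taxa <- data.frame(
  acceptedname = c("Coregonus lavaretus", "Neosartorya fischeri var. glabra", "Xanthophyta"),
  scientificnameauthorship = c("Linnaeus, 1758", "(Fennell & Raper) Malloch & Cain", NA),
  kingdom = c("Animalia", "Fungi", "Plantae"),
  stringsAsFactors = FALSE
)
matched <- match_in_steps(taxa, toy_match)
matched$matchType
```

With this toy data the first taxon matches in the first step, the second only once the author is dropped, and Xanthophyta only once the kingdom is dropped.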

peterdesmet commented 5 years ago

Good idea to match without kingdom, but I would reduce (and thus simplify) the number of match steps.

I think there will be no difference between EXACT and EXACT_WITHOUT_KINGDOM, so would only do EXACT.

There might be a difference between EXACT WITHOUT AUTHOR and EXACT WITHOUT AUTHOR/KINGDOM. These two could be kept.

damianooldoni commented 5 years ago

As a result of the current 4-step match workflow, we have the following table:

matchType	n_taxa
EXACT	50628
EXACT_WITHOUT_AUTHOR	177
EXACT_WITHOUT_KINGDOM	89
EXACT_WITHOUT_AUTHOR_WITHOUT_KINGDOM	7
FUZZY_WITHOUT_AUTHOR_WITHOUT_KINGDOM	398
NONE	534

So, matching without kingdom helps to match 89 taxa. That's not a small number, is it?
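As a quick sanity check, the counts in the table above can be summed and compared with the with/without occurrences totals reported earlier in this thread. The numbers are copied from those two tables; the variable names are only illustrative.

```r
# Counts per matchType, copied from the table above
match_counts <- c(
  EXACT = 50628,
  EXACT_WITHOUT_AUTHOR = 177,
  EXACT_WITHOUT_KINGDOM = 89,
  EXACT_WITHOUT_AUTHOR_WITHOUT_KINGDOM = 7,
  FUZZY_WITHOUT_AUTHOR_WITHOUT_KINGDOM = 398,
  NONE = 534
)

# Taxa without and with occurrences, copied from the has_occs table
occ_counts <- c(without_occs = 43722, with_occs = 8111)

# Both tables should describe the same set of taxa
total_taxa <- sum(match_counts)
total_taxa == sum(occ_counts)

# Share of taxa that remain unmatched after all four steps
share_none <- match_counts[["NONE"]] / total_taxa
round(100 * share_none, 2)
```

Both totals come to 51833 taxa, so every taxon is accounted for in the match table.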

I don't fully understand your last sentence:

There might be a difference between EXACT WITHOUT AUTHOR and EXACT WITHOUT AUTHOR/KINGDOM. These two could be kept.

peterdesmet commented 5 years ago

But those 89 EXACT_WITHOUT_KINGDOM will probably be bundled with the 7 EXACT_WITHOUT_AUTHOR_WITHOUT_KINGDOM.

My last sentence isn't that important, I would just have:

EXACT
EXACT_WITHOUT_AUTHOR
EXACT_WITHOUT_AUTHOR_AND_KINGDOM
FUZZY_WITHOUT_AUTHOR_AND_KINGDOM
NONE

damianooldoni commented 5 years ago

Thanks @peterdesmet : now I understand it perfectly. Yes, it makes sense... I will wait for the review by @LienReyserhove, and possibly her commits, before applying this change, to avoid conflicts.
