riparias / rato-occurrences

DwC mapping of RATO vwz occurrences
MIT License

gbif codes outdated or not correct: manually map taxonomic information? #24

Open damianooldoni opened 1 year ago

damianooldoni commented 1 year ago

It seems that some GBIF codes do not point to the taxon the Dutch name refers to. For example, Waterteunisbloem should point to Ludwigia grandiflora, but the provided gbif_code points to the genus.

I would rather not use the gbif_code column, as it can become outdated, and RATO could also drop it from their database. As the number of species is quite limited (<30), I think a manual mapping is the best solution. We do the same in the POV datasets. This way we are easily warned when a new species pops up in the raw data: it would appear as NA in the DwC output file, which the dedicated test detects.
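A minimal sketch of what such a manual mapping could look like, in base R. Only the Waterteunisbloem entry comes from this issue. The other Dutch names (and their exact spelling in the RATO data), the object names `species_lookup` and `raw_soort`, and the toy input are assumptions for illustration, not the repository's code.

```r
# Hypothetical lookup: Dutch vernacular name -> scientific name.
# Only Waterteunisbloem is taken from this issue; the other entries are
# illustrative and their spelling in the RATO data is assumed.
species_lookup <- c(
  "Waterteunisbloem"     = "Ludwigia grandiflora",
  "Muskusrat"            = "Ondatra zibethicus",
  "Japanse duizendknoop" = "Fallopia japonica"
)

# Toy raw data containing one name that is not in the lookup yet.
raw_soort <- c("Muskusrat", "Waterteunisbloem", "Mantsjoerese wilde rijst")

# Indexing a named vector by an unknown name yields NA.
scientific_name <- unname(species_lookup[raw_soort])

# Names that still have to be added to the lookup before publishing.
unmapped <- unique(raw_soort[is.na(scientific_name)])
if (length(unmapped) > 0) {
  warning("Unmapped species: ", paste(unmapped, collapse = ", "))
}
```

Because unknown names become NA instead of silently keeping a stale code, a test that requires scientificName to be non-NA would flag them.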

As we need to publish fast now, I will correct the GBIF codes in the mapping as a patch.

PietrH commented 11 months ago

As is the case for this record:

| Dossier_ID | OBJECTID | Dossier_Status | Domein | Soort | Waarneming | Actie | Materiaal_Vast | Opmerkingen_admin | Opmerkingen | Melder_Naam | Melder_Klant | Planning_Datum | X | Y | Gemeente | Aard_Locatie | GBIF_Code | Dossier_Link | Dossier_Link_ID | Hoofddossier_ID | Aangemaakt_Datum | Laatst_Bewerkt_Datum | Datum_Van | Geometrie_Type | Shape |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 460271028 | 589775 | Opvolging | Plant | Mantsjoerese wilde rijst | NA | NA | NA | NA | NA | NA | Andere | NA | 95383.03 | 189125.1 | Deinze | Publiek | 7901745 | 0 | NA | -1 | 2023-10-09 15:19:50 | 2023-10-09 15:20:11 | 2023-10-09 15:19:50 | Point | POINT (95383.02510000 189125.06350000) |

It was mapped to the wrong taxon key. The taxon keys used for RIPARIAS can be looked up on this page: https://alert.riparias.be/about-data


You can then look them up as follows:

library(magrittr)  # provides the %>% pipe used below
c('2978552', '2489005', '3190653', '2498252', '3084923', '2340977', '2706080', '5328593', '2502792', '3170247', '2440934', '3129663', '2882443', '2437394', '2437399', '3189935', '3169169', '4284921', '4417558', '2704521', '2482499', '5362054', '5329263', '2702865', '2765942', '5329212', '2225776', '7346102', '8930656', '8721209', '8909595', '8979506', '8971201', '5712056', '2350580', '2350570', '2984306', '6063677', '7287606', '3034825', '3628745', '3642949', '2434271', '5384931', '2984537', '7978544', '2891770', '8848208', '2865565', '9799308', '2394486', '8114276', '5855350', '2427091', '5421039', '5420991', '2650436', '2869311', '5289808', '2394604', '2440946', '4264680', '5361785', '5361762', '2433536', '2434552', '5219858', '2498305', '2226990', '3086784', '5828232', '2390064', '4033648', '3088310', '2870583', '7965247', '2766030', '2227289', '2227300', '9442269', '5218786', '5358460', '2362868', '2977647', '2486131', '5824863', '5274863', '5384932', '5219681', '5219683', '5035187', '5035230', '5035017', '2437450', '2480764', '2443002', '3054399', '1311477', '1315391', '5217334', '10919373') %>%
    purrr::map_dfr(~rgbif::name_usage(.x)$data) %>%
    View("riparias-taxa")

So I suggest manually mapping the vernacular name in "Soort" via a lookup table, built by parsing the list of LIFE RIPARIAS target species.
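Such a lookup-table join might be sketched as follows. This is an illustration under assumptions, not the repository's implementation. `taxa_lookup`, the toy rows of `input_data` and the column names `scientific_name` and `taxon_key` are hypothetical. The two taxon keys are the ones already used as overrides in the mapping script, and pairing 2227300 with Procambarus clarkii is my reading of that override.

```r
# Hypothetical lookup table: one row per vernacular name expected in "Soort".
# Simplification: in the raw data the crayfish name appears in `waarneming`
# (with soort == "Rivierkreeft"); it is treated as a Soort value here.
taxa_lookup <- data.frame(
  soort = c("Waterteunisbloem", "Rode Amerikaanse rivierkreeft"),
  scientific_name = c("Ludwigia grandiflora", "Procambarus clarkii"),
  taxon_key = c(5421039, 2227300),
  stringsAsFactors = FALSE
)

# Toy input; the real data has many more columns.
input_data <- data.frame(
  dossier_id = c(1, 2, 3),
  soort = c("Waterteunisbloem", "Mantsjoerese wilde rijst",
            "Rode Amerikaanse rivierkreeft"),
  stringsAsFactors = FALSE
)

# match() keeps the input row order; names missing from the lookup give NA.
idx <- match(input_data$soort, taxa_lookup$soort)
input_data$scientific_name <- taxa_lookup$scientific_name[idx]
input_data$taxon_key <- taxa_lookup$taxon_key[idx]

print(input_data)
```

With this approach the provided GBIF_Code is never read, so an outdated code in the source cannot leak into the output. A species missing from the lookup ends up with NA in both new columns.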

We currently already have a hardcoded list of species we expect in the tests:


testthat::test_that("scientificName is never NA and one of the list", {
  species <- c(
    "Ondatra zibethicus",
    "Fallopia japonica",
    "Castor fiber",
    "Gallus gallus domesticus",
    "Myriophyllum aquaticum",
    "Alopochen aegyptiaca",
    "Ludwigia peploides",
    "Martes foina",
    "Hydrocotyle ranunculoides",
    "Vespa velutina",
    "Heracleum mantegazzianum",
    "Rattus norvegicus",
    "Cairina moschata",
    "Anser anser domesticus",
    "Neovison vison",
    "Trachemys scripta",
    "Psittacula krameri",
    "Oryctolagus cuniculus",
    "Branta canadensis",
    "Branta leucopsis",
    "Anatidae",
    "Anser anser",
    "Impatiens glandulifera",
    "Myocastor coypus",
    "Lysichiton americanus",
    "Procambarus clarkii",
    "Ludwigia grandiflora",
    "Sciurus",
    "Crassula helmsii"
  )
  testthat::expect_true(all(!is.na(dwc_occurrence$scientificName)))
  testthat::expect_true(all(dwc_occurrence$scientificName %in% species))
})

We are also already overwriting some of the provided taxon IDs (GBIF_Code):

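library(dplyr)     # mutate(), case_when()
library(stringr)   # str_detect()
library(magrittr)  # %<>% assignment pipe

# Overwrite gbif_code where the provided value is known to be wrong;
# all other rows keep the provided gbif_code.
input_data %<>%
  mutate(gbif_code = case_when(
    soort == "Waterteunisbloem" ~ 5421039,
    soort == "Rivierkreeft" &
      (str_detect(waarneming, "Rode Amerikaanse rivierkreeft") |
         str_detect(opmerkingen, "Amerikaanse")) ~ 2227300,
    TRUE ~ gbif_code
  ))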

In short, I support this idea. I think we should switch over to a manual mapping via a lookup table. I will do this, but see this as medium priority.

PietrH commented 1 month ago

Related to #207