trias-project / unified-checklist

🇧🇪 Global Register of Introduced and Invasive Species - Belgium
https://trias-project.github.io/unified-checklist/
MIT License

Problems with duplicate descriptions due to different vocabularies? #27

Open LienReyserhove opened 5 years ago

LienReyserhove commented 5 years ago

To select unique descriptions across checklists for the unified checklist, we apply the following code (section 6.5, point 3):

  # Group by type, description and verificationKey across checklists
  group_by(
    type,
    description,
    verificationKey
  ) %>%

  # Select first datasetKey, taxonKey and scientificName
  summarize(
    datasetKey = first(datasetKey),
    taxonKey = first(taxonKey),
    scientificName = first(scientificName)
  ) %>%

Because we group by type, description and verificationKey, we risk selecting duplicate descriptions when checklists use different vocabularies. An example:

| verificationKey | type | description | taxonKey |
| --- | --- | --- | --- |
| a | native range | Northern America | 1 |
| a | native range | Southern America | 1 |
| b | native range | North America | 1 |
Here, all three descriptions for this species will be selected, even though "Northern America" and "North America" mean the same thing in different vocabularies. To be considered...

peterdesmet commented 5 years ago

Indeed. Descriptions should be cleaned before the group by. This can be tackled after the first publication.
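
A minimal sketch of what cleaning before the group by could look like, not the project's actual implementation. It assumes toy data in which two checklists (`datasetKey` "a" and "b") describe the same taxon, so all rows share one `verificationKey`. The `preferred_terms` lookup and the `select_unique()` helper are hypothetical names introduced only for this illustration.

```r
library(dplyr)

# Hypothetical toy data: two checklists ("a", "b") describing the same taxon
descriptions <- tibble(
  datasetKey = c("a", "a", "b"),
  verificationKey = 1,
  type = "native range",
  description = c("Northern America", "Southern America", "North America")
)

# Hypothetical lookup from variant terms to one preferred term
preferred_terms <- c("North America" = "Northern America")

# Same grouping logic as in the issue, wrapped in a helper for reuse
select_unique <- function(df) {
  df %>%
    group_by(type, description, verificationKey) %>%
    summarize(datasetKey = first(datasetKey), .groups = "drop")
}

# Without cleaning: every distinct spelling forms its own group
uncleaned <- select_unique(descriptions)

# With cleaning: map variants to the preferred term first,
# leaving terms without a mapping unchanged
cleaned <- descriptions %>%
  mutate(
    description = coalesce(unname(preferred_terms[description]), description)
  ) %>%
  select_unique()

nrow(uncleaned) # 3
nrow(cleaned)   # 2
```

In this sketch the cleaned result keeps one "Northern America" row, taken from the first checklist, plus the "Southern America" row. A real mapping would need to cover the vocabularies actually used by the source checklists.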