trias-project / rinse-pathways-checklist

🚢 RINSE - Pathways and vectors of biological invasions in Northwest Europe
https://trias-project.github.io/rinse-pathways-checklist
MIT License

Cleaning steps references #4

Closed: LienReyserhove closed this issue 6 years ago

LienReyserhove commented 6 years ago

Some feedback needed:

The Zieritz et al. (2016) checklist has a reference column containing numbers. Two things with respect to that:

  1. The numbers are separated by commas and hyphens. A hyphen indicates a sequence, i.e. 1-4 refers to references 1, 2, 3 and 4. We need the expanded form. I haven't yet figured out how to generate these sequences in a way that keeps the code readable. I therefore suggest generating the sequences in the raw data file rather than performing the cleaning in the R script (which would make it messier). As this is a dead dataset, I think the cleaning step won't do any harm.

  2. For some species, about 12 reference numbers are provided, which is a lot. Just to be sure: is it really necessary to integrate the full reference? The fields will be full of text, but I guess there's no way around that, right?
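
For point 1, the expansion itself can stay fairly readable if it lives in a single small helper. Below is a minimal sketch in Python (not the R script used in this repository; the same logic ports to R). The function name `expand_references` is hypothetical, and the sketch assumes the cells use a plain ASCII hyphen and comma-separated integers, as in the 1-4 example above.

```python
def expand_references(value):
    """Expand a reference string such as "1-4, 7" into a list of integers."""
    numbers = []
    for part in value.split(","):
        part = part.strip()
        if not part:
            # Skip empty pieces, e.g. from a trailing comma
            continue
        if "-" in part:
            # "1-4" denotes the inclusive sequence 1, 2, 3, 4
            start, end = (int(p) for p in part.split("-"))
            numbers.extend(range(start, end + 1))
        else:
            numbers.append(int(part))
    return numbers


print(expand_references("1-4, 7, 10-12"))
# [1, 2, 3, 4, 7, 10, 11, 12]
```

Keeping this step in code rather than editing the raw file would leave the original data untouched, at the cost of one extra function in the script.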

LienReyserhove commented 6 years ago

Answer by @peterdesmet:

Regarding 1: pick the 🧠 of @stijnvanhoey

Regarding 2: let's say we have 30 unique references

  1. As here, I would create a separate sources.csv file with those 30 sources.
  2. It should have the columns:
    • number: 1, 2, 3 as in data
    • identifier: DOI, or other link
    • full_reference: Pensoft written citation
  3. For each source, populate the identifier with the DOI. If none is available, try to find a PDF link. If that is not available either, create a unique code (e.g. smith_2016)
  4. Use the references extension to link taxa to their sources. Only populate taxonID, identifier and bibliographicCitation. This extension will contain many duplicates.
  5. Use the identifier (DOI) in the distribution extension to populate source. Multiple sources should be separated with space pipe space, i.e. " | "
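
Steps 4 and 5 can be sketched as follows, again in Python rather than R. Everything here is illustrative: the contents of sources.csv are made-up placeholders (not real references or DOIs), and the helper names `load_sources`, `reference_rows` and `distribution_source` as well as the taxon ID are hypothetical. The sketch assumes each taxon's reference numbers have already been expanded into a list of integers.

```python
import csv
import io

# Placeholder stand-in for sources.csv; the identifiers and citations are invented
SOURCES_CSV = """number,identifier,full_reference
1,https://doi.org/10.0000/placeholder-1,Placeholder citation one
2,smith_2016,Placeholder citation two
3,https://example.org/report.pdf,Placeholder citation three
"""


def load_sources(text):
    """Read the sources table into a dict keyed by reference number."""
    reader = csv.DictReader(io.StringIO(text))
    return {int(row["number"]): row for row in reader}


def reference_rows(taxon_id, numbers, sources):
    """Build one references-extension row per (taxon, source) pair."""
    return [
        {
            "taxonID": taxon_id,
            "identifier": sources[n]["identifier"],
            "bibliographicCitation": sources[n]["full_reference"],
        }
        for n in numbers
    ]


def distribution_source(numbers, sources):
    """Join the identifiers of all sources with space pipe space."""
    return " | ".join(sources[n]["identifier"] for n in numbers)


sources = load_sources(SOURCES_CSV)
rows = reference_rows("taxon-001", [1, 2], sources)
print(len(rows))
print(distribution_source([1, 2, 3], sources))
```

Because the references extension gets one row per taxon and source, the same citation is repeated for every taxon that uses it, which is the duplication mentioned in step 4.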