First mapping - GitHub issues

LienReyserhove commented 6 years ago

This is a first attempt to map the ad hoc checklist data. @peterdesmet, it is now ready for review! Below is an overview of all mapped DwC terms, remaining questions, and issues.

Record level terms

References: see #12. For now, I used the identifiers, not the full bibliographic citations. For the full bibliographic citations, I used the literature references extension. However, many identifiers are lacking, so the mapping is still unfinished and has to be revised when all information is available.

[x] language
[x] license
[x] rightsHolder
[x] accessRights
[x] references
[x] datasetID
[x] institutionCode
[x] datasetName

Taxon core

[x] taxonID
[ ] scientificName: cleaning step performed in raw data, see #18
[x] kingdom
[x] phylum
[x] order
[x] family
[x] genus
[x] taxonRank
[ ] ~~vernacularName~~
[x] nomenclaturalCode

Literature reference extension:

For now, this extension only contains the full bibliographic citations. I did not include the identifiers yet. Whether or not the identifiers can/will be integrated depends on the raw data. See #12

[x] taxonID
[x] bibliographicCitation

Distribution extension:

The fields locationID and locality are based on the information in location in the raw GS file. Not all fields were populated. I assumed that in these cases, locality = Belgium.

[x] locationID
[x] locality
[x] countryCode
[x] occurrenceStatus
[x] establishmentMeans
[x] eventDate
[x] source
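
The fallback for empty locations described above could look roughly like the sketch below. It is not the project's actual script: the data frame `raw`, its column names, and the example values are assumptions made for illustration only.

```r
# Hypothetical raw distribution data; the column name `location` follows the
# description above, the values are made up for illustration.
raw <- data.frame(
  taxonID = c("t1", "t2", "t3"),
  location = c("Flanders", NA, ""),
  stringsAsFactors = FALSE
)

# Treat missing or empty locations as country-level records.
is_empty <- is.na(raw$location) | trimws(raw$location) == ""
raw$locality <- ifelse(is_empty, "Belgium", raw$location)

raw$locality
```

Checking for both `NA` and empty strings matters here, because an unpopulated spreadsheet cell can arrive in R as either one depending on how the file is read.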

Species profile extension

[x] taxonID
[x] isMarine
[x] isFreshwater
[x] isTerrestrial

Description extension

Native range

Some basic cleaning steps were performed in the raw data (e.g. use | as a separator). Other cleaning steps (e.g. change lower case to capital) were performed in the R-script.
Not all terms match the WGSRPD vocabulary. Some of these could be mapped to the UN geoscheme. However, some terms were not listed in either vocabulary, so I left them as they are for now. Suggestions are welcome.
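
As an illustration of the two cleaning steps mentioned above (the pipe separator and the change from lower case to capital), here is a minimal base R sketch. The input value is hypothetical and the actual R script may do this differently.

```r
# Hypothetical raw value: several native ranges in one cell, pipe-separated.
raw_value <- "europe|japan|North America"

# Split on the literal pipe (fixed = TRUE avoids regex escaping).
parts <- strsplit(raw_value, "|", fixed = TRUE)[[1]]

# Capitalise only the first letter of each value; the rest is left untouched.
cleaned <- paste0(toupper(substring(parts, 1, 1)), substring(parts, 2))

cleaned
```

Note that this only touches the first character, so a multi-word value such as "southern africa" would still need its second word capitalised separately.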

Mapped to WGSRPD standard:

Africa
Australia
Canary islands
Central America
China
Costa Rica
Cyprus
Eastern Asia
Europe
Hawaï
Japan
Mexico
New Zealand
North America
South America
Southeastern Europe
Southern Africa
Tasmania
Vietnam

Mapped to or matches with UN geoscheme:

East Africa
East Asia
America
Asia
Southeast Asia
Southeast Asia
Southern Asia
Southern Europe

Doesn't match any of the vocabularies above:

cosmopolitan
Iberia
Mallorca --> Baleares?
mediterranean
Middle America --> Central America?
North Pacific Ocean
Pantropical
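
To find the terms that still need a decision, each native range value can be checked against the two vocabularies. The sketch below uses shortened term lists copied from the overview above. The vector `native_range` is a hypothetical input, and none of this is taken from the actual script.

```r
# Term lists copied from the overview above (abbreviated).
wgsrpd <- c("Africa", "Australia", "China", "Europe", "Japan")
un_geoscheme <- c("East Africa", "East Asia", "Asia", "Southern Europe")

# Hypothetical cleaned native range values to classify.
native_range <- c("Europe", "East Asia", "Iberia", "Pantropical")

# Prefer WGSRPD, fall back to the UN geoscheme, otherwise leave as NA.
vocabulary <- ifelse(
  native_range %in% wgsrpd, "WGSRPD",
  ifelse(native_range %in% un_geoscheme, "UN geoscheme", NA_character_)
)

# Terms without a match are the ones that still need a decision.
unmatched <- native_range[is.na(vocabulary)]
unmatched
```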

Pathway of introduction

Data were already mapped to the CBD standard

Invasion stage

see #16

peterdesmet commented 6 years ago

[x] I would remove references
[x] I would use Brussels-Capital Region, Flemish Region, Walloon Region: probably best to do this in script, to keep spelling simple in source
[x] Southeast Asia appears twice, probably because of leading or trailing whitespace. Best to perform a trim on all values
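
A minimal sketch of why a trim helps, assuming the duplicate comes from a stray space (the values below are made up to reproduce the symptom):

```r
# The second value has a trailing space, so R treats it as a different string.
values <- c("Southeast Asia", "Southeast Asia ", "Southern Asia")

distinct_raw <- unique(values)              # still contains both variants
distinct_trimmed <- unique(trimws(values))  # whitespace removed before comparing

length(distinct_raw)
distinct_trimmed
```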

Questions

[x] Is the script able to parse | without pipes? I noticed one instance where the space was missing (now corrected in source data).
[x] What happens if two records of the same species (because of different distribution) differ in e.g. order? Are two taxa created? If so, I would double check if there are no duplicate taxonIDs
[x] Should I see any info regarding the UN geoscheme in the locality?

Data issues:

[x] Some names should be subspecies and Haplodrassus should be genus. I've corrected this in the source data
[x] eventDate: 2005-2009/2018
[x] Name Lacerta viridis (Laurenti, 1768) / bilineata to be corrected

LienReyserhove commented 6 years ago

To reply:

I would remove references

I would not. I think we really need this as we don't have a good, short identifier for each species (e.g. when we don't have a doi available). This is then the most complete and best information we have.

Is the script able to parse | without pipes? I noticed one instance where the space was missing (now corrected in source data)

Do you mean "without spaces"? If so, yes, it works even when a space is missing. You can try it with this script (where you can add or remove spaces between test1 and test2):

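library(tidyr)

# Two test values separated by a pipe, with no surrounding spaces
data_frame <- data.frame(
  test = c("test1|test2", "test3|test4"),
  stringsAsFactors = FALSE
)

# Split on a literal pipe; "\\|" escapes the regex alternation operator
separate(
  data = data_frame,
  col = "test",
  into = c("column_A", "column_B"),
  sep = "\\|"
)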

What happens if two records of the same species (because of different distribution) differ in e.g. order? Are two taxa created? If so, I would double check if there are no duplicate taxonIDs

In the case of one single species with two distribution records, the code will create only one taxonID for that species and the taxon core will contain only one record for that taxon.
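
The check for duplicate taxonIDs could be sketched as follows. This is not the code from the mapping script: the data frame `records`, its placeholder values, and the way taxonID is assigned are all assumptions for illustration.

```r
# Hypothetical input: "Species A" has two distribution records.
records <- data.frame(
  taxonID = c("sp-1", "sp-1", "sp-2"),
  scientificName = c("Species A", "Species A", "Species B"),
  order = c("Order X", "Order X", "Order Y"),
  locality = c("Flemish Region", "Walloon Region", "Belgium"),
  stringsAsFactors = FALSE
)

# The taxon core keeps only the taxon-level columns, one row per taxon.
taxon_cols <- c("taxonID", "scientificName", "order")
taxon <- unique(records[, taxon_cols])

# Fails if the same taxonID survives with conflicting taxon information.
stopifnot(anyDuplicated(taxon$taxonID) == 0)

nrow(taxon)
```

If the two records of "Species A" differed in e.g. order, `unique()` would keep both rows and the `stopifnot()` check would fail, which flags the inconsistency in the source data.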

Should I see any info regarding the UN geoscheme in the locality?

We could do that, but I'm not 100% fond of this, as the mapping would then be a combination of several standards or no standard at all. I would keep it limited to the WGSRPD vocabulary.

peterdesmet commented 6 years ago

As discussed: agree on all, I would just drop the field references, but keep the extension references.

LienReyserhove commented 6 years ago

Agree for the references!

LienReyserhove commented 6 years ago

@peterdesmet ready for review! You will need to run the script again. Due to the changes in the taxon core (removal of references), I was not able to exclude changes in UTF-8 encoding in the processed files.

peterdesmet commented 6 years ago

Nice! Merge away 🚢

trias-project / ad-hoc-checklist

First mapping #17
