tdwg / gbwg

Genomic Biodiversity Interest Group
Apache License 2.0

Basic test CALeDNA record for DwC and ABCD #62

Open gdadade opened 3 years ago

gdadade commented 3 years ago

I did a test mapping using the event core for one environmental sample record with 10 example identifications from GGBN's partner CALeDNA.

Thoughts for discussion:

  1. Besides the missing BasisOfRecord in the event core, the following important collection terms are also missing: collectionCode, catalogNumber
  2. Which term should I use when providing the collectionObjectGUID? I guess materialSampleID, which is also not included in the event core.
  3. Since I want to add scientific name information I'd rather use the identification extension, but unfortunately it cannot be used with the event core. I also don't know whether this extension supports preferred=true for ALL identifications related to one event.
  4. I've used the occurrence extension as suggested by GBIF, but many of its fields are redundant with the event core. In addition, if I want to use collectionCode etc. here, I have to duplicate them a million times, once for each scientific name. To overcome this I wanted to use both the occurrence and the identification extension, but see above.
  5. Since this is a basic test, I did not include the Resource Relationship Extension yet
  6. How to add a sequence for each scientific name? Since the identification is based on BLAST this is important information.
  7. Note: I did not use the DNA derived data extension yet, as all parameters needed for this test mapping already exist in the GGBN extensions.

dwca-caledna_test-v1.1.zip

For comparison, see the ABCD file for the same test record:

  1. Basic collection object information provided only once
  2. Scientific names mapped in the identification class, where they belong
  3. PreferredFlag for identifications not used -> preferred = true for all (default, supported by BioCASe and GGBN)
  4. Sequences can be added to each scientific name when using the GGBN extension
  5. Note: no GGBN extension and no UnitAssociation were used for this basic mapping; we use ABCD2.1 (and an interim version for GGBN until ABCD3.0 is ready for use). I can't upload XML files, so I zipped the ABCD file.

calednaabcdggbn.zip

thomasstjerne commented 3 years ago

In DwC I would use Occurrence core rather than Event core in order to be able to attach DNA sequences to the identifications through either the GGBN amplification extension or the DNA derived data extension (once it is available). This will of course duplicate Event fields across Occurrence rows, but otherwise you can't link the DNA sequences.

In the Occurrence core file I would have the following terms:

gdadade commented 3 years ago

So if I were to use the same eventID for all "occurrences" that belong together, GBIF would recognize this as a sampling event? If so, why do we need an event core then? This would still mean I have to duplicate the primary occurrence data one million times if I have one million taxa in a sample.

thomasstjerne commented 3 years ago

Yes, that would be recognized as an event. Example

The reason we need Event core is that we lose the information about parent events when flattening event data to Occurrences. In a future model richer than the star schema, we want to be able to model hierarchical events.

But for now you can only choose one Core, i.e. you either get rich occurrences with DNA sequences (Occurrence core) or you avoid data duplication (Event core).
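To make the trade-off concrete, here is a minimal Python sketch of what flattening one sample into Occurrence core rows looks like, and how the shared eventID still lets the rows be grouped back into one event. The event values, taxon names, occurrenceID pattern and helper function names are all made up for illustration; only the column names are Darwin Core terms. It is not how GBIF or the IPT implement this.

```python
import csv
import io
from collections import defaultdict

# Hypothetical event-level fields for one environmental sample.
event = {
    "eventID": "sample-001",
    "eventDate": "2018-01-01",
    "decimalLatitude": "34.0",
    "decimalLongitude": "-118.0",
}

# Hypothetical identifications derived from that one sample.
identifications = ["Taxon A", "Taxon B", "Taxon C"]


def flatten_to_occurrences(event, names):
    """Build one Occurrence row per name, copying every event field into each row."""
    rows = []
    for i, name in enumerate(names, start=1):
        row = dict(event)  # event fields are duplicated here, once per identification
        row["occurrenceID"] = f"{event['eventID']}:occ-{i}"  # made-up ID pattern
        row["basisOfRecord"] = "MaterialSample"
        row["scientificName"] = name
        rows.append(row)
    return rows


def group_by_event(rows):
    """Recover which occurrences belong together by their shared eventID."""
    groups = defaultdict(list)
    for row in rows:
        groups[row["eventID"]].append(row["occurrenceID"])
    return dict(groups)


def to_tsv(rows):
    """Serialize the rows as tab-separated text, e.g. for an occurrence.txt file."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=list(rows[0].keys()), delimiter="\t")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


rows = flatten_to_occurrences(event, identifications)
print(to_tsv(rows))
print(group_by_event(rows))
```

Every row repeats eventDate and the coordinates, which is the duplication discussed above; with a million taxa in one sample the same event values would be written a million times. Grouping by eventID recovers the flat event, but nothing in these rows can express a parent event.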

gdadade commented 3 years ago

OK, thanks. If I click on "77 occurrences" it takes me to the occurrences, but the parameter "event_id" disappears from the URL and all 30,091 occurrences are shown. Is this not yet implemented?

thomasstjerne commented 3 years ago

That was a bug in the portal - fixed now.