plazi / arcadia-project


Sprint 3: lycophron MfN upload #236

Open myrmoteras opened 8 months ago

myrmoteras commented 8 months ago
myrmoteras commented 7 months ago

Dear colleagues @slint @CNaseband @lnielsen

Might it be possible to make progress on getting a version of the MfN physical specimen/images lycophron import done, so we can discuss it and, hopefully, publish the entire lot by mid-December?

We need to have this done in order to charge the MfN for the work, which is possible only this year, before mid-December.

Thanks

Donat

myrmoteras commented 6 months ago

@lnielsen

Here are the comments regarding the field names, especially the specimen codes:

Hi Donat, CoreData.xlsx

Just to make sure we understand each other: in the newest dataset (CoreData.xlsx, attached here again) that I sent, I used the Darwin Core terms. This would mean:

SpecimenID = catalogNumber: this is the unique ID given to a single specimen; it is attached to the physical specimen with a label.

LabelID = recordNumber: this is an ID relating to a collecting event, e.g. a Malaise trap or, in my case, sifting of litter from a given area (here multiple specimens with individual catalogNumbers are present, all with a single recordNumber).

occurrenceID: this is the unique ID GBIF uses to handle the data; from what I understand, it can be a DOI (e.g. a Zenodo DOI).

There is no Darwin Core term for barcodes, but the SpecimenBarcode is an ID we assign to the generated genetic barcode. This would be more relevant for genetic databases (e.g. GenBank).
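The mapping described above could be written down as a small lookup table. This is only a sketch: the names `LOCAL_TO_DWC` and `to_darwin_core` are hypothetical, the sample values are made up, and keeping SpecimenBarcode under its local name (because there is no Darwin Core term for it) is an assumption.

```python
# Hypothetical mapping from the local column names to Darwin Core terms,
# following the explanation above.
LOCAL_TO_DWC = {
    "SpecimenID": "catalogNumber",  # unique ID on the label of a single specimen
    "LabelID": "recordNumber",      # ID of the collecting event
}


def to_darwin_core(row):
    """Rename known local columns; leave all other columns unchanged."""
    return {LOCAL_TO_DWC.get(key, key): value for key, value in row.items()}


example = to_darwin_core({
    "SpecimenID": "SPEC-0001",      # made-up value
    "LabelID": "EVENT-01",          # made-up value
    "SpecimenBarcode": "BC-0001",   # made-up value, no Darwin Core equivalent
})
```

Columns without an entry in the table pass through untouched, so the barcode ID is not lost while its proper home is still under discussion.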

If necessary I could schedule a meeting with people in GBIF (while I’m in Copenhagen) to make sure these are correctly understood.

Hope this answers the question, otherwise let me know.

Cheers, Aslak

On 14 Dec 2023, at 15.25, Donat Agosti [agosti@amnh.org](mailto:agosti@amnh.org) wrote:

Hi Aslak, a question regarding the Guatemala beetles: can you explain the various label and collection codes?

SpecimenID
LabelID
SpecimenBarcode
Accession Number = catalogueNumber

recordNumber http://rs.tdwg.org/dwc/terms/recordNumber
catalogNumber http://rs.tdwg.org/dwc/terms/catalogNumber

occurrenceID http://rs.tdwg.org/dwc/terms/Occurrence

How are they related? Do you know the Darwin core terms for them? Do they refer to a single specimen or an assortment of specimens collected via leaf litter?

Thanks for a brief feedback, Donat

myrmoteras commented 6 months ago

Hi Donat and Lars,

Thanks for this.

Some quick remarks.

Here are the most recent files: https://www.dropbox.com/scl/fo/xzprivugpy6vxit3163za/h?rlkey=8xmpfo9ap0svbnqsvm1b0c9vq&dl=0

CoreData.xlsx is given in Darwin Core. The fields GBIF requires (orange) and recommends (yellow) for occurrence datasets are highlighted. The first line is an example from GBIF; the second line is for one of my samples (GUASTA0000007). The example is pretty exhaustive, and in many cases the data provided may be much more limited. It is included to give an idea of what could be added as metadata.
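A check for the required fields could look roughly like the sketch below. The list in `REQUIRED_TERMS` is an assumption based on GBIF's published requirements for occurrence datasets and should be verified against the orange-highlighted columns in CoreData.xlsx; the function name and the sample record are made up for illustration.

```python
# Assumed list of GBIF-required terms for occurrence datasets; verify against
# the orange-highlighted columns in CoreData.xlsx before relying on it.
REQUIRED_TERMS = ("occurrenceID", "basisOfRecord", "scientificName", "eventDate")


def missing_required(record, required=REQUIRED_TERMS):
    """Return the required terms that are absent or blank in a record."""
    return [term for term in required
            if not str(record.get(term) or "").strip()]


record = {
    "occurrenceID": "example-occurrence-1",  # made-up value
    "basisOfRecord": "PreservedSpecimen",
    "scientificName": "",                    # blank on purpose
    "catalogNumber": "example-0001",         # made-up value
}
missing = missing_required(record)  # ['scientificName', 'eventDate']
```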

ImageData.xlsx is mostly given in the Audiovisual Core format. Additionally, I think all core data (CoreData.xlsx) should be attached as metadata for these files as well. I have given two examples of the same file in two different formats. One could think of other cases, but I think the data will fit within these fields in most cases. By the way, I looked at my .jpg image files and saw that the metadata was somehow wiped; in my raw .tif files the metadata is still present.
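Attaching the core data to each image record could be sketched as a simple join. This assumes, without confirmation from the spreadsheets, that both tables share a `catalogNumber` column; the function name and all sample values are hypothetical.

```python
def attach_core_metadata(image_rows, core_rows, key="catalogNumber"):
    """Merge each image row with the core row that shares the same key."""
    core_by_key = {row[key]: row for row in core_rows}
    merged = []
    for image in image_rows:
        core = core_by_key.get(image.get(key), {})
        # Image fields win on a name clash so the image's own metadata is kept.
        merged.append({**core, **image})
    return merged


core_rows = [{"catalogNumber": "example-0001", "recordNumber": "event-01"}]
image_rows = [
    {"catalogNumber": "example-0001", "identifier": "image-a.tif"},
    {"catalogNumber": "example-9999", "identifier": "image-b.tif"},  # no core row
]
merged_rows = attach_core_metadata(image_rows, core_rows)
```

An image without a matching core row is kept as it is, which makes gaps in the core data easy to spot.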

BarcodeData.xlsx gives one example of metadata for a barcode FASTA file. I could not find a core format for these terms, but have taken inspiration from GenBank. That said, there are a number of fields that are relevant for our data but not present in GenBank. Again, I think all core data should be attached as metadata for these files.

I hope these comments help with an improved second version.

Cheers, Aslak

myrmoteras commented 6 months ago

Hi Aslak

Thanks, this is very helpful and we will update and reload.

It is very helpful to have your resolution of all the various identifiers attached to a specimen. This is important to document well, so as not to introduce confusion later.

Regarding the physical object: we need to have a digital object to upload as the evidence of the deposit. In Torsten’s case it is just a txt file including the catalogNumber; it could also be an image of the label. In your case, do you have only one specimen on a pin, unlike what we do with ants?

Thanks as well for the data for the FASTA file. I tried to find the metadata and vocabularies to be added through my channels, but did not succeed. Lars and I discussed what to do with it, but because we didn’t have the right metadata, and also because of the question of why we would want to create a parallel system to GenBank or BOLD, we added it to the physical object. Logically, I would create individual deposits, even though that creates a lot of files. If we decide to do so, we should consider writing applications that retrieve the sequence in a format that can readily be used for analysis.
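As a rough idea of what retrieving a sequence for analysis might involve, here is a minimal FASTA reader. It is only a sketch: the function name and the sample headers and sequences are made up, and nothing here reflects how the deposits are actually stored.

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


# Made-up sample data for illustration only.
sample = """>specimen-1 made-up header
ACGTAC
GTTA
>specimen-2
TTGACA
""".splitlines()
records = dict(read_fasta(sample))
```

Sequences that span several lines are joined into one string, so the result can be passed straight to downstream analysis code.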

Lars and I will start writing documentation after the new year and, if we agree, then start the next upload.

The related works problem is a sandbox issue; it will work once it is live.

All the best Donat