plazi / arcadia-project


lycophron test set: Staphylinidae images, sequences #234

Open myrmoteras opened 8 months ago

myrmoteras commented 8 months ago

Aslak Kappel Hansen akhansen@snm.ku.dk

Rudolf Meier Rudolf.Meier@hu-berlin.de; Donat Agosti agosti@amnh.org

Hi both,

Thanks for the talk today.

I prepared a test dataset to show what is available. This is not the final data, but it shows what is there and how it is currently structured. The test set currently contains only the first 200 specimens, but it should give an idea of the whole dataset.

https://www.dropbox.com/scl/fo/ryae98rikzynhg26gh7ii/h?rlkey=zjbkgxnq4rpy1bxldysaady03&dl=0

It includes:

What is still missing are genitalia preparation images. I’m still preparing these, but will try to get some examples for you to see.

Please let me know what you think of the current structure or if something should be added or modified.

For the literature I will similarly try to prepare a subset of what is available and share with you as soon as possible.

Cheers, Aslak

slint commented 8 months ago

New message from Aslak:

Hi all,

Thanks for the meeting today.

I have worked on preparing a new set of data taking into account our talk today.

Here is how I see it (see attached figure ZenodoData.jpg):

[Figure: ZenodoData.jpg]

We have a CSV of specimens and their core data (CoreData.xlsx). In addition, we have a number of additional data files that are given together with their metadata; I have attached two examples, ImageData.xlsx (metadata for 2 images) and BarcodeData.xlsx (metadata for 1 barcode). These are linked to the specimen and its core data through a Zenodo COI (correct?), and each of them also gets its own DOI. All CoreData is attached to these additional data files as additional metadata.
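The linking described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the join key is taken to be `catalogNumber` (the identifier mentioned in this thread), the function name `attach_core_data` is hypothetical, and the column names and values other than `GUASTA0000007` are made up for the example rather than taken from the actual files.

```python
import csv
import io

# Illustrative stand-ins for CoreData and ImageData exported as CSV.
# Column names and values are assumptions, not the real file contents.
CORE_CSV = """catalogNumber,basisOfRecord,scientificName
GUASTA0000007,PreservedSpecimen,Staphylinidae
"""

IMAGE_CSV = """catalogNumber,fileName,subjectPart
GUASTA0000007,example_habitus.tif,habitus
GUASTA0000007,example_label.tif,label
"""


def attach_core_data(core_rows, file_rows, key="catalogNumber"):
    """Copy each specimen's core data onto the metadata of its data files.

    Returns (merged, orphans): merged rows carry core fields plus the
    file-specific fields; orphans are file rows whose specimen is unknown.
    """
    core_by_id = {row[key]: row for row in core_rows}
    merged, orphans = [], []
    for row in file_rows:
        core = core_by_id.get(row[key])
        if core is None:
            orphans.append(row)
            continue
        # File-specific values win if a column name appears in both.
        merged.append({**core, **row})
    return merged, orphans


core = list(csv.DictReader(io.StringIO(CORE_CSV)))
images = list(csv.DictReader(io.StringIO(IMAGE_CSV)))
merged, orphans = attach_core_data(core, images)
for record in merged:
    print(record)
```

Keeping the orphans separate instead of raising makes it easy to report data files that point at a specimen missing from the core data before anything is uploaded.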

The basis of the record should thus be the specimen and the core data attached to it. If the record itself needs a data file, this could be an image of either the habitus or the labels. I would suggest the label though, as this would show the identifier (catalogNumber).

CoreData.xlsx is given in Darwin Core; the GBIF required fields (orange) and recommended fields (yellow) for occurrence datasets are highlighted. The first line is an example from GBIF, the second line is for one of my samples (GUASTA0000007). In the example the list is pretty exhaustive; in many cases the data provided may be much more limited. This was done to give an idea of what could be included as metadata.
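A quick way to check a core data export for empty required fields is sketched below. It assumes the required Darwin Core terms for GBIF occurrence datasets are `occurrenceID`, `basisOfRecord`, `scientificName` and `eventDate`; that list should be verified against the orange fields in CoreData.xlsx. The function name `missing_required` and the sample rows (including the dates and occurrence IDs) are made up for illustration.

```python
import csv
import io

# Assumed list of GBIF-required terms; verify against CoreData.xlsx.
REQUIRED = ("occurrenceID", "basisOfRecord", "scientificName", "eventDate")

# Made-up sample rows; the second one lacks an eventDate.
SAMPLE_CSV = """occurrenceID,catalogNumber,basisOfRecord,scientificName,eventDate
example-occ-1,GUASTA0000007,PreservedSpecimen,Staphylinidae,2000-01-01
example-occ-2,EXAMPLE0000002,PreservedSpecimen,Staphylinidae,
"""


def missing_required(csv_text, required=REQUIRED):
    """Map CSV line number -> list of required fields that are empty or absent."""
    reader = csv.DictReader(io.StringIO(csv_text))
    problems = {}
    # The header is line 1, so data rows start at line 2
    # (assuming no field contains an embedded newline).
    for line_no, row in enumerate(reader, start=2):
        missing = [f for f in required if not (row.get(f) or "").strip()]
        if missing:
            problems[line_no] = missing
    return problems


print(missing_required(SAMPLE_CSV))
```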

ImageData.xlsx is mostly given in the Audiovisual Core format. In addition, I think all core data (CoreData.xlsx) should be attached as metadata for these as well. I have given two examples of the same file in two different formats. One could think of other cases, but I think the data will fit within these fields in most cases. Btw, I looked at my .jpg image files and see that the metadata was somehow wiped. In my raw .tif files the metadata is still present.

BarcodeData.xlsx gives one example of metadata for a barcode FASTA file. I could not find a core format for these terms, but have taken inspiration from GenBank. That said, there are a number of fields that are relevant for our data but not present in GenBank. Again, I think all core data should be attached as metadata for these files.

One question I have is whether it would be best to first create the core data of physical specimen and later attach the additional associated files through linking to the core data (specimen) Zenodo DOI.

Lastly, in my overview figure I have added 'gene 2' and '3d scan'. These represent that any number of additional files and types of data relating to the specimen voucher and its core data can be added. This is currently not relevant, but the idea for the future is that all of these can be included and linked as described above (a data file and a metadata file relevant for that data type).

For now I have not touched the clustering any further, as I believe it is a secondary part and something we should discuss in more detail. In many ways the taxonomic treatment can be based on clustering, creating species hypotheses without necessarily attaching species names immediately. Once a cluster is assigned a species name, this can be appended.

I hope all this makes sense. If not I’m happy to answer any questions here or through a zoom talk.

Link to data: https://www.dropbox.com/scl/fo/xzprivugpy6vxit3163za/h?rlkey=8xmpfo9ap0svbnqsvm1b0c9vq&dl=0

Cheers, Aslak

CNaseband commented 8 months ago

Data in .xlsx format is not quite the same as .csv. I wrote an email asking for actual .csv files, as this makes life a lot easier, and from the sound of it converting to an Excel file (.xlsx) is an extra step for them anyway.

slint commented 8 months ago

Here's the CSV we used for the production "Bats" collection upload:

transformed_data.csv

And here's the Google sheet it was exported from (I've made a copy of it so we can actually define on it the new input format, columns, etc.): https://docs.google.com/spreadsheets/d/1TUyDT6yOypX2DBuM_PNUZucFTC93uFlEa7PoAMYvnDI/edit?usp=sharing

CNaseband commented 8 months ago

Future data dumps will be provided in plain .csv files