tdwg / gbwg

Genomic Biodiversity Interest Group
Apache License 2.0

Pilot conversion of MIxS TSV into a GBIF/OBIS system #53

Open pbuttigieg opened 3 years ago

pbuttigieg commented 3 years ago

@timrobertson100 @thomasstjerne @cmungall @pieterprovoost @79-6d

Following our meeting today, would you mind scoping out how you'll test exchanging a MIxS TSV for an attempt to auto-convert it into a DwC Archive?

pieterprovoost commented 3 years ago

@79-6d Do you have something that is in MIxS already? If so, I suggest we work with that. It would be great if we can come up with a notebook that does most of the conversion automatically based on what's in the spreadsheet.
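To make the notebook idea concrete, here is a minimal sketch of what the automatic step could look like, using only the Python standard library. It is an illustration under stated assumptions, not an agreed mapping: the `sample_name` ID column, the `CORE_MAP` and `DNA_TERMS` selections, and the assumption that `lat_lon` holds two signed decimal numbers separated by whitespace are all hypothetical. It also produces one row per sample, whereas a real occurrence table would need the taxonomic assignments as well.

```python
import csv
import io

# Hypothetical mapping from MIxS columns to Darwin Core terms (illustration only).
CORE_MAP = {"collection_date": "eventDate"}
# Hypothetical selection of MIxS terms copied unchanged into the DNA extension table.
DNA_TERMS = ("target_gene", "seq_meth")


def split_lat_lon(value):
    """Split an assumed 'lat lon' string of signed decimals into two strings."""
    parts = value.split()
    if len(parts) != 2:
        return "", ""  # leave blank rather than guess at another format
    return parts[0], parts[1]


def convert(tsv_text, id_column="sample_name"):
    """Return (core_rows, dna_rows) built from a MIxS-style TSV string."""
    core_rows, dna_rows = [], []
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    for row in reader:
        sample_id = row[id_column]
        lat, lon = split_lat_lon(row.get("lat_lon") or "")
        core = {
            "occurrenceID": sample_id,
            "decimalLatitude": lat,
            "decimalLongitude": lon,
        }
        for mixs_name, dwc_name in CORE_MAP.items():
            core[dwc_name] = row.get(mixs_name) or ""
        core_rows.append(core)

        # Extension rows point back to the core row through the shared ID.
        dna = {"occurrenceID": sample_id}
        for term in DNA_TERMS:
            dna[term] = row.get(term) or ""
        dna_rows.append(dna)
    return core_rows, dna_rows
```

Calling `convert(open("mixs.tsv").read())` on a file with those columns (the file name is a placeholder) would give two lists of dictionaries that can be written out with `csv.DictWriter` as the core and extension files of an archive.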

ymgan commented 3 years ago

Just realised that the dataset we have uses MIxS v2. I will write to our data provider to ask whether she has something in MIxS v5.

Great idea about working on it in a notebook!

thomasstjerne commented 3 years ago

For reference, here are two examples of test datasets in the GBIF test environment that use the DNA derived data extension. There is one marine and one terrestrial dataset; the latter uses the Extended Measurement or Fact extension in addition to the DNA extension. Be aware that these were prepared as tests and have very minimal EML metadata.

SMHI Baltic Picoplankton (Marine) - about: https://www.ebi.ac.uk/ena/browser/view/PRJEB12362

  1. Dataset in GBIF
  2. A Sampling Event (scroll down to see taxonomic breakdown)
  3. An occurrence (scroll down to see the MIxS data)

Insect mobile (Terrestrial) - about: https://www.biorxiv.org/content/10.1101/2020.11.19.389742v1

  1. Dataset in GBIF
  2. A Sampling Event (scroll down to see taxonomic breakdown)
  3. An occurrence (scroll down to see the MIxS data)

ymgan commented 3 years ago

Awesome!! Would you mind sharing the link to the repo showing how the conversion was done, if that's available?

How would the EML part be addressed? Do you get the data provider to fill in that information, or is it something that will be extracted from the data?

thomasstjerne commented 3 years ago

@79-6d both of these datasets were uploaded by the publishers through IPTs.

The extension is of course not in production, but IPTs running in test mode detect the extension and can map to it. The EML can also be filled in through a form in the IPT. I guess they just didn't spend time on this yet, as this is pre-production.

msweetlove commented 3 years ago

I found a suitable marine 'omics dataset that we can use to look at the conversion from MIxS to DwC in our biodiversity.aq/POLA3R database. It is a microbial dataset where the authors used 16S rDNA amplicon sequencing to profile the community composition of Bacteria and Archaea in marine sediments. I think it's a good representative of a typical (small) microbial DNA-based dataset.

Here is the .xlsx file showing how we formatted it as MIxS; I adapted it to MIxS v5: MIxS_testdataset_PRJNA335729.xlsx

This dataset was published in: Franco, D. C., Signori, C. N., Duarte, R. T., Nakayama, C. R., Campos, L. S., & Pellizari, V. H. (2017). High prevalence of Gammaproteobacteria in the sediments of Admiralty Bay and North Bransfield Basin, Northwestern Antarctic Peninsula. Frontiers in Microbiology, 8, 153.

Here is the link to the IPT: https://ipt.biodiversity.aq/resource?r=antarctic_marine_sediment_microbes

The sequences can be retrieved from here: https://www.ebi.ac.uk/ena/browser/view/PRJNA335729

thomasstjerne commented 3 years ago

Thanks @msweetlove, I can't open the xlsx file in either Excel or Google Sheets. Can you check that the file is as expected?

msweetlove commented 3 years ago

I can open it just fine... Here is a version saved as tab separated txt, does this work?

MIxS_testdataset_PRJNA335729.txt

The original is a csv, but for some reason GitHub doesn't allow that format.
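For anyone who needs to do the same re-export, a comma-separated file can be rewritten as tab-separated text with Python's standard `csv` module. This is only a sketch of one way to do it, not necessarily how the attached file was produced; the function name `csv_to_tsv` is made up for illustration.

```python
import csv
import io


def csv_to_tsv(csv_text):
    """Re-serialise comma-separated text as tab-separated text."""
    reader = csv.reader(io.StringIO(csv_text))
    out = io.StringIO()
    # The csv module handles quoting, so commas inside quoted fields survive.
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    writer.writerows(reader)
    return out.getvalue()
```

Going through the `csv` module rather than replacing commas with tabs directly keeps quoted fields that contain commas intact.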

thomasstjerne commented 3 years ago

Yes, the txt file works. Thanks!

thomasstjerne commented 3 years ago

Is there any taxonomic annotation of the sequences available? The paper says:

At the phylum level, all OTUs could be classified and belonged to 22 formally described bacterial phyla and 18 candidate phyla

But I can't seem to find any information on the classification step (database, thresholds, etc.).

msweetlove commented 3 years ago

I don't have any more information than that... Like most microbial studies, these authors only provide the raw sequence data, because the methods to bin/cluster sequences, detect errors and taxonomically annotate sequences vary widely from lab to lab and the techniques evolve very fast over time... You can always try to contact the authors to ask whether they still have the original OTU tables, or use your own pipeline to annotate the sequences, or you can also request an analysis at MGnify: https://www.ebi.ac.uk/metagenomics/

thomasstjerne commented 3 years ago

OK, thanks. I just wanted to be sure that I didn't overlook anything.

thomasstjerne commented 3 years ago

@msweetlove the dataset is now in the GBIF test environment here:

  1. Dataset in GBIF
  2. A Sampling Event (scroll down to see taxonomic breakdown)
  3. An occurrence (scroll down to see the MIxS data)

This one is using the extension from this repo with the MIxS IRIs.