Open pbuttigieg opened 3 years ago
@79-6d Do you have something that is in MIxS already? If so, I suggest we work with that. It would be great if we could come up with a notebook that does most of the conversion automatically based on what's in the spreadsheet.
I just realised that the dataset we have is from MIxS v2. I will write to our data provider to ask if she has something from MIxS v5.
Great idea about working on it in a notebook!
For reference, here are two examples of test datasets in the GBIF test environment that use the DNA derived data extension. One is marine and one is terrestrial; the latter uses the extended measurement or fact extension in addition to the DNA extension. Beware that these were prepared as tests and have very minimal EML metadata.
SMHI Baltic Picoplankton (Marine) - about: https://www.ebi.ac.uk/ena/browser/view/PRJEB12362
Insect mobile (Terrestrial) - about: https://www.biorxiv.org/content/10.1101/2020.11.19.389742v1
Awesome!! Would you mind sharing the link to the repo showing how the conversion is made, if that's available?
How would the EML part be addressed? Do you get the data provider to fill in that information, or is it something that will be extracted from the data?
@79-6d both of these datasets were uploaded by the publishers through IPTs.
The extension is of course not in production, but IPTs running in test mode detect the extension and can map to it. The EML can also be filled in through a form in the IPT. I guess they just didn't spend time on this yet, as this is pre-production.
I found a suitable marine 'omics dataset that we can use to look at the conversion from MIxS to DwC in our biodiversity.aq/POLA3R database. It is a microbial dataset where the authors used 16S rDNA amplicon sequencing to profile the community composition of Bacteria and Archaea in marine sediments. I think it's a good representative of a typical (small) microbial DNA-based dataset.
Here is the .xlsx file showing how we formatted it as MIxS; I adapted it to MIxS v5: MIxS_testdataset_PRJNA335729.xlsx
This dataset was published in: Franco, D. C., Signori, C. N., Duarte, R. T., Nakayama, C. R., Campos, L. S., & Pellizari, V. H. (2017). High prevalence of Gammaproteobacteria in the sediments of Admiralty Bay and North Bransfield Basin, Northwestern Antarctic Peninsula. Frontiers in Microbiology, 8, 153.
Here is the link to the IPT: https://ipt.biodiversity.aq/resource?r=antarctic_marine_sediment_microbes
The sequences can be retrieved from here: https://www.ebi.ac.uk/ena/browser/view/PRJNA335729
Thanks @msweetlove, I can't open the xlsx file in either Excel or Google Sheets. Can you check that the file is as expected?
I can open it just fine... Here is a version saved as tab separated txt, does this work?
MIxS_testdataset_PRJNA335729.txt
The original is a csv, but for some reason GitHub doesn't allow that format.
Yes, the txt file works for me. Thanks!
Is there any taxonomic annotation of the sequences available? The paper says:
"At the phylum level, all OTUs could be classified and belonged to 22 formally described bacterial phyla and 18 candidate phyla"
But I can't seem to find any information on the classification step (database, thresholds, etc.).
I don't have any more information than that... Like most microbial studies, these authors only provide the raw sequence data, because the methods to bin/cluster sequences, detect errors and taxonomically annotate sequences vary widely from lab to lab, and the techniques evolve very fast... You can always try contacting the authors to ask whether they still have the original OTU tables, use your own pipeline to annotate the sequences, or request an analysis at MGnify: https://www.ebi.ac.uk/metagenomics/
OK, thanks. I just wanted to be sure that I didn't overlook anything.
@msweetlove the dataset is now in the GBIF test environment here:
This one uses the extension from this repo with the MIxS IRIs.
@timrobertson100 @thomasstjerne @cmungall @pieterprovoost @79-6d
Following our meeting today, would you mind scoping out how you'll test exchanging a MIxS TSV for an attempt to auto-convert it into a DwC Archive?
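As a starting point for scoping, here is a minimal Python sketch of what the first step of such a notebook could look like: reading a MIxS-style TSV and renaming a few columns to Darwin Core terms. The `MIXS_TO_DWC` mapping, the function names, and the assumption that `lat_lon` is written as `62.10 S 58.40 W` are all illustrative and not an agreed mapping; the real column choices would need to come from the test dataset and the extension definition.

```python
import csv
import io

# Hypothetical, illustrative mapping from MIxS column names to Darwin Core terms.
# This is NOT an agreed crosswalk; adjust it to the actual spreadsheet.
MIXS_TO_DWC = {
    "sample_name": "materialSampleID",
    "collection_date": "eventDate",
    "geo_loc_name": "verbatimLocality",
}


def split_lat_lon(value):
    """Parse a 'lat_lon' string such as '62.10 S 58.40 W' into signed decimals.

    Assumes the four-token form 'lat N|S lon E|W'; other layouts raise ValueError.
    """
    lat, ns, lon, ew = value.split()
    lat_f = float(lat) * (-1 if ns.upper() == "S" else 1)
    lon_f = float(lon) * (-1 if ew.upper() == "W" else 1)
    return lat_f, lon_f


def mixs_row_to_dwc(row):
    """Convert one MIxS row (a dict) into a dict keyed by Darwin Core terms."""
    # Copy over only the mapped columns that are present and non-empty.
    dwc = {term: row[col] for col, term in MIXS_TO_DWC.items() if row.get(col)}
    if row.get("lat_lon"):
        lat, lon = split_lat_lon(row["lat_lon"])
        dwc["decimalLatitude"] = lat
        dwc["decimalLongitude"] = lon
    return dwc


def convert_mixs_tsv(text):
    """Read tab-separated MIxS text and return a list of Darwin Core row dicts."""
    reader = csv.DictReader(io.StringIO(text), delimiter="\t")
    return [mixs_row_to_dwc(row) for row in reader]


if __name__ == "__main__":
    # Made-up sample row, only to show the shape of the input and output.
    sample = (
        "sample_name\tcollection_date\tlat_lon\n"
        "S1\t2010-01-15\t62.10 S 58.40 W\n"
    )
    for record in convert_mixs_tsv(sample):
        print(record)
```

This only covers the tabular part. A full DwC Archive would also need a `meta.xml` descriptor and the EML metadata, which this sketch does not attempt to produce.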