microbiome / microbiomeDataSets

Experiment Hub based microbiome datasets
https://bioconductor.org/packages/3.13/microbiomeDataSets/

additional datasets + submission? #3

Closed FelixErnst closed 3 years ago

FelixErnst commented 3 years ago

Do you have any additional datasets you would like to add (or have already added) to microbiomeDataSets?

@antagomir @dombraccia @microsud @fionarhuang

Is there anything against submitting the datasets first, after some cleanup of the Authors section in both DESCRIPTION and the man pages? Let me know if the current info needs to be amended (quick PR).

PS: Let's stick for now to strictly microbiome datasets (multi-omics are fine as well); metagenomic ones go somewhere else 😄

microsud commented 3 years ago

For now, this seems okay. Later we may need to identify appropriate dense time-series and MultiOmics data in literature.

antagomir commented 3 years ago

Do you mean precalculated abundance tables + side info on rows and cols with "strictly microbiome data sets"?

In the microbiome pkg we have the datasets dietswap, atlas1006, and peerj32; these are all from my previously published papers, and they have all been measured with the HITChip 16S phylogenetic microarray. It is a different measurement technology, but it generates taxonomic abundance tables that can be represented in TSE. Including these might be good because they are used in many of our examples (not critical) and because it can help to demonstrate that this framework is applicable to any taxonomic abundance table data (whether microarray or sequencing based). On the other hand, data from phylogenetic microarrays may confuse those who are not familiar with this. -> I have now solved this by adding a more explicit mention and references regarding the measurement technology.

antagomir commented 3 years ago

All ok now. I am just wondering if the current naming scheme (LahtiML - LahtiWA - OKeefeDS) is sufficiently clear. It is systematic at least, but is this required for ExperimentHub? Many R data sets are known by a simpler name (starting from "iris").

antagomir commented 3 years ago

Many widely-known demo data sets are available in phyloseq format in MicrobeDS. Providing these as TSE might also be quite useful. Not critical since the conversion can be done with a single line with makeMicrobiomeExperimentFromphyloseq (I guess that function name should be updated now as ME was abandoned as a class?).

Including some of these might be useful because having only HITChip-based demo data sets may be confusing.

FelixErnst commented 3 years ago

I think that the fact that the three datasets have been measured with the HITChip microarray isn't really a problem. I guess that the data types are the same as in any microbiome data set.

Regarding the structure of the data, ExperimentHub comes with a few requirements one has to dig through, starting here. Best practice is to keep the overhead as small as possible while keeping the data easily usable. So in this case, the data is chopped up into csv files and reassembled on the fly. This allows individual pieces of data to be retrieved and remixed where needed. The LahtiML data is a good example: the data can be retrieved either as MAE or TSE, depending on whether the lipid data is included or not. makeMicrobiomeExperimentFromphyloseq is a useful function for a user, but in this case there are a few more steps to generate the individual files for the experiment hub (they are currently located in inst/extdata/hub).
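The "plain files in, object assembled on request" idea can be sketched in base R. This is only an illustration under assumptions: the file names, the `read_matrix()` and `assemble_dataset()` helpers, and the toy values are all hypothetical, and a plain list stands in for the TSE or MAE that the package would actually construct.

```r
# Hypothetical sketch: each component lives in its own CSV file and is
# combined only when a dataset is requested.
hub_dir <- file.path(tempdir(), "hub_demo")
dir.create(hub_dir, showWarnings = FALSE)

# Toy components (made-up values, not real data).
counts <- matrix(c(10, 0, 3, 7, 5, 1), nrow = 3,
                 dimnames = list(paste0("taxon", 1:3), c("s1", "s2")))
samples <- data.frame(group = c("A", "B"), row.names = c("s1", "s2"))
lipids <- matrix(c(0.2, 0.4, 1.1, 0.9), nrow = 2,
                 dimnames = list(c("lipid1", "lipid2"), c("s1", "s2")))

write.csv(counts, file.path(hub_dir, "counts.csv"))
write.csv(samples, file.path(hub_dir, "coldata.csv"))
write.csv(lipids, file.path(hub_dir, "lipids.csv"))

# Read a CSV with row names in the first column back into a matrix.
read_matrix <- function(path) {
  as.matrix(read.csv(path, row.names = 1, check.names = FALSE))
}

# Assemble the requested components; the lipid table is optional, which
# mirrors the "TSE or MAE" choice described above (a list is a stand-in).
assemble_dataset <- function(dir, with_lipids = FALSE) {
  out <- list(
    counts = read_matrix(file.path(dir, "counts.csv")),
    coldata = read.csv(file.path(dir, "coldata.csv"), row.names = 1)
  )
  if (with_lipids) {
    out$lipids <- read_matrix(file.path(dir, "lipids.csv"))
  }
  out
}

microbiome_only <- assemble_dataset(hub_dir)
multi_omics <- assemble_dataset(hub_dir, with_lipids = TRUE)
```

Because only the assembly function knows about the final object layout, a change in that layout would mean editing `assemble_dataset()` while the stored files stay untouched.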

What is not a requirement is how the data is presented to the user, i.e. the function name. There are a few experiment hub packages out there which use a similar nomenclature (author + data type), so I just used that. But this can be changed at any time (even after submission). ExperimentHub package version numbers also don't increase with every release.

Regarding MicrobeDS I have mixed feelings: the package itself is of course open source (CC0), but the origin of the data is behind a login-protected website. We could add the data very easily, but I am not sure right now.

antagomir commented 3 years ago

I would tend to propose a more intuitive data set naming scheme; for instance, OKeefeDS would become dietswap - I also do not think that it is a problem to have the same name as these data sets had in the microbiome pkg, because the object class is obvious in R. Anyway, I am not sure how necessary this change would be, so if there is a general reference to the author name-based naming scheme then that should be fine too, I guess.

FelixErnst commented 3 years ago

The dietswap name is of course good, since you are using it in many examples. However, two things have to be kept in mind:

If you want, you can add this function by adding these lines to the datasets on the branch you are working on:

#' @rdname the-dataset-man-page-name
#' @export
dietswap <- DataSetFunctionName

After rebuilding the package locally, you can test this by calling dietswap()
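As a self-contained illustration of the alias pattern (outside a package, so without the roxygen tags), the sketch below uses a hypothetical loader `OKeefeDSDemo()` as a stand-in for the real dataset function; its toy return value is made up.

```r
# Hypothetical stand-in for a dataset loader function.
OKeefeDSDemo <- function() {
  data.frame(sample = c("s1", "s2"), group = c("a", "b"))
}

# The alias is simply a second name bound to the same function object.
dietswap <- OKeefeDSDemo

# Both names give the same result, so existing examples that use the
# short name keep working next to the systematic name.
same_result <- identical(dietswap(), OKeefeDSDemo())
```

Since the alias adds no code of its own, it can be introduced for established datasets without touching how the data is stored or assembled.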

FelixErnst commented 3 years ago

@antagomir I merged #4

Do you want to introduce the aliases as well?

dombraccia commented 3 years ago

Hey @FelixErnst ,

I am unsure what the difference between microbiome and metagenomic data sets is. To my knowledge, a microbiome dataset is just metagenomic information taken from the gut or oral cavity or somewhere that is considered a microbiome. Clarification on this point would be appreciated.

Also, are you looking to fill a gap that curatedMetagenomicData does not already fill? They seem to have lots of samples already available and it looks like they are going to be adding more in the near future. Still happy to help if there are some data that would be worth processing and including.

FelixErnst commented 3 years ago

That was actually what I was hinting at. curatedMetagenomicData contains datasets that, as the name says, originate from metagenomic studies. So I think there is a gap there for 16S and other sources that identify microbiome composition from "single" genes.

In addition, the data structure they use is not immediately compatible with what I am aiming for. The fact that they store actual R objects hinders development, since the object structure needs to be fixed before data can be uploaded. With the approach in this package, we are assembling the objects from general data structures (csv, DataFrame and, in the future, (newick?) tree formats and fasta files), so that if a class definition gets updated, we don't need to update the data, just the way it is assembled into a TSE or other objects such as MAE. For big datasets, we can also use h5 files without a problem.

So to sum up: a bit of both 😃

edit: the MicrobeDS package actually looks more closely related to this data package than curatedMetagenomicData does, am I right?

antagomir commented 3 years ago

Yes, the MicrobeDS package is more like this one, at least when compared to curatedMetagenomicData.

Aliases: ok, these could be added, but I think I am a bit biased in giving opinions here since I already got used to those data set names. Hence I wanted to hear feedback. Ok, I can do this if it doesn't matter to anyone. But shall we change the names, or make aliases? I would just change the names to keep it simple (if we change the naming scheme in the first place)?

FelixErnst commented 3 years ago

My focus is clearly on future proofing this. I know that first author + data type abbreviation also has its limitations, but with just the data type we would run out of options sooner than with the "new" scheme, and would potentially need to resort to dietswap2 or other means to distinguish the data.

From my point of view aliases would be best for the established datasets and using the "new" scheme for additional datasets.

antagomir commented 3 years ago

Ok let's do aliases. These can be done for any data set always.

microsud commented 3 years ago

I think the scope of microbiomeDataSets requires some discussion. curatedMetagenomicData is well defined: the authors take raw sequencing data and process it (bioinformatically) in identical ways, which is important for meta-analysis. Currently, microbiomeDataSets does not define this explicitly. 16S rRNA data can be processed in different ways (dada2, deblur, OTU approach) to get to abundance tables. Without a proper SOP, these data are unique to each study and cannot be compared or used for meta-analysis. If the scope of microbiomeDataSets is to provide example datasets for development of mia and related tools, this is OK, but if the aim is to provide a 16S-based curatedData, then we have to set some standards for raw sequencing data processing, minimal metadata information, sequencing platform, variable regions sequenced, DNA extraction methods, etc. In addition, specifying what multi-omics means is important. It can be 16S rRNA survey + metatranscriptomics + other omics, or it can be WGS metagenomics + metatranscriptomics + other omics, or other combinations.

You may be aware of these, but I add them here for reference. Regarding terminology and accepted standard definitions:
Microbiome definition re-visited: old concepts and new challenges
The vocabulary of microbiome research: a proposal

E.g. Resources with common 16S rRNA sequencing data processing:
HMP2Data
QIITA
expanded Human Oral Microbiome Database (eHOMD)
microbiomeHD (several of these are made available as phyloseq objects in the microbiomeutilities pkg with permission from the first author)

If we aim for 16S-based curatedData, we need to identify a single tool, or multiple tools (for comparison), to process the raw data into abundance tables, which can then be made available via microbiomeDataSets. For this, we can also collaborate with developers of such tools. Let me know your thoughts on this. Edit: Processing of raw sequences using dada2 can be an option since it is also on BioC.

FelixErnst commented 3 years ago

That's a valid point and maybe I am lacking some commonly used vocabulary. I am sorry for that.

With this package I didn't have in mind that we pursue the curated datasets road, but rather that we provide datasets from studies for examples, tutorials and maybe comparison, where possible.

I don't have the time to curate datasets and I think this is really hard to do, since sequencing technologies (let's include arrays here as well) have changed dramatically. However, if someone thinks there is value in doing it, I am happy to provide some technical support, if ExperimentHub or other public storage locations are used.

microsud commented 3 years ago

No issues! I also think the curating part is a big task and currently out of scope for this project. Maybe something for the future ;) The HMP2Data is a good resource for tutorials etc. and can be easily converted to TSE.

FelixErnst commented 3 years ago

Do you know jstansfield?

microsud commented 3 years ago

Nope :(

FelixErnst commented 3 years ago

HMP2Data contains some interesting data. However, in this package the data is stored as R objects as well, so it is not necessarily future proof.

Should we go ahead with the submission? Please keep in mind that it usually takes a month.

antagomir commented 3 years ago

It would be good to have some other (sequencing-based) example data sets besides the HITChip phylogenetic microarray data, because that is a somewhat unconventional measurement platform and might be confusing, as most users will have sequencing-based data. On the other hand, new data can be added once the package has passed the review.

FelixErnst commented 3 years ago

Quick question on the HMPv13/v35 dataset in MicrobeDS:

I know the source of the count table and annotation. However, I am not sure where the originates from. Does anybody have additional insight into its origin?

microsud commented 3 years ago

Sorry, I am unaware of the details...

FelixErnst commented 3 years ago

Something else to consider is the fact that HMP data is already available from Bioconductor:

http://bioconductor.org/packages/release/data/experiment/html/HMP16SData.html

Sadly it is hardcoded as a SummarizedExperiment object and the tree data is not on ExperimentHub, so there is no easy way of creating a TSE.

My guess would be that an additional dataset would get flagged during submission. In addition, I am not sure who deposited the raw files used in HMP16SData under the URL http://downloads.ihmpdcc.org/data/HMQCP/* , or why. It seems to be a URL that is not generally used, so I guess the data there was custom made.

FelixErnst commented 3 years ago

The SongQA dataset was added and the PR is open. The HMPC datasets also got into a separate branch, but I am not sure whether to merge or not.

microsud commented 3 years ago

We can keep SongQA for now. HMPC can go in later if required. I will create a new issue to highlight potential datasets for different types of research questions. I can also help with incorporating them in this pkg.

FelixErnst commented 3 years ago

It would be great if you could add some datasets. I noticed that you have worked on some other dataset packages previously, so maybe those can be added as well. Feel free to open a PR. If you need some pointers for the data preparation, let me know.

I am waiting on the review from Leo @antagomir and I will merge the SongQA PR then.

antagomir commented 3 years ago

Review done. Indeed we need (good) demo data sets and if @microsud can make some PRs that would be great!

FelixErnst commented 3 years ago

I am in the process of updating other data, so I am already set up to get this data into ExperimentHub as well. Any comment on this? How do you see the current state?

Should we wait for the answer on the Sailani dataset and then move to the submission?

antagomir commented 3 years ago

Submitting ok to me, good to get fwd

microsud commented 3 years ago

Yup we can submit. Sailani dataset can wait for now...

FelixErnst commented 3 years ago

> Yup we can submit. Sailani dataset can wait for now...

OK. We can add the data afterwards as well. However, to make this clear to the Bioc team, I will restructure some things. You might need to update the iPOP branch in that event.

microsud commented 3 years ago

Yes. No issues!

FelixErnst commented 3 years ago

I uploaded the data to the AWS bucket in preparation for the submission process.

For reference, more information on how I did it can be found here: http://www.bioconductor.org/packages/release/bioc/vignettes/ExperimentHub/inst/doc/CreateAnExperimentHubPackage.html#uploading-data-to-s3

FelixErnst commented 3 years ago

see https://github.com/Bioconductor/Contributions/issues/1853