qiita-spots / qiita

Qiita - A multi-omics databasing effort
http://qiita.microbio.me
BSD 3-Clause "New" or "Revised" License

Pull EBI study into Qiita #2322

Open sjanssen2 opened 7 years ago

sjanssen2 commented 7 years ago

Our lab's strength is that we can easily do meta-analyses of existing studies. Currently, we can do that without much effort for all studies in Qiita. Sadly, not everyone is using Qiita, so there are exciting studies whose sequence files are publicly available in EBI but not in Qiita. It would be awesome if there were a way to pull those EBI projects into Qiita as new studies (sequence data + metadata) to extend the pool of samples for meta-analysis.

I am asking because Sarkis' lab wants to do a meta-analysis of existing PD studies, and at least two of them are in EBI but not in Qiita. Thinking a little ahead, we would want to compare not only PD but also other autoimmune diseases, so it is very likely that we will want to add even more projects.

antgonza commented 7 years ago

👍

My main question is how to actually do this. Some tentative steps:

  1. User creates study
  2. User can add Study EBI accession (should this be the only point of entry?)
  3. A qiita plugin (or should this be part of Qiita?) validates and retrieves all samples/runs and sorts them by prep (experiment accession) and adds everything to Qiita. At the end you create a study with a sample and (possibly) multiple preps and all the runs as sandboxed. Obviously, accessions are added to the study so it's shown as submitted.
  4. The user can ask an admin to make it public.

josenavas commented 7 years ago

I think this is a good idea, and I also like the general implementation structure that @antgonza proposes. However, I think the entry point should be different: instead of the user creating the study, add a new "Import study" page where the user provides the Study EBI accession number. The title and most of the other information are already available in EBI, so we should be able to populate those entries.

In addition, when the accession number is provided, the system should check if that EBI accession number is already present in Qiita - this way we avoid duplicated studies (for example, if the study is still sandboxed and needs to be made public).

Finally, I think the code can live in Qiita (as the current code for EBI submission lives in Qiita), and use the internal plugin system (i.e. Qiita as a Qiita plugin :P).
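
The import flow discussed above could be sketched roughly as follows. This is only an illustration, assuming the run listing for a study has already been retrieved from EBI as tab-separated text. The column names `run_accession` and `experiment_accession`, the helpers `is_already_imported` and `group_runs_by_experiment`, and all accessions shown are assumptions made for this example, not Qiita's actual API.

```python
import csv
import io
from collections import defaultdict


def is_already_imported(study_accession, known_accessions):
    """Return True if the EBI study accession is already attached to a study.

    `known_accessions` stands in for a lookup against Qiita's database;
    it is a hypothetical placeholder, not a real Qiita call.
    """
    wanted = study_accession.strip().upper()
    return wanted in {a.strip().upper() for a in known_accessions}


def group_runs_by_experiment(tsv_text):
    """Group run accessions by experiment accession (one prep per experiment)."""
    preps = defaultdict(list)
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    for row in reader:
        preps[row["experiment_accession"]].append(row["run_accession"])
    return dict(preps)


# Made-up accessions, for illustration only.
example_tsv = (
    "run_accession\texperiment_accession\n"
    "ERR0000001\tERX0000001\n"
    "ERR0000002\tERX0000001\n"
    "ERR0000003\tERX0000002\n"
)

# Only import when the accession is not already known, to avoid duplicates.
if not is_already_imported("PRJEB00000", known_accessions=["PRJEB11111"]):
    preps = group_runs_by_experiment(example_tsv)
    for experiment, runs in sorted(preps.items()):
        print(experiment, runs)
```

Doing the duplicate check before any download means a study that is still sandboxed is detected without fetching anything from EBI.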

adswafford commented 7 years ago

Daniel and I had a conversation about this issue last week and early this week where we had an alternative idea which was to start pulling in all the 16S V4 studies from EBI (that we didn't submit already ourselves) and making them public so we don't have data duplication or rely on individual users to pull them in manually. If we want to semi-automate things and let user-desire drive the prioritization of which studies to pull in then I think the implementation described above is fine, but if we instead want to just set things up to pull it all in then this could save us the time of making a UI for users to do it.

jdereus commented 7 years ago

And what is the volume of data that you would anticipate from a pull of all 16S V4 studies?

antgonza commented 7 years ago

Right, that's another option. Some things to consider about this approach are (1) storage, (2) how often do you "crawl" EBI for new studies?, and (3) should everything be downloaded or should we have some kind of prioritization (public in EBI doesn't mean published or nice to have)?

wasade commented 7 years ago

To add: we would not retain the EBI sequence data locally for those fetches, just process it and keep the summarized results (e.g. biom).

wasade commented 7 years ago

So storage is negligible

adswafford commented 7 years ago

If we just keep the biom files though we can't guarantee consistent processing. However, this brings up a related issue: once sequences are processed, should we have a way for users to delete the original FASTQ files since they're not needed? I can see the rationale for hanging onto the demuxed files but not the original.

jdereus commented 7 years ago

But there would still be an associated spike in storage at some point.

What does temp storage look like? Is only a single study pulled down and processed, its samples "discarded", and then we move on to the next one?

jdereus commented 7 years ago

Is there a situation where someone else comes along, views a public study, and wants to reprocess it from the study's original fastq files?

wasade commented 7 years ago

If we perform the processing of the EBI data, then we can assert consistent processing and associate the correct processing parameters. Performing additional processing for a (pure) EBI study would require fetching the raw sequence data again, but that's reasonably fast.
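
One way to picture the "fetch, process, discard" idea is the sketch below, which handles one study at a time inside a temporary directory so that raw files never accumulate. It is a minimal illustration under stated assumptions: `process_study`, `fake_fetch`, and `count_reads` are hypothetical names, the download is simulated by writing a tiny FASTQ file, and the study identifiers are made up.

```python
import tempfile
from pathlib import Path


def process_study(study_accession, fetch, summarize):
    """Fetch raw data into a temporary directory, summarize it, discard it.

    `fetch(accession, workdir)` and `summarize(workdir)` are hypothetical
    callables supplied by the caller; they are not Qiita functions.
    """
    with tempfile.TemporaryDirectory(prefix=f"{study_accession}_") as tmp:
        workdir = Path(tmp)
        fetch(study_accession, workdir)
        summary = summarize(workdir)
    # The temporary directory, and the raw files in it, are gone at this point.
    return summary


def fake_fetch(accession, workdir):
    # Stand-in for downloading FASTQ files from EBI: writes one 4-line record.
    (workdir / f"{accession}.fastq").write_text("@r1\nACGT\n+\nIIII\n")


def count_reads(workdir):
    # Stand-in for real processing: count FASTQ records (4 lines each) per file.
    return {
        p.name: len(p.read_text().splitlines()) // 4
        for p in sorted(workdir.glob("*.fastq"))
    }


for accession in ["STUDY_A", "STUDY_B"]:  # made-up identifiers
    print(accession, process_study(accession, fake_fetch, count_reads))
```

With this shape, peak temporary storage is bounded by the largest single study, and reprocessing later simply calls `process_study` again with a different `summarize`.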

sjanssen2 commented 7 years ago

I tend to prefer @josenavas's suggestion for the entry point. One more feature request: would it be possible to automatically pull the corresponding publication and fill the study description in Qiita with the paper's abstract?

antgonza commented 7 years ago

@sjanssen2, sure ...

antgonza commented 7 years ago

@wasade, that's an excellent point: we should be able to delete all EBI submitted files (free space) and just download at the time of processing ... worth another issue so we can discuss?

antgonza commented 7 years ago

Just thought of another thing to consider: a public study in EBI isn't necessarily MIxS compliant. What should we put in the missing columns? Currently we use: XXQIITAXX
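
As a rough illustration of the placeholder approach mentioned here, the sketch below fills any absent or empty required column with `XXQIITAXX`. The function `fill_missing_columns` and the column names in the example are assumptions made for illustration; they are neither Qiita's implementation nor the authoritative MIxS checklist.

```python
PLACEHOLDER = "XXQIITAXX"  # the value mentioned above as the current filler


def fill_missing_columns(samples, required_columns, placeholder=PLACEHOLDER):
    """Return copies of the sample records with every required column present.

    A column counts as missing when it is absent or holds an empty value.
    `required_columns` is supplied by the caller; the list used below is an
    illustrative subset, not the real MIxS requirements.
    """
    filled = []
    for sample in samples:
        record = dict(sample)  # copy so the input is left untouched
        for column in required_columns:
            if not record.get(column):
                record[column] = placeholder
        filled.append(record)
    return filled


# Hypothetical sample records pulled from a public study.
samples = [
    {"sample_name": "S1", "collection_timestamp": "2017-10-04"},
    {"sample_name": "S2", "collection_timestamp": ""},
]
required = ["collection_timestamp", "latitude"]  # illustrative column names

for record in fill_missing_columns(samples, required):
    print(record)
```

Keeping the placeholder in a single constant makes it easy to swap in a different value if the discussion settles on one.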

antgonza commented 5 years ago

Merging 2 issues

Moving comments from https://github.com/biocore/qiita/issues/2826 here.

The EBI import code currently executes in Tornado's process space and needs to be moved to a Qiita plugin.

Will assign Den as a co-assignee, once he is added as a contributor.