Add ability to use the results of an assembly workflow/assembled genome as input to other workflows

apetkau commented 5 years ago

[x] Add ability to associate assembled genome with SequencingObject
[x] Add ability to automatically assemble a genome on upload
[x] Add ability to save an assembly back to a sample - issue #211
[ ] Add ability to store the input assembly genomes within an analysis submission
[ ] Add ability to use assembled genomes within a Galaxy workflow
[ ] Add UI for selecting assembled genomes and submitting a workflow

I could see the overall process of submitting workflows staying exactly the same as we have it now. However, instead of using a SequenceFile as input, we would use an AnalysisSubmission, or specifically an AnalysisAssemblyAnnotation (though we want to keep track of the submission as well). We would pick one of the output files from the AnalysisAssemblyAnnotation to use as input for other workflows (for example, the contigs file with repeats). See the attached screenshot:

[Screenshot: launch-assembly]

To store this information within an AnalysisSubmission we would probably have to create different types of submissions (one that takes as input sequence reads and another that takes as input assemblies).
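
As a rough illustration of the "different types of submissions" idea, here is a minimal sketch. All class names, the workflow name, and the file name are hypothetical and do not mirror the actual IRIDA model; the sketch only shows one parent type with a reads-based and an assembly-based child, where the assembly input keeps track of the submission it came from.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch: these types are illustrative and are not IRIDA's classes.
final class SubmissionTypesSketch {

    /** Fields shared by every kind of submission. */
    static abstract class Submission {
        final String workflowName;

        Submission(String workflowName) {
            this.workflowName = workflowName;
        }

        /** Describes what the workflow will consume. */
        abstract String inputKind();
    }

    /** A submission that takes sequence reads as input (the current behaviour). */
    static final class ReadsSubmission extends Submission {
        final List<String> readFiles;

        ReadsSubmission(String workflowName, List<String> readFiles) {
            super(workflowName);
            this.readFiles = readFiles;
        }

        @Override
        String inputKind() {
            return "reads";
        }
    }

    /** One output file chosen from an earlier assembly, plus the submission that produced it. */
    static final class AssemblyInput {
        final long sourceSubmissionId;
        final String outputFile;

        AssemblyInput(long sourceSubmissionId, String outputFile) {
            this.sourceSubmissionId = sourceSubmissionId;
            this.outputFile = outputFile;
        }
    }

    /** A submission that takes previously assembled genomes as input. */
    static final class AssemblySubmission extends Submission {
        final List<AssemblyInput> assemblies;

        AssemblySubmission(String workflowName, List<AssemblyInput> assemblies) {
            super(workflowName);
            this.assemblies = assemblies;
        }

        @Override
        String inputKind() {
            return "assemblies";
        }
    }

    public static void main(String[] args) {
        // The workflow name and file name are made up for the example.
        Submission s = new AssemblySubmission("phylogeny",
                Arrays.asList(new AssemblyInput(42L, "contigs-with-repeats.fasta")));
        System.out.println(s.workflowName + " consumes " + s.inputKind());
    }
}
```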

Imported from GitLab issue #245. Originally posted on 2015/06/01 08:42AM

tom114 commented 5 years ago

Relies on #211

Imported from GitLab. Originally posted on 2017/06/22 02:26PM

tom114 commented 5 years ago

This is blocked on updating the files data model.

This is sufficiently detailed that I'm going to keep it open.

Imported from GitLab. Originally posted on 2015/12/02 03:28PM Originally posted by Franklin Bristow

tom114 commented 5 years ago

Ok great. That all sounds really good, Aaron. Let me know if you need any assistance with anything.

Imported from GitLab. Originally posted on 2015/06/03 01:33PM Originally posted by Franklin Bristow

apetkau commented 5 years ago

Yes, I was only concerned about newly uploaded data. With about 10,000 samples now, running them all would clog up our cluster for weeks. We can selectively run assemblies on older data if it is requested.

For a quick estimate, looking through the assemblies we already have, it's taking an average of about 4 hours per assembly. Our cluster should be big enough to run all the samples from a MiSeq run at once (about 30), so it should take only about 4 hours per MiSeq run to finish all the assemblies. Of course, some will take longer and some will fail, but even that isn't too bad.
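
For the existing backlog, the same numbers can be turned into a rough estimate. This is only a sketch under stated assumptions: it treats "about 30 at once" as a hard concurrency limit (the source only says the cluster can handle at least one MiSeq run at a time), assumes every assembly takes the 4-hour average, and assumes nothing else is queued. The class and method names are made up for illustration.

```java
// Back-of-the-envelope estimate; every input is an assumption taken from the figures above.
final class BacklogEstimateSketch {

    /** Number of rounds needed when at most `concurrent` assemblies run at once. */
    static long batches(long samples, long concurrent) {
        return (samples + concurrent - 1) / concurrent; // ceiling division
    }

    /** Total wall-clock hours if every round takes `hoursPerAssembly`. */
    static long totalHours(long samples, long concurrent, long hoursPerAssembly) {
        return batches(samples, concurrent) * hoursPerAssembly;
    }

    public static void main(String[] args) {
        long hours = totalHours(10_000, 30, 4);
        System.out.println(hours + " hours, or " + (hours / (24.0 * 7)) + " weeks");
    }
}
```

With 10,000 samples that works out to 334 rounds and 1,336 hours, or roughly eight weeks, which is consistent with "weeks" above, whereas a single MiSeq run of about 30 samples fits in one 4-hour round.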

Imported from GitLab. Originally posted on 2015/06/03 01:30PM

tom114 commented 5 years ago

To clarify 1: are you creating something to run assemblies on already uploaded data, or are you just writing something that will submit newly uploaded data? I think we should focus on doing newly uploaded data and then we can write something that would run assemblies on already uploaded data later.

Imported from GitLab. Originally posted on 2015/06/03 01:16PM Originally posted by Franklin Bristow

apetkau commented 5 years ago

Okay, that all makes sense to me. I'm thinking, for the order to work on these:

1. I'll work on being able to automate running assemblies first, since I think this will be the most useful. I know some of the assemblies that get run will fail (too much data, etc.), but that's okay because these can be updated later.
2. I'll then work on allowing users to replace existing assemblies. This will then let people replace any of those assemblies that failed.
3. Finally, I'll add the ability to submit these assembled genomes for additional analyses (which will let us start implementing some of the other analysis types we've discussed, like FFP phylogeny, Eric's signature detection, etc.).

I'm just working on the initial stages of updating the model right now.

Imported from GitLab. Originally posted on 2015/06/03 01:11PM

tom114 commented 5 years ago

I think it would be pretty trivial to add a checkbox or something to indicate whether or not automated assemblies should take place. For a first cut, I'd say we should unconditionally assemble all uploads, then we can add a feature immediately afterward to allow a project manager to choose whether or not to automatically assemble uploads. I agree that it could fill up our cluster, but I'm not sure that's a problem that we should have immediate concerns about. I think NCBI actually assembles all of the data that is uploaded to SRA (or at least they do some kind of analysis automatically), so I don't think it's totally unreasonable for us to do it, too. If it becomes a serious problem, then we can remove the automated approach from that file processing chain and consider buying more hardware.
Ok, I agree with restricting assignment to individual assemblies only; let's just throw collections out of scope entirely and not worry about it. For permissions, we should allow only project managers (people with the canUpdateSample permission on a file) to assign the analysis back to a file. We can update the canReadAnalysisSubmission and canReadAnalysis permissions to check whether the analysis has been assigned to a file and, if so, decide whether the current user can read it based on their permissions on that file. These shouldn't be major changes to the data model or the permissions model.
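
A minimal sketch of the two permission rules described above, assuming a stand-in `FilePermissions` interface. Everything here is hypothetical (the interface, the `Submission` class, and the method names are not the actual IRIDA permission classes); it only shows the shape of the checks: the owner can always read, other users can read once the analysis is assigned to a file they can read, and only users who can update the sample may assign.

```java
// Hypothetical sketch: the types and method names are illustrative, not IRIDA's API.
final class AssemblyPermissionSketch {

    /** Stand-in for whatever answers file-level permission questions. */
    interface FilePermissions {
        boolean canRead(String user, long fileId);

        boolean canUpdateSample(String user, long fileId);
    }

    /** A submission with an owner and, once assigned back, the file it belongs to. */
    static final class Submission {
        final String owner;
        final Long assignedFileId; // null until the analysis is assigned to a file

        Submission(String owner, Long assignedFileId) {
            this.owner = owner;
            this.assignedFileId = assignedFileId;
        }
    }

    /** Only users allowed to update the sample may assign an analysis back to a file. */
    static boolean canAssignToFile(String user, long fileId, FilePermissions perms) {
        return perms.canUpdateSample(user, fileId);
    }

    /** The owner can always read; others can read if they can read the assigned file. */
    static boolean canReadSubmission(String user, Submission s, FilePermissions perms) {
        if (s.owner.equals(user)) {
            return true;
        }
        return s.assignedFileId != null && perms.canRead(user, s.assignedFileId);
    }

    public static void main(String[] args) {
        FilePermissions perms = new FilePermissions() {
            @Override
            public boolean canRead(String user, long fileId) {
                return user.equals("reader");
            }

            @Override
            public boolean canUpdateSample(String user, long fileId) {
                return user.equals("manager");
            }
        };
        Submission assigned = new Submission("owner", 7L);
        System.out.println(canReadSubmission("reader", assigned, perms));
        System.out.println(canAssignToFile("reader", 7L, perms));
    }
}
```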

For which to do first, they both share so much functionality and changes that I think either is fine to choose.

Imported from GitLab. Originally posted on 2015/06/03 11:46AM Originally posted by Franklin Bristow

apetkau commented 5 years ago

Awesome, that makes sense. Having a service method like findAssembliesForSample would work out. If a pipeline gets loaded up that needs assemblies, we can use it to get a list of assemblies.

For associating an assembly to a pair of files:

Automated: I do like the automated idea. I think this is eventually what we want to move towards so that users don't have to run assemblies for everything they uploaded. Hooking into the FileProcessingChain makes sense to me. However, a few issues I can see with this method:
1. There exist projects in our NGS Archive currently where it does not make sense to ever run assemblies. For example, there exists sequencing data for specific regions of a genome and not the whole genome. This would just crash any automated assemblies. One idea I had was to only trigger automated assemblies on a project-by-project basis (though would this default to on or off?).
2. This has the potential to completely fill up our cluster very quickly.
Manual: This manual method makes sense to me: allow a user to run an assembly and click a "share with sequence file pair" button. I feel that this only makes sense for the regular assembly workflow though, so assembly collections wouldn't have to be supported (they're just there so people can run many assemblies at once before we get all this code in place). However, some issues I can see with this method are:
1. Permissions. In particular, who gets to write an assembly back to a SequenceFilePair? Just anyone? Just the MANAGER of a Project containing the Sample containing the SequenceFilePair? (Can a pair of files belong to multiple Samples?)
2. Permissions on AnalysisSubmission. This currently only has an owner. If they share the results back to a SequenceFilePair, we may have to modify this to allow any user read access? Or any user who has read access to the SequenceFilePair?
3. Having to click "share assembly back to sequence file pair" can get tedious, especially if we have no automated system in place.

Given all these issues, I'm wondering if it makes sense to start out with the Manual version, and work our way to the Automated version? However, all those permission problems will still get in the way.

Imported from GitLab. Originally posted on 2015/06/03 09:38AM

tom114 commented 5 years ago

If the assembly is associated with the file/pair of files, then we have a transitive relationship with the sample. We can write a service method something like findAssembliesForSample that would return a list of assemblies attached to files that are attached to the sample.
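
A minimal sketch of that transitive lookup, using plain in-memory maps in place of the real repositories. The method name `findAssembliesForSample` comes from the comment above; the signature, the map parameters, and the use of strings for assemblies are assumptions made only for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: maps stand in for the sample -> file and file -> assembly relations.
final class AssemblyLookupSketch {

    /** Collects the assemblies attached to every file (or pair) attached to the sample. */
    static List<String> findAssembliesForSample(long sampleId,
            Map<Long, List<Long>> filesBySample,
            Map<Long, List<String>> assembliesByFile) {
        List<String> assemblies = new ArrayList<>();
        // Step 1: sample -> files; step 2: each file -> its assemblies.
        for (Long fileId : filesBySample.getOrDefault(sampleId, Collections.emptyList())) {
            assemblies.addAll(assembliesByFile.getOrDefault(fileId, Collections.emptyList()));
        }
        return assemblies;
    }

    public static void main(String[] args) {
        Map<Long, List<Long>> filesBySample = new HashMap<>();
        filesBySample.put(1L, Arrays.asList(10L, 11L));

        Map<Long, List<String>> assembliesByFile = new HashMap<>();
        assembliesByFile.put(10L, Arrays.asList("assembly-A"));
        assembliesByFile.put(11L, Arrays.asList("assembly-B", "assembly-C"));

        System.out.println(findAssembliesForSample(1L, filesBySample, assembliesByFile));
    }
}
```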

I propose two methods of assigning an assembly to a file/pair:

Automated: When the pair or file is created, automatically run an assembly with the default values. We can hook into the FileProcessingChain to handle creating the submission automatically. The default assembly might just be "good enough" and not need to be changed.
Manual: When the default assembly does need to be changed, then the user can run an assembly (as they already can), but we can add a button to the assembly results pages (somewhere) that says something like "Use this assembly as the default assembly for this file/pair". Clicking the button would update the file/pair to use the assembly on the current page. This solution is complicated a little bit with the assembly collections, but we could say for a first cut that we restrict that functionality to only individual assemblies.
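
The two paths could be sketched roughly as follows. All names here (`FilePair`, `AssemblyRunner`, `onPairCreated`, `useAsDefault`) are hypothetical and are not part of the FileProcessingChain; the sketch also simplifies by treating the launched submission's id as the assembly id, whereas a real workflow would finish asynchronously.

```java
// Hypothetical sketch: FilePair, AssemblyRunner and both methods are illustrative names.
final class DefaultAssemblySketch {

    /** A file or pair of files that can carry one default assembly. */
    static final class FilePair {
        final long id;
        Long defaultAssemblyId; // null until an assembly has been attached

        FilePair(long id) {
            this.id = id;
        }
    }

    /** Stand-in for whatever launches an assembly with default values and returns its id. */
    interface AssemblyRunner {
        long runWithDefaults(long filePairId);
    }

    /** Automated path: a processing step invoked once when the pair is created. */
    static void onPairCreated(FilePair pair, AssemblyRunner runner) {
        pair.defaultAssemblyId = runner.runWithDefaults(pair.id);
    }

    /** Manual path: the "use this assembly as the default" button handler. */
    static void useAsDefault(FilePair pair, long assemblyId) {
        pair.defaultAssemblyId = assemblyId;
    }

    public static void main(String[] args) {
        FilePair pair = new FilePair(1L);
        onPairCreated(pair, filePairId -> 100L + filePairId); // fake runner for the demo
        useAsDefault(pair, 555L); // user later picks a different assembly
        System.out.println("default assembly: " + pair.defaultAssemblyId);
    }
}
```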

How does that sound?

Imported from GitLab. Originally posted on 2015/06/03 09:16AM Originally posted by Franklin Bristow

apetkau commented 5 years ago

So, Matt suggested that we should probably associate assemblies with the exact sequence files used to construct them (so with a SequenceFilePair or a SequenceFile).

For how the files are used after that, I do feel they should also be associated with the Sample in some way. Possibly, a Sample could have/return a Set<AssembledGenome> of all the assemblies associated with any sequence files in that Sample. My reason is that it was requested at one point to be able to associate an assembly with a sample (that's what issue #211 is on). This would then allow us to easily complete this issue by letting users select samples to use for a pipeline and listing (possibly giving the option to choose) which assembly to use for each Sample, similar to how we give the option to choose different sequence files.

What I am still unclear about is how users would associate assemblies with samples right now. In the long run, it makes sense to be automated (similar to FastQC analysis, triggered on upload), but in the short term we may have to let users manually run assemblies and possibly select different parameters. I am going to see if I can discuss it a bit more with Matt.

Imported from GitLab. Originally posted on 2015/06/03 09:06AM

phac-nml / irida

Add ability to use the results of an assembly workflow/assembled genome as input to other workflows #57