hammerlab / biokepi

Bioinformatics Ketrew Pipelines
Apache License 2.0

Pipeline design discussion #172

Closed rleonid closed 8 years ago

rleonid commented 8 years ago

This issue is a scratchpad of features and ideas on bioinformatics pipelines.

The pipeline's notion of a fastq should be parameterized by what it contains: DNA or RNA (from cDNA). This could allow us to make sure, at compile time, that we don't combine incompatible tools and data.
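
One way to read "parameterized by what it contains" is a phantom type parameter on the fastq type. The following is only a sketch under that assumption: the `Fastq` module, its functions, and both tool wrappers are hypothetical names, not Biokepi's API.

```ocaml
(* Hypothetical sketch: none of these names come from Biokepi. *)
module Fastq : sig
  type 'content t                      (* 'content is a phantom parameter *)
  val dna : string -> [ `DNA ] t
  val rna : string -> [ `RNA ] t
  val path : 'content t -> string
end = struct
  type 'content t = string             (* at runtime, just the file path *)
  let dna path = path
  let rna path = path
  let path t = t
end

(* A wrapper that only type-checks when given RNA reads. *)
let rna_only_tool (fq : [ `RNA ] Fastq.t) : string =
  "RNA-only tool on " ^ Fastq.path fq

(* A wrapper that accepts a fastq with any content. *)
let any_aligner (fq : 'content Fastq.t) : string =
  "aligner on " ^ Fastq.path fq

let tumor_rna = Fastq.rna "tumor_rna.fastq"
let normal_dna = Fastq.dna "normal_dna.fastq"

let step1 = rna_only_tool tumor_rna
let step2 = any_aligner normal_dna
(* let bad = rna_only_tool normal_dna     rejected by the type checker *)
```

Because `Fastq.t` is abstract, the only way to obtain an RNA-tagged value is through `Fastq.rna`, so the mismatch on the commented-out line is caught before anything runs.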

arahuja commented 8 years ago

This similarly came up for BAMs and was somewhat addressed but maybe can be reworked. https://github.com/hammerlab/biokepi/issues/26

smondet commented 8 years ago

My issue is that creating ontologies always ends up being wrong in some way. E.g. what's the problem with running an RNA aligner on DNA? Maybe we will want to do it one day.

If there is an actual real failure we want to avoid, then yes, but when it comes to the semantics, we cannot be sure of anything we may want in the future.

Yesterday I was running a somatic variant caller comparing two "normal" samples; there is nothing "wrong" with that.

ihodes commented 8 years ago

I'm on board with @smondet on this one, having first thought it might be a nice idea. We probably don't want to get in the business of telling users which tools they can run on which data, as long as the tool accepts the data structure (e.g. we prevent BWA from running on CSVs, sure, but otherwise it can run on RNA BAMs or DNA BAMs or some other kind of future BAM).

rleonid commented 8 years ago

I think we can allow both. My interest in this design is in making the code from Biokepi more production ready, but we can have an option type to allow the more experimental workflows.

ihodes commented 8 years ago

What would that look like in practice?

rleonid commented 8 years ago

type ns = DNA | RNA
type ns_spec = ns option

iskandr commented 8 years ago

What about bisulfite sequencing or ribosome footprint profiling? What other values can we imagine ns taking on?

ihodes commented 8 years ago

@rleonid how would you envision consuming that information? A fasta, for example, could say that it's a Some DNA; how would the consumers of that data use this information, then? (Since we wouldn't, for example, restrict BWA-MEM to run on DNA only).

leo commented 8 years ago

@ihodes Wrong username :blush:

ihodes commented 8 years ago

@leo hah so sorry—Slack usernames colliding with GitHub usernames…

rleonid commented 8 years ago

@iskandr Whatever other sequences we'd imagine, tools that depend on that sequence could be adjusted as needed.

@ihodes Sure, BWA-MEM would ignore it, but Seq2HLA would throw an exception if we pass it a (Some DNA) fastq, but not for a None fastq (assumed RNA) or a (Some RNA) fastq. We could adopt a GADT too.
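
A minimal sketch of the runtime behaviour described above, reusing the `ns` and `ns_spec` types from earlier in the thread. The `fastq` record and both wrapper functions are hypothetical, and they return descriptive strings rather than real command lines.

```ocaml
(* Types as proposed earlier in the thread. *)
type ns = DNA | RNA
type ns_spec = ns option

(* Hypothetical fastq record, for illustration only. *)
type fastq = { path : string; contents : ns_spec }

(* An aligner wrapper that ignores the annotation entirely. *)
let bwa_mem (fq : fastq) : string = "BWA-MEM on " ^ fq.path

(* An RNA-oriented wrapper: rejects reads annotated as DNA,
   accepts reads annotated as RNA or left unannotated. *)
let seq2hla (fq : fastq) : string =
  match fq.contents with
  | Some DNA -> invalid_arg ("seq2hla: " ^ fq.path ^ " is annotated as DNA")
  | Some RNA | None -> "Seq2HLA on " ^ fq.path
```

With this shape, an experimental workflow opts out of the check simply by leaving `contents` as `None`.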

ihodes commented 8 years ago

Cool, this seems useful.

iskandr commented 8 years ago

@leo I guess I'm bothered by how RNA is not an intrinsic property of the FASTQ -- the FASTQ still contains reads in the DNA alphabet. It's more of a conceptual understanding that "these FASTQs came from RNAseq". But RNAseq is just one assay among many! Do we want to differentiate between WES vs. WGS (e.g. if a fusion detection tool expects WGS DNAseq)? Do we want to distinguish long read from short read sequencing (e.g. if an assembly algorithm expects both PacBio and Illumina reads as inputs)? What about amplicon vs. capture libraries (since MarkDuplicates can't be run on data generated by Ion Torrent)? It seems like there are many conceivable tags besides RNA vs. DNA.

hammer commented 8 years ago

Would an extensible variant type be useful to capture your concerns @iskandr?

smondet commented 8 years ago

I agree with @iskandr we're far too ignorant to build a valid ontology there.

smondet commented 8 years ago

@hammer an extensible type means that every pattern matching would end with | _ -> accept-by-default (since we want to allow experimentation).
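
To illustrate the point about the trailing wildcard, here is a small sketch with OCaml's extensible variants. The `assay` type, its constructors, and `hla_typer_accepts` are made-up names for illustration; which tags a real tool would reject is an assumption.

```ocaml
(* Open type: new assay tags can be added later, in any module. *)
type assay = ..
type assay += RNA_seq | WGS | WES

(* A match on an open type can never be exhaustive, so it has to end
   with a wildcard. Accepting there keeps experimentation possible. *)
let hla_typer_accepts : assay -> bool = function
  | RNA_seq -> true
  | WGS | WES -> false   (* assumed here to be the known-bad inputs *)
  | _ -> true            (* accept-by-default for tags we do not know *)

(* A tag added after the tool wrapper was written. *)
type assay += Bisulfite
```

The wrapper silently accepts `Bisulfite`, which is exactly the accept-by-default behaviour being discussed.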

hammer commented 8 years ago

@smondet i'm a bit surprised at your rejection of modeling more of the world in the type system rather than less; we're never going to have a complete ontology but does that mean we should punt entirely? Isn't there a way to iteratively refine our ontology as it deepens?

smondet commented 8 years ago

@hammer I want to use the type-system to reject stuff that does not work as soon as possible. But running an RNA aligner on DNA can work and be useful (and, actually, running a DNA aligner on DNA can still fail for mysterious reasons ☺).

hammer commented 8 years ago

@smondet adding a string and an integer in javascript can work and be useful too! allowing the user to supply more information about what's in a file to downstream workflow nodes via the type system seems like a win to me. we can make "unsafe" workflow nodes if we want that can accept all kinds of fastq files while using "safe" ones in our production pipelines.

ihodes commented 8 years ago

But running an RNA aligner on DNA can work and be useful

I think @rleonid's proposal accommodates this, no?

smondet commented 8 years ago

@hammer yes, but I feel we have a good understanding of javascript brokenness, whereas we completely revolutionize our (mis)understanding of bioinformatics every day. In just the 2 comments above, @iskandr reminded me of 10 different relevant things that neither I nor anyone else had thought about.

@ihodes yes I'm not against that; I don't know how it would be manageable in a practical way.

iskandr commented 8 years ago

@ihodes It's not clear to me why you wouldn't want to try running seq2hla on WES or WGS data. It's not the most common or preferred way but I'm already curious to see how well it would work. Why should the pipeline disallow the experiment?

Truly broken operations are trying to run seq2hla on BAM, BED, BAI, &c files.

hammer commented 8 years ago

@smondet That's a strong claim that no one had thought about how to encode information about the assay that generated the FASTQ into the type system. I had always imagined that information would be available to workflow nodes. Claiming that there is a lot of information and new information that gets added all the time still does not justify not putting some of that information into the type system.

@iskandr note the comment above about "safe" and "unsafe" workflow nodes. I don't see why you wouldn't want some information available that can be explicitly ignored if you want but that can be used when safety is desired.

ihodes commented 8 years ago

@iskandr I'm saying exactly that; you might want to run some random data through some random tool; the proposal does not prevent this from happening. Rather, it lets a pipeline writer decide if they want to be "safer". This does not remove any flexibility.

iskandr commented 8 years ago

@hammer Using extensible variants to annotate a FASTQ with a collection of tags might work, especially if tag conformity checking is a special "safe" mode for Biokepi. What would it look like for workflow authors to distinguish between safe and unsafe modes?
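
One possible shape for that distinction, purely as a sketch: the workflow author passes an explicit mode when running a tool, and tag conformity is only checked in the safe mode. Every name below (`tag`, `mode`, `tool`, `run`) is hypothetical and nothing here reflects an existing Biokepi interface.

```ocaml
type tag = ..
type tag += RNA_seq | WES | Long_read

(* A fastq carrying a collection of tags; an empty list means unannotated. *)
type fastq = { path : string; tags : tag list }

type mode = Safe | Unsafe

(* Hypothetical tool description: a name plus a predicate over tags. *)
type tool = { name : string; expects : tag -> bool }

(* In Safe mode every tag must satisfy the tool predicate.
   In Unsafe mode the tags are ignored. *)
let run ~mode tool fq : (string, string) result =
  let conforms = List.for_all tool.expects fq.tags in
  match mode with
  | Safe when not conforms ->
    Error (tool.name ^ ": tags on " ^ fq.path ^ " do not conform")
  | Safe | Unsafe -> Ok (tool.name ^ " on " ^ fq.path)

let rna_tool =
  { name = "rna-tool"; expects = (function RNA_seq -> true | _ -> false) }

let exome = { path = "exome.fastq"; tags = [ WES ] }
```

Under these assumptions, `run ~mode:Safe rna_tool exome` returns an `Error`, `run ~mode:Unsafe rna_tool exome` goes ahead, and an untagged fastq passes in either mode.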

iskandr commented 8 years ago

@ihodes I'm probably misunderstanding what's being discussed then -- where is the choice for a pipeline author? It seems like it's up to all the wrapped tools to decide which FASTQs they accept or reject.

By the way, are we going to add similar meta-data to BAMs? It seems like there's more room for mistakes in mixing e.g. sorted vs. unsorted BAMs.

smondet commented 8 years ago

@iskandr sortedness of bams is already in use (it has a real impact because we can sort only "if necessary")

Bams have an [RNA | DNA ] tag that we don't use anywhere.

https://github.com/hammerlab/biokepi/blob/master/src/lib/common.ml#L87
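
For readers who have not looked at that file, the sort-only-if-necessary idea can be sketched roughly as follows. The record, its field names, and the output naming scheme are assumptions made for illustration; they are not the definitions in common.ml.

```ocaml
(* Hypothetical types; not the actual definitions in common.ml. *)
type sorting = Coordinate | Read_name

type bam = { path : string; sorting : sorting option }

(* Returns the input unchanged when it is already sorted as requested.
   Otherwise returns a new value standing for the sorted output. *)
let sort_if_necessary ~(by : sorting) (bam : bam) : bam =
  match bam.sorting with
  | Some s when s = by -> bam
  | Some _ | None ->
    { path = Filename.remove_extension bam.path ^ ".sorted.bam";
      sorting = Some by }
```

Here the metadata has a concrete payoff: a pipeline step can skip work it knows is already done, which is a different motivation from rejecting inputs.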

ihodes commented 8 years ago

@iskandr This could be implemented for any data type, sure.

The way @rleonid described it, a pipeline writer could, for example, annotate their FASTQ as a Some DNA fastq, and a tool could match on this annotation and ensure that it's known to be able to handle DNA fastqs.

If the pipeline writer wanted to run a tool that isn't known to accept a DNA fastq, the fastq would remain unannotated, with the default None fastq type. All tools would match on the None case and (attempt to) run on that data.