Closed rleonid closed 8 years ago
This similarly came up for BAMs and was somewhat addressed but maybe can be reworked. https://github.com/hammerlab/biokepi/issues/26
My issue is that creating ontologies always ends up being wrong in some way. E.g. what's the problem with running an RNA aligner on DNA? Maybe we will want to do it one day.
If there is an actual real failure we want to avoid, then yes, but when it comes to the semantics, we cannot be sure of anything we may want in the future.
Yesterday I was running a somatic variant caller comparing two "normal" samples; there is nothing "wrong" with that.
I'm on board with @smondet on this one, having first thought it might be a nice idea. We probably don't want to get in the business of telling users which tools they can run on which data, as long as the tool accepts the data structure (e.g. we prevent BWA from running on CSVs, sure, but otherwise it can run on RNA BAMs or DNA BAMs or some other kind of future BAM).
I think we can allow both. My interest in this design is in making the code from Biokepi more production ready, but we can have an option type to allow the more experimental workflows.
What would that look like in practice?
type ns = DNA | RNA
type ns_spec = ns option
What about bisulfite sequencing or ribosome footprint profiling? What other values can we imagine ns taking on?
@rleonid how would you envision consuming that information? A fasta, for example, could say that it's a Some DNA; how would the consumers of that data use this information, then? (Since we wouldn't, for example, restrict BWA-MEM to run on DNA only).
@ihodes Wrong username :blush:
@leo hah so sorry—Slack usernames colliding with GitHub usernames…
@iskandr Whatever other sequences we'd imagine, tools that depend on that sequence could be adjusted as needed.
@ihodes Sure, BWA-MEM would ignore it, but Seq2HLA would throw an exception if we pass it (Some DNA) fastq but not for None fastq (assume RNA) or (Some RNA) fastq. We could adapt a GADT too.
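A minimal sketch of how the option-typed annotation described above might be consumed. The record type, the exception, and both tool wrappers are hypothetical names for illustration and are not taken from Biokepi; the returned strings stand in for whatever a real workflow node would produce.

```ocaml
(* Hypothetical sketch: none of these names come from Biokepi itself. *)
type ns = DNA | RNA
type ns_spec = ns option

(* A FASTQ file with an optional note about what was sequenced. *)
type fastq = { path : string; contents : ns_spec }

exception Incompatible_input of string

(* BWA-MEM ignores the annotation and accepts every FASTQ. *)
let bwa_mem (fq : fastq) : string = "bwa-mem on " ^ fq.path

(* Seq2HLA accepts unannotated input (assumed RNA) and RNA input,
   and raises on input explicitly annotated as DNA. *)
let seq2hla (fq : fastq) : string =
  match fq.contents with
  | None | Some RNA -> "seq2hla on " ^ fq.path
  | Some DNA -> raise (Incompatible_input ("seq2hla expects RNA: " ^ fq.path))
```

With this shape, a pipeline author who wants to experiment leaves the FASTQ unannotated, and only explicitly annotated inputs can be rejected.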
Cool, this seems useful.
@leo I guess I'm bothered by how RNA is not an intrinsic property of the FASTQ -- the FASTQ still contains reads in the DNA alphabet. It's more of a conceptual understanding that "these FASTQs came from RNAseq". But RNAseq is just one assay among many! Do we want to differentiate between WES vs. WGS (e.g. if a fusion detection tool expects WGS DNAseq)? Do we want to distinguish long read from short read sequencing (e.g. if an assembly algorithm expects both PacBio and Illumina reads as inputs)? What about amplicon vs. capture libraries (since MarkDuplicates can't be run on data generated by Ion Torrent)? It seems like there are many conceivable tags besides RNA vs. DNA.
Would an extensible variant type be useful to capture your concerns @iskandr?
I agree with @iskandr we're far too ignorant to build a valid ontology there.
@hammer an extensible type means that every pattern matching would end with | _ -> accept-by-default (since we want to allow experimentation).
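To make the point about the catch-all case concrete, here is a rough sketch using OCaml's extensible variants. The tag names and the verdict type are assumptions chosen for illustration, not an existing Biokepi definition.

```ocaml
(* Hypothetical sketch; the tag names are illustrative only. *)
type tag = ..
type tag += DNA | RNA

(* A later module can add assays without touching the definitions above. *)
type tag += Bisulfite | Ribosome_footprint

type verdict = Accept | Reject

(* A match on an open type can never be exhaustive without a wildcard,
   so every consumer ends with a default case; here it accepts. *)
let seq2hla_accepts : tag -> verdict = function
  | DNA -> Reject
  | RNA -> Accept
  | _ -> Accept
```

Any tag added after this function was written, such as Bisulfite, silently falls into the accepting default, which is the trade-off being described.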
@smondet i'm a bit surprised at your rejection of modeling more of the world in the type system rather than less; we're never going to have a complete ontology but does that mean we should punt entirely? Isn't there a way to iteratively refine our ontology as it deepens?
@hammer I want to use the type-system to reject stuff that does not work as soon as possible. But running an RNA aligner on DNA can work and be useful (and, actually, running a DNA aligner on DNA can still fail for mysterious reasons ☺).
@smondet adding a string and an integer in javascript can work and be useful too! allowing the user to supply more information about what's in a file to downstream workflow nodes via the type system seems like a win to me. we can make "unsafe" workflow nodes if we want that can accept all kinds of fastq files while using "safe" ones in our production pipelines.
But running an RNA aligner on DNA can work and be useful
I think @rleonid's proposal accommodates this, no?
@hammer yes, but I feel we have a good understanding of javascript brokenness, whereas we completely revolutionize our (mis)understanding of bioinformatics every day. In just the 2 comments above, @iskandr reminded me of 10 different relevant things that neither I nor anyone else had thought about.
@ihodes yes I'm not against that; I don't know how it would be manageable in a practical way.
@ihodes It's not clear to me why you wouldn't want to try running seq2hla on WES or WGS data. It's not the most common or preferred way but I'm already curious to see how well it would work. Why should the pipeline disallow the experiment?
Truly broken operations are trying to run seq2hla on BAM, BED, BAI, &c files.
@smondet That's a strong claim that no one had thought about how to encode information about the assay that generated the FASTQ into the type system. I had always imagined that information would be available to workflow nodes. Claiming that there is a lot of information and new information that gets added all the time still does not justify not putting some of that information into the type system.
@iskandr note the comment above about "safe" and "unsafe" workflow nodes. I don't see why you wouldn't want some information available that can be explicitly ignored if you want but that can be used when safety is desired.
@iskandr I'm saying exactly that; you might want to run some random data through some random tool; the proposal does not prevent this from happening. Rather, it lets a pipeline writer decide if they want to be "safer". This does not remove any flexibility.
@hammer Using extensible variants to annotate a FASTQ with a collection of tags might work, especially if tag conformity checking is a special "safe" mode for Biokepi. What would it look like for workflow authors to distinguish between safe and unsafe modes?
@ihodes I'm probably misunderstanding what's being discussed then -- where is the choice for a pipeline author? It seems like it's up to all the wrapped tools to decide which FASTQs they accept or reject.
By the way, are we going to add similar meta-data to BAMs? It seems like there's more room for mistakes in mixing e.g. sorted vs. unsorted BAMs.
@iskandr sortedness of bams is already in use (it has a real impact because we can sort only "if necessary")
Bams have an [ RNA | DNA ] tag that we don't use anywhere.
https://github.com/hammerlab/biokepi/blob/master/src/lib/common.ml#L87
@iskandr This could be implemented for any data type, sure.
The way @rleonid described it, a pipeline writer could, for example, annotate their FASTQ as a Some DNA fastq, and a tool could match on this annotation and ensure that it's known to be able to handle DNA fastqs.
If the pipeline writer wanted to run a tool that isn't known to accept a DNA fastq, the fastq would remain unannotated, with the default None fastq type. All tools would match on the None case and allow the tool to (attempt to) run on that data.
This issue is a scratchpad of features and ideas on bioinformatics pipelines.
The pipeline's notion of a fastq should be parameterized by what it contains: DNA or RNA (from cDNA). This could allow us to make sure, at compile time, that we don't run tools on incompatible data.
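One way the compile-time variant of this idea could look is a phantom type parameter on the FASTQ type. This is only a sketch under assumed names: the Fastq module, its constructors, and the tool wrappers are hypothetical and not part of Biokepi.

```ocaml
(* Hypothetical sketch: names are illustrative, not Biokepi's API. *)

(* Phantom tags: they have no values and exist only at the type level. *)
type dna
type rna

module Fastq : sig
  type 'a t
  val dna : string -> dna t
  val rna : string -> rna t
  val path : 'a t -> string
end = struct
  (* The parameter is erased at runtime; a FASTQ is just its path. *)
  type 'a t = string
  let dna p = p
  let rna p = p
  let path p = p
end

(* Accepts a FASTQ with any tag. *)
let bwa_mem (fq : 'a Fastq.t) : string = "bwa-mem on " ^ Fastq.path fq

(* Only type-checks when given an RNA-tagged FASTQ. *)
let seq2hla (fq : rna Fastq.t) : string = "seq2hla on " ^ Fastq.path fq

(* Rejected by the type checker if uncommented:
   let _ = seq2hla (Fastq.dna "tumor.fastq") *)
```

Unlike the option-typed version, this rejects a mismatch before anything runs, but it offers no built-in escape hatch for experiments; one would have to add an explicit re-tagging function.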