Open nh13 opened 5 years ago
Hi Nils,
My first reaction is that this actually hinders readability by hiding the real type. It is just one more thing to keep in my head. I would rather people name their inputs more clearly.
One reason I would support aliases is if the aliased type is a bit more complex, i.e. the type itself is hard to read and benefits from a more succinct alias. Another would be if there were some way to attach validation to an alias (e.g. validate that the file is actually a fastq).
On paper, this is what the Struct use case is intended to handle. In @nh13's case, it's just a struct with a single member
@geoffjentry so you think I should either embed the file description in the variable name (ex. ref_fasta
or targets_bed
), or do this?
struct PathToFastq {
File path
}
struct PathToFasta {
File path
}
...
task bwa {
input {
PathToFastq read_one
PathToFastq read_two
PathToFastq ref
}
command <<<
bwa mem ~{ref.path} ~{read_one.path} ~{read_two.path}
...
>>>
...
}
I mean it's not that ugly, but a type alias would be nicer and more readable in my opinion. Either I have to (a) have a verbose description in the variable name, (b) use a struct
like above, or (c) write a lot of comments and hopes someone reads them.
I think the question here is really what does the type alias hope to accomplish that proper naming of the variable would not? IMO if you are using the type to infer meaning of the varibale to the user then you are not naming your variables properly.
Its also worth noting that the way you are naming the File
varibale (PathToFastq
) gives the impression that the File
is actually a String
type and not a File
type. To me the naming pattern you suggest actually makes it harder to read, since I would look for a separate reference to a File
somewhere else and would not think that the PathToFastq
actually refers to a file
@patmagee I could also alias FastqFile
if that makes it easier for the end-user. Path
is is a path-to-a-file-or-directory (common in Java/Scala for example), so it is pretty clear it's a file system object. Also, I strongly disagree with statement:
IMO if you are using the type to infer meaning of the varibale (sic) to the user then you are not naming your variables properly.
There's a reason why typed languages, and typing in python, is so widely used. Why are types even in wdl then? Similarly, what are file extensions for? My Excel doc should be named manuscript_nh_apr_version_2_excel
? Sorry, I don't buy the argument that simply the variable name needs to better chosen.
I want to know if the input should be a string, integer, float, file, URI, and so on. But I can certainly imagine custom types (ex. non-negative integer) in addition to aliases (this file must be a FASTQ file). I think allowing types to perform their own validation is a valid request, but out of scope here.
@geoffjentry can you chime in you to use a struct
instead (see my example)? It it is a bit ugly. Perhaps other users could chime in on this issue? I'll concede the issue if there isn't sufficient community desire.
@nh13 types of a variable givce information on how an object / variable can be used. But Not necessarily what it should be used for. Generally when developping even in strongly typed languages, such as java/scala, you use variable names to give context for this case. poorly named variables obfuscate the meaning of code more then poorly named types imo.
I would say if you need something LIKE a type alias, then use a Struct
as @geoffjentry suggested. The example you posted would be the appropriate way to accomplish this.
First, we should rewrite @nh13 's example to be clearer:
...
type Fastq = File
type Fasta = File
...
task bwa {
input {
Fastq read1
Fastq read2
Fasta fasta
...
}
...
}
With that context, I like the extra readability and type safety that you get from naming. We've hit this point where we have type names like logs_tar_gz
on some task or config_json
which just gets really goofy looking. There's a reason people moved away from hungarian notation: it's extra verbosity for no reason.
There's two concerns I have with this:
This could be a compile time optimization that isn't checked at runtime.
As long as there's a way to have some kind of backwards compatibility in the parsing library, I think (2) will not be that bad.
I strongly agree with @patmagee here, names should indicate what the value of the variable represents and types should only indicate how they can be used. If your naming relies on the type to be understandable, you will have to look up its declaration whenever you encounter it at some other point in your code:
NumberOf samples = 12
[...many lines of code...]
call someTask {
input:
samples=samples
# Me when I see this: "Oh, this task recieves an array of samples.",
# but really it's just given an Int! I probably should just have called it
# numberOfSamples.
}
In addition, these custom type names will make it harder for a third party to read you wdl files:
input {
Fastq reads
# Is is this a File, a String or a Struct?
# Do I now have to look for a struct definition or a type alias?
# vs
File fastq
# Ah, cool, a File representing a Fastq!
}
I also wonder how this will interact with structs, this looks like it might create some ambiguity in the syntax.
If the intent is to allow the user to be more specific about the format of a file, then a better approach might be more along the lines of what CWL does with IRIs and the format
parameter attribute.
Another approach is what DNAnexus does with "types". Types are essentially a controlled vocabulary that can be defined by each organization. File objects can be tagged with types. For a file object to be a valid input to a specific input parameter, it must satisfy that parameter's type restrictions, which may be a single type or a complex boolean expression, such as Fastq AND Read1
.
Also see #200 for another example were we might want type aliases: defining private variables with very long type signatures.
I think #200 is more of an example of "maybe you should restructure your workflow" than anything else. In BioWDL we have multiple nested subworkflows, and we never ended up with such a complex type.
For readability it is simply much better to make the types explicit than to alias them. Yes, this leads to some weird definitions at times, but at least it is immediately clear from the definitions what you can do with them.
I would like to be able to create type aliases, namely something like this:
What do other folks think? It would make reading the inputs a lot clearer and would allow folks to alias
File
to their favorite genomic file format (ex. BED, BAM, XLSX ...).I'd be happy to make a PR; since this is my first time adding to the spec, I wouldn't mind some help figuring out where it would go.