openwdl / wdl

Workflow Description Language - Specification and Implementations
https://www.openwdl.org/
BSD 3-Clause "New" or "Revised" License
775 stars 307 forks source link

Can we have type aliases? #335

Open nh13 opened 5 years ago

nh13 commented 5 years ago

I would like to be able to create type aliases, namely something like this:

...
type PathToFastq = File
type PathToFasta = File
...
task bwa {
    input {
        PathToFastq read1
        PathToFastq read2
        PathToFasta fasta
        ...
    }
    ...
}

What do other folks think? It would make reading the inputs a lot clearer and would allow folks to alias File to their favorite genomic file format (ex. BED, BAM, XLSX ...).

I'd be happy to make a PR; since this is my first time adding to the spec, I wouldn't mind some help figuring out where it would go.

dvoet commented 5 years ago

Hi Nils,

My first reaction is that this actually hinders readability by hiding the real type. It is just one more thing to keep in my head. I would rather people name their inputs more clearly.

One reason I would support aliases is if the aliased type is a bit more complex, i.e. the type itself is hard to read and benefits from a more succinct alias. Another would be if there were some way to attach validation to an alias (e.g. validate that the file is actually a fastq).

geoffjentry commented 5 years ago

On paper, this is what the Struct use case is intended to handle. In @nh13's case, it's just a struct with a single member

nh13 commented 5 years ago

@geoffjentry so you think I should either embed the file description in the variable name (ex. ref_fasta or targets_bed), or do this?

struct PathToFastq {
    File path
}
struct PathToFasta {
    File path
}
...
task bwa {
    input {
        PathToFastq read_one
        PathToFastq read_two
        PathToFastq ref
    }
    command <<<
    bwa mem ~{ref.path} ~{read_one.path} ~{read_two.path}
    ...
    >>>

    ...
}

I mean it's not that ugly, but a type alias would be nicer and more readable in my opinion. Either I have to (a) have a verbose description in the variable name, (b) use a struct like above, or (c) write a lot of comments and hopes someone reads them.

patmagee commented 5 years ago

I think the question here is really what does the type alias hope to accomplish that proper naming of the variable would not? IMO if you are using the type to infer meaning of the varibale to the user then you are not naming your variables properly.

Its also worth noting that the way you are naming the File varibale (PathToFastq) gives the impression that the File is actually a String type and not a File type. To me the naming pattern you suggest actually makes it harder to read, since I would look for a separate reference to a File somewhere else and would not think that the PathToFastq actually refers to a file

nh13 commented 5 years ago

@patmagee I could also alias FastqFile if that makes it easier for the end-user. Path is is a path-to-a-file-or-directory (common in Java/Scala for example), so it is pretty clear it's a file system object. Also, I strongly disagree with statement:

IMO if you are using the type to infer meaning of the varibale (sic) to the user then you are not naming your variables properly.

There's a reason why typed languages, and typing in python, is so widely used. Why are types even in wdl then? Similarly, what are file extensions for? My Excel doc should be named manuscript_nh_apr_version_2_excel? Sorry, I don't buy the argument that simply the variable name needs to better chosen.

I want to know if the input should be a string, integer, float, file, URI, and so on. But I can certainly imagine custom types (ex. non-negative integer) in addition to aliases (this file must be a FASTQ file). I think allowing types to perform their own validation is a valid request, but out of scope here.

nh13 commented 5 years ago

@geoffjentry can you chime in you to use a struct instead (see my example)? It it is a bit ugly. Perhaps other users could chime in on this issue? I'll concede the issue if there isn't sufficient community desire.

patmagee commented 5 years ago

@nh13 types of a variable givce information on how an object / variable can be used. But Not necessarily what it should be used for. Generally when developping even in strongly typed languages, such as java/scala, you use variable names to give context for this case. poorly named variables obfuscate the meaning of code more then poorly named types imo.

I would say if you need something LIKE a type alias, then use a Struct as @geoffjentry suggested. The example you posted would be the appropriate way to accomplish this.

jtratner commented 5 years ago

First, we should rewrite @nh13 's example to be clearer:

...
type Fastq = File
type Fasta = File
...
task bwa {
    input {
        Fastq read1
        Fastq read2
        Fasta fasta
        ...
    }
    ...
}

With that context, I like the extra readability and type safety that you get from naming. We've hit this point where we have type names like logs_tar_gz on some task or config_json which just gets really goofy looking. There's a reason people moved away from hungarian notation: it's extra verbosity for no reason.

There's two concerns I have with this:

  1. Would this break reusability of WDL tasks? (i.e., what happens if you import tasks from a common tools repo - how will you type cast?)
  2. Will this greatly increase engine implementation complexity? How would validation work?

This could be a compile time optimization that isn't checked at runtime.

As long as there's a way to have some kind of backwards compatibility in the parsing library, I think (2) will not be that bad.

DavyCats commented 5 years ago

I strongly agree with @patmagee here, names should indicate what the value of the variable represents and types should only indicate how they can be used. If your naming relies on the type to be understandable, you will have to look up its declaration whenever you encounter it at some other point in your code:

NumberOf samples = 12

[...many lines of code...]

call someTask {
    input:
        samples=samples 
        # Me when I see this: "Oh, this task recieves an array of samples.",
        # but really it's just given an Int! I probably should just have called it
        # numberOfSamples.
}

In addition, these custom type names will make it harder for a third party to read you wdl files:

input {
    Fastq reads
    # Is is this a File, a String or a Struct?
    # Do I now have to look for a struct definition or a type alias?

    # vs

    File fastq
    # Ah, cool, a File representing a Fastq!
}

I also wonder how this will interact with structs, this looks like it might create some ambiguity in the syntax.

Just as a note, typing has many benefits without clearifying what a variable represents. It can speed up run/compile time, as you only need to check whether the type is approriate for a certain operation, not what the type actually is. I image it also simplifies how variables (and the underlying memory management) are handeled by the compiler/interpreter/execution-engine. It also lets other developers know what type should be used as parameter to function calls. This greatly reduces the chance of unexpected behavior and errors resulting from functions getting called with types the author of the function didn't expect. It allows IDEs to check and warn you of these kinds of mistakes and suggest functions that can be used for a given variable.
jdidion commented 4 years ago

If the intent is to allow the user to be more specific about the format of a file, then a better approach might be more along the lines of what CWL does with IRIs and the format parameter attribute.

Another approach is what DNAnexus does with "types". Types are essentially a controlled vocabulary that can be defined by each organization. File objects can be tagged with types. For a file object to be a valid input to a specific input parameter, it must satisfy that parameter's type restrictions, which may be a single type or a complex boolean expression, such as Fastq AND Read1.

jdidion commented 9 months ago

Also see #200 for another example were we might want type aliases: defining private variables with very long type signatures.

rhpvorderman commented 9 months ago

I think #200 is more of an example of "maybe you should restructure your workflow" than anything else. In BioWDL we have multiple nested subworkflows, and we never ended up with such a complex type.

For readability it is simply much better to make the types explicit than to alias them. Yes, this leads to some weird definitions at times, but at least it is immediately clear from the definitions what you can do with them.