openwdl / wdl

Workflow Description Language - Specification and Implementations
https://www.openwdl.org/
BSD 3-Clause "New" or "Revised" License
779 stars 306 forks source link

Automatic Coercions Considered Harmful #373

Open orodeh opened 4 years ago

orodeh commented 4 years ago

Automatic coercions considered harmful

Authors: O. Rodeh with J. Didion (DNAnexus Inc.)

This issue describes a line of thought regarding type checking for WDL programs. It is based on experience acquired while working with DNAnexus customers on their WDL pipelines. The author also wrote a WDL type checker, evaluator, and cloud executor.

Coercions

WDL makes wide ranging use of coercions to convert one type to another. This is, in general, a well accepted idea used in a wide variety of programming languages. At DNAnexus we have run into several instances where WDL workflows errored out because static type checking was too lenient. This resulted in long running expensive customer pipelines failing. The issue is twofold: the turn around time is long, and the cost can run to thousands of dollars.

For example, under the WDL type system, a File is auto-coercible into File?. However, this is not the case in the DNAnexus simpler type-system. If you have a task that takes an Array[File?] and call it with Array[File] this will fail. For example:

version 1.0

task foo {
  input {
    Array[File?] bams
  }
  command {}
}

workflow bar {
  input {
    Array[File] my_files
  }

  call long_expensive_calculation {}

  call foo { input : bams = my_files }

  output {}
}

Workflow bar will fail when it calls foo, after it has spent a long time running an expensive calculation. This is frustrating for the users. We could prevent this, as well as many other errors, by performing stricter type checking. Below is an eclectic list of automatic coercions that appears in GATK pipelines. The GATK pipelines are large real world workflows that our customers expect to execute flawlessly on our platform. The code is a simplified version of the real program lines, stripping a way the context to focus on the main point.

Converting between T and T? occurs automatically

  # convert an optional file-array to a file-array
  Array[File] files = ["a", "b"]
  Array[File]? files2 = files

  # Convert an optional int to an int
  Int a = 3
  Int? b = a

  # An array with a mixture of Int and Int?
  Int c = select_first([1, b])

Converting between strings, integer, and floats "just works".

  # Convert an integer to a string
  String s = 3
  Int z = 1.3

  # converting Array[String] to Array[Int]
  Int d = "3"
  Array[Int] numbers = ["1", "2"]

The branches of a conditional may not resolve to the same type. Here, one side is Array[String], the other is Array[String?].

  # https://github.com/gatk-workflows/gatk4-germline-snps-indels/blob/master/JointGenotyping-terra.wdl#L556                        
  #                                                                                                                                
  Array[String] birds = ["finch", "hawk"]                                                                                          
  Array[String?] mammals = []                                                                                                      
  Array[String?] zoo = if (true) then birds else mammals                                                                           

Since T and T? are interchangeable, we can perform operations like T + T? + T.

  # https://github.com/gatk-workflows/broad-prod-wgs-germline-snps-indels/blob/master/JointGenotypingWf.wdl#L570
  #
  # Adding a String and String?
  String? here = "here"
  String buf = "not " + "invented " + here

  # https://github.com/gatk-workflows/broad-prod-wgs-germline-snps-indels/blob/master/PairedEndSingleSampleWf.wdl#L1207
  #
  # convert an optional int to string.
  Int? max_output = 3                                                                                                              
  String s1 = "MAX_OUTPUT=" + max_output

The following non-intuitive code type checks in WDL (womtool-51.jar) :

Array[String] strings = [1, 1.2, "boo"]                                                                                            
Array[Int] numbers = [1, 1.2, "3.0"]                                                                                               

And then aai type checks too:

Array[Array[Int]] aai = [[1, 2], ["foo"], [1.4, 4.9, 9.3]]                                                                         

Proposal

Explicit coercions are useful and make sense, for example:

File? x = y                                                                                                                        
Array[File] af = ["a.txt", "b.txt"]                                                                                                

However, auto-coercions can lead to unexpected behavior and costly/confusing job failures. We suggest allowing explicit coercions, adding a few necessary stdlib function for type conversion, and disallowing automatic coercions. A few necessary stdlib functions, written a la Scala, are:

T? Some(T)                                                                                                                         
Int toInt(String)                                                                                                                  
String toString(Int|Float)                                                                                                         
File toFile(String)                                                                                                                

In order to select the first non empty integer, as below:

  Int c = select_first([1, b])                                                                                                     

one would write:

Int c = select_first([Some(1), b])                                                                                                 

This would make the typing of select_first accurate; it would only take an array of T?.

Code like:

 # Convert an optional int to an int                                                                                               
  Int a = 3                                                                                                                        
  Int? b = a                                                                                                                       

would be written:

Int? b = Some(a)

A Pythonic way of writing these functions would be to have a single cast function that takes a type and a value and either casts the value to the type or throws an error. For example:

Int i = cast(Int, "hello")                                                                                                         
String s = cast(String, 1)                                                                                                         
File f = cast(File, "foo/bar.txt")                                                                                                 
File err = cast(File, 1)  # throws an error because Int to File casting is not allowed                                             

cast could even handle optional types, which would get rid of the need for Some(), e.g.

String s1 = "foo/bar.txt"                                                                                                          
String? s2 = None                                                                                                                  
File? f1 = cast(File?, s1)  # f1 == Some(File("foo/bar.txt"))                                                                      
File? f2 = cast(File?, s2) # f2 == None                                                                                            

I hope this approach would reduce the number of preventable runtime errors.

rhpvorderman commented 4 years ago

I agree that implicit int -> string and string -> int typecasts should be forbidden.

I am not so sure about the optional stuff. This is already explicit right? the select_all and select_first functions are there to convert optionals into defined variables.

  # convert an optional file-array to a file-array
  Array[File] files = ["a", "b"]
  Array[File]? files2 = files

  # Convert an optional int to an int
  Int a = 3
  Int? b = a

What is the exact problem with this code? In practice (in BioWDL at least) optional values are only used to set optional command line parameters in tasks. So the following works:

call myTask {
    inputs:
         my_required_input = 5
         my_optional_input = 3
}

After your proposal it will have to be:

call myTask {
    inputs:
         my_required_input = 5
         my_optional_input = Some(3)
}

I don't see what this accomplishes except that it makes WDL harder to write.

orodeh commented 4 years ago

Forcing people to be explicit about their coercions does make WDL more verbose. However, it opens the door to using type inference (@cjllanwarne mentioned this a long time ago).

The WDL type system is actually quite complex. We have overloading, size can be called in several different ways. We also have parametric polymorphism, cross and zip work on any type of array. On top of that, we have auto-coercions, which interact badly with the former two features. If we drop them, I -think-, we could do Mindley-Milner type inference which would make WDL less verbose.

A realistic way of doing this is support type-inference inside a task or a workflow, supplemented with explicit coercions when needed.

patmagee commented 4 years ago

Implementing WDL bindings in java for wdl4j has actually casused me to run into a heck of a lot of issues similar to what @orodeh is describing. There some very unintuitive actions which make expressions evaluatable, despite the fact that the LHS and the RHS simply do not match.

Additionally, it complicated by the fact that one could argue that with auto type coercions, as long as there is a path to the target type, you should be able to coerce any LHS to any RHS.

For example in MiniWDL Array[String] can be coerced to String. Whether this is a bit of syntactic sugar for their use case, or actually allowed int he spec is a bit unclear (the rules around type coercion are a bit nebulous in the spec given how important a feature it is).

Given the above one could do something like

Array[String] a = ["A","1.2"]
Int? b =  1 + a # a -> string -> int -> whole thing to ?

This really does not make any sense, and its a bit ambiguous what it will eventually resolve into. Additionally, there is no strict ruels around what is actually coercible to what... (rules were added in the development spec.. but they still are very vagues). This leads to alot of potential issues in executing workflows on different systems.

Its worth noting too that even python uses explicit type conversions through functions like int(x),float(x), str(x) etc.

pgrosu commented 4 years ago

Coercion semantics usually have another requirement called coherence that needs to be satisfied, which is much more difficult to prove. For reference, here one of the papers:

https://link.springer.com/chapter/10.1007%2F3-540-54415-1_70

I have not seen proofs for WDL, but I'm sure the developers must have them somewhere.

Hope it helps, ~p

patmagee commented 4 years ago

@pgrosu your confidence in us is comforting, but I am not to sure you are going to find any sort of formal proofs, or documentation on coherence. If this is an area that you are comfortable in, and could help us flesh out that would be much much appreciated

Gvaihir commented 2 years ago

How about type coercion of the inputs/outputs from the standard library functions? On the one hand - all inputs and outputs must be typed. But I do see some use cases where:

scatter (pair in zip(foo, bar)) {}

I could not find any such case in the spec, and I think this is a bit contradicting the static typing requirement. In general though seems these examples compile and run, but we start to see coercion problems when nesting such functions and applying them to optional types. Like this:

Array[Int] foo
Array[Int] bar
Array[String?] lol

scatter (nested_pair in zip(zip(foo, bar), lol)) {}

In this case we see the problems of coercion for values of lol which were evaluated to null, throwing exceptions at runtime. If we declare and type the outputs - all fine, no exceptions:

Array[Pair[Pair[Int, Int], String?]] arr = zip(zip(foo, bar), lol)
scatter (nested_pair in arr) {}

We can solve this in dxCompiler and allow such scenarios, but isn't this in the realm of automatic type coercion and contradictory to strong/static typing of the language?

Could you provide more explicit guidance on typing the inputs/outputs for the standard library functions in the language spec?