Extending the standard library

orodeh commented 5 years ago

Assume we want to add an append function. Given an array of strings elements, and a string a , it will return a new array with elements, and a concatenated at the end.

The type would be:

append(Array[String], String) : Array[String]

It can (almost) be implemented as a regular task

version 1.0

task append {
  input {
    Array[String] elements
    String a
  }

  command <<<
  python <<'EOF'
import json

def read_json_file(path):
    with open(path, 'r') as fd:                                                                               
    data = fd.read()                                                                                          
    d = json.loads(data)                                                                                      
    return d                                                                                                  

def write_json_file(path, data):
    with open(path, 'w') as fd:                                                                               
    buf = json.dumps(data)                                                                                    
    fd.write(buf)                                                                                             

# Convert data structures to JSON
elements = read_json_file(~{write_json(elements)})                                                            
a = read_json_file(~{write_json(a)})

# create the result
result = elements + [a]

# serialize into a result file
write_json_file("RESULT.json", result)
EOF
>>>

output {
    Array[String] out = read_json("RESULT.json")                                                              
  }
}

Calling it from a workflow, can look like this:

workflow myWorkflow {
    input {
    Array[String] elements = ["A", "B"]
    String a = "X"
    }

    call append { input: elements = elements, a = a }                                                         

    output {                                                                                                  
    Array[String] result = append.result                                                                      
    }                                                                                                         
}

This almost works in Cromwell version 40. Running it as above results in the errors:

Failed to process task definition 'append' (reason 1 of 2):
Failed to process expression 'write_json(elements)' (reason 1 of 1):
Invalid parameter 'IdentifierLookup(elements)'.
Expected 'Object' but got 'Array[String]'

Failed to process task definition 'append' (reason 2 of 2):
Failed to process expression 'write_json(a)' (reason 1 of 1):
Invalid parameter 'IdentifierLookup(a)'.
Expected 'Object' but got 'String'

The issue is that write_json doesn't work on Array[String] nor String types. By improving the implementation so that it will work, at least for types that have a simple mapping into JSON (Array, String, Int), we would allow writing additional functions in a language of the user's choosing. We would thus avoid the need to add these functions to the standard library.

A potential optimization down the road is running such tasks in an efficient manner by the engine, avoiding the need to launch a job for the evaluation. Language support can be added for this purpose, for example, using function instead of task.

rhpvorderman commented 5 years ago

I strongly oppose this.

One of the reasons I am able to read WDLs of other people is because they cannot have their own functions. Have you ever read a custom Snakemake or Nextflow file? It is a nightmare! Every developer will solve his/her problems in his/her own way. And all this stuff isn't documented of course. Another nightmarish example would be apache airflow. Custom code destroys readability and maintainability like no other functionality in a pipeline framework.

We use WDL all the time, and the times we needed to do some stuff that was not in the standardlib we made a task. This works well enough. It does not happen often.

One big advantage of the stdlib is: there is documentation!. People just don't document their customized function. They will start using customized functions instead of stdlib functions (why read the docs if you can have a customized function right away?) . WDL will become universally more unreadable because non-standard functions are allowed. I think this undermines one of the core strengths of WDL. The fact that you can NOT write your own custom code in it.

DavyCats commented 5 years ago

I agree with @rhpvorderman for the most part. Although I wouldn't say there is a huge difference between writing custom code (or functions) and writing a task containing some (eg.) python code. The biggest difference (I guess) is that having to write a task might dissuade people from doing non-standard things too often, whereas introducing a custom function syntax would promote it. I would prefer to dissuade here.

Also as far as the append functionality is concerned, there is a read_lines() function in the stdlib. So you could probably use a task like this:

task append {
    input {
        Array[String] list
        Sting extraValue
    }
    command <<<
        python <<OEF
        for x in ['~{sep="','" list}']:
            print(x)
        print('~{extraValue}')
        OEF
    >>>
    output {
        Array[String] newList = read_lines(stdout())
    }
}

orodeh commented 5 years ago

@rhpvorderman, this is fair criticism. The alternative is to implement a standard library that handles maps, arrays, pairs, user-defined structs, and recursive structures. Here is what relevant languages have: scala collections stdlib python stdlib

These are large libraries, with significant conformance testing suites. As a practical matter, I can see the Broad supporting this. But I can't see a second implementation, the resources just aren't there.

geoffjentry commented 5 years ago

@orodeh Sorry if I missed it, but how are you proposing the syntax to look?

FWIW I'm skeptical, purely IMO WDL has already gone too far down the road towards general purpose programming language and this would be another big leap further.

rhpvorderman commented 5 years ago

The alternative is to implement a standard library that handles maps, arrays, pairs, user-defined structs, and recursive structures. Here is what relevant languages have: scala collections stdlib python stdlib

I agree with @geoffjentry. Let's not compare WDL to python or scala, because it has a different purpose. I like the fact that the stdlib of WDL is small, so it does not require me as much effort to learn as an entire language.

Can you name any concrete examples of stuff that can't be solved by a task, and for which we would need such a syntax?

orodeh commented 5 years ago

A syntax for implementing a function, that would be very similar to tasks, would:

function append {
  input {
    Array[String] elements
    String a
  }

  output {
    Array[String] out = read_json("RESULT.json")
  }
}

To call it, you would do:

Array[String] result = append(animals, "bonobo")

The kind of method that is missing from the current stdlib is Array.fold_left. For example, if you want to sum the size of an array of files.

rhpvorderman commented 5 years ago

So why not do this in a task? For every complex task that requires scala collections operations: why not use a scala tool/script? Where code is required, you can easily implement it in a task using bash heredoc syntax. WDL cannot convert between yaml and json. So we have [this task]()

task YamlToJson {
    input {
        File yaml
        String outputJson = basename(yaml, "\.ya?ml$") + ".json"
        String dockerTag = "3.13-py37-slim"
    }
    command {
        set -e
        mkdir -p $(dirname ~{outputJson})
        python <<CODE
        import json
        import yaml
        with open("~{yaml}", "r") as input_yaml:
            content = yaml.load(input_yaml)
        with open("~{outputJson}", "w") as output_json:
            json.dump(content, output_json)
        CODE
    }
    output {
        File json = outputJson
    }

    runtime {
        docker: "biowdl/pyyaml:" + dockerTag
    }
}

No need to add a function to the language. I think task already covers this use case, and functions add nothing extra for people who write pipelines.

If such a task is used thousands of times, and everyone has this use case, we can think of adding it to the stdlib. But adding a general function that allows everyone to write incomprehensible WDL is a nightmare from my persepective. Here is how it is going to end if this is implemented:

People will notice this and will use it to solve some thing they can't do with the standard lib
Knowing function they will use that as much as possible. The stdlib functions cannot send e-mail, which is of course a requirement and they know better than the openwdl community and whatnot.
They will put all these functions in a hierarchy of folders and files that import eachother. So you have a nice "library".
They will write pipelines that get published in science and nature. You think: "Hey, I know WDL, let's have a look" only to find an incomprehensible mess of spaghetti code that is impossible to implement in production or research. It is not even possible to guess what the code is doing, because it is obscured by a wall of functions that import functions etc.

THIS is the future I envision for function. And this is why I agree with @geoffjentry and @DavyCats .

Furthermore, Snakemake and Apache Airflow already exist, there you can add custom code to the pipelines. Knowing the horrible effects this has on the readability and maintainability I fled to WDL, which was much nicer. I have strong opinions on this, because not having things such as function is what makes WDL more attractive in my opinion.

orodeh commented 5 years ago

The original problem we are trying to solve is how to extend the standard library. Cataloging the requests we already have, we get about four modules:

Math exponentiation operator (#296) min/max operators (#276)

Array length to work on optional arrays (#234) map over an array (#203) adding and removing elements from arrays (#170) summing up arrays (#137) array slices (#120) contains function for arrays (#117)

Map get list of values (#43)

String contains check (#133)

Clearly, all four categorizes will grow. We'll need string operations, math operations, and data structure methods. As an engine implementer, I would like to build such a WDL standard library without writing a massive amount of code.

Assuming the engine is written in Scala, a way to solve this is to translate WDL expressions to Scala expressions. Then, runtime evaluation reduces to Scala evaluation. For example, if you have a workflow snippet such as:

Map[String, Int] chromosomes = {                                                                                                    
  "1": 259,                                                                                                                         
  "2": 242,                                                                                                                         
  "3": 198,                                                                                                                         
  "4": 190,                                                                                                                         
  "X": 156}                                                                                                                         
Array[String] s = chromosomes.keys

The compiler will convert chromosomes into a Scala Map[String, Int], perform the keys method on the map, resulting in a Vector[String] type.

val chromosomes : Map[String, Int] = Map(                                                                                           
  "1" -> 259,                                                                                                                       
  "2" -> 242,                                                                                                                       
  "3" -> 198,                                                                                                                       
  "4" -> 190,                                                                                                                       
  "X" -> 156)                                                                                                                       
val s : Vector[String] = chromosomes.keys.toVector

Since the expression sub-language is purely functional, it has equivalents in scala. The same can be done for python, including lambda functions (both languages support this).

I hope this will allow using much of the math, string, array, and map operations, without needing to write them ourselves.

DavyCats commented 5 years ago

The question is whether we need full maths and string operations support in wdl. I get that there are some features there that would be useful, but we should remember that wdl was intended to be a workflow language and not a programming language. I honestly feel that, unless some functionality is vital to defining a workflow, we shouldn't add loads of functions to the stdlib.

To quote the openwdl website:

Whether one is an analyst, a programmer, an operator of a production system, or any other sort of user, WDL should be accessible and understandable.

I feel that extending the standard library too much or adding syntax like the one suggested in "map over an array (#203)", would greatly impact understandability for those without a lot of programming experience.

rhpvorderman commented 5 years ago

The original problem we are trying to solve is how to extend the standard library.

Ah, this is where the misunderstanding arises. I don't think a small standard library is a problem. I think it is a blessing.

Cataloging the requests we already have, we get about four modules: [List of PRs]

I don't see how all of these things can not be accomplished with a task. With the exception of #43 I think these PRs are too far-fetched for a Workflow Description Language. As @DavyCats pointed out it is not a Workflow Programming Language.

If you want to use a programming language: use a programming language! WDL allows you all the freedom you could ever want within the command brackets of a task statement.

@orodeh You have not answered my core concern: why can these things not be done in a simple task?

orodeh commented 5 years ago

If you are fine with writing stdlib stuff in tasks, then, a way to do this is outlined at the top of the thread. To recap: (1) Serialize WDL data structures with read_json and write_json. This allows writing data structure manipulation in your language of choice. (2) Mark the task such that the engine may optimize it by running locally, instead of on the cloud.

patmagee commented 5 years ago

@orodeh thank you for putting this thought experiment together! I think this issue really is a defining conversation about what is the difference between WDL and a general purpose programming language.

To me, WDL is and and has always been a Workflow language that provides some high level functionallity for wrapping my low level functions (ie docker, bash scripts etc). While there has been some situations where I have been frustrated to no end due to the lack of certain stdlib functions, this minimalist approach is a blessing in disguise in my opinion. IMO one of the reasons why WDL has caught on is because of its simplicity in writing and documentation. WDL Is easy to read, and easily portable.

The introduction of user defined functions opens a can of worms on multiple fronts. First of all, readability plummets, as was mentioned above. Also we start to shift work OUTSIDE of the scope of the tasks and put them into the function. One of the good things of defining a stdlib is the engine implementer has alot of control over those functions and can ensure efficient usage. I wonder how long before users start to define complex tasks in functions as opposed to in tasks and have the engine perform them. This seems rather dangerous to me in terms of a ballooning memory / cpu requirements.

Additionally, if we were to go this route, I can forsee us quickly getting into the dreaded use case where users want a fully functioning programming language within a function, such is the case with the javascript runtime requirement in CWL. This is quite an onerous requirement and was actually one of the reasons why we never actually implemented our own execution engine at the time (years ago).

I think there is a good balance that we can strive for between functionality and minimalism when it comes to the stdlib without defining a new class of operations called functions. It is clear that there are certain basic engine functions that are missing which would make everyone's life easier. But at the same time we simply do not need ALL of the functionality of a programming language. Peolpe who request more complex features should be kindly pointed to how they can modify or create a task to do what they are trying to accomplish.

patmagee commented 5 years ago

@orodeh I also think buried in this discussion is a very legitimate question about how write_json and read_json works, and if they should be improved. historically, I have had a bit of a hard time using these engine functions, so I would be open to discussing (maybe in a separate thread) how we can optimize these. ESPECIALLY now that we have structs and an ALMOST complete definition for defining struct literals

cjllanwarne commented 5 years ago

Relatedly, I quite like the idea of some kind of "standard library" of tasks, which could presumably do many of the things which people would want from "extension" functions, without the need to re-write them individually each time someone is writing a WDL.

Running a full task just to append to an array sucks, but from an engine/performance point of view I could also see (if it turns out to be important) a small set of "known" tasks be optimized either ad-hoc (eg with a meta hint to "run locally") or ahead of time by knowledgeable engines (eg "I was pre-programmed to know I can achieve the same result as this append task without needing to actually run this command").

The nice thing about doing this "in engine" from my point of view is, people don't need to know what the magic is, it just happens for them. All they'd need to know is where the standard library is and how to import its tasks.

cjllanwarne commented 5 years ago

Just to clarify: when I say "standard library of tasks", I mean being able to write something like import http://openwdl.org/standard_library/arrays/append.wdl at the top of your WDL to import an array appending task, for example

DavyCats commented 5 years ago

@cjllanwarne Over at @BioWDL we have a common.wdl for such things. I'm not against having a standard library of commonly used tasks, although we might want to consider naming it differently as to not confuse the engine functions (currently refered to as the standard library) with these standard wdl tasks.

vdauwera commented 5 years ago

Should we consider a package repository to centralize task libraries? We’re planning to make such a library for GATK and probably a couple of other popular bioinfx tools. Would be nice for people to be able to look for what’s available in a single place.

And maybe that’s just dockstore, but with something to identify task libraries from workflows. Thoughts?

On Wed, May 1, 2019 at 9:46 AM DavyCats notifications@github.com wrote:

@cjllanwarne https://github.com/cjllanwarne Over at @biowdl https://github.com/biowdl we have a common.wdl https://github.com/biowdl/tasks/blob/develop/common.wdl for such things. I'm not against having a standard library of commonly used tasks, although we might want to consider naming it differently as to not confuse the engine functions (currently refered to as the standard library) with these standard wdl tasks.

— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub https://github.com/openwdl/wdl/issues/312#issuecomment-488286237, or mute the thread https://github.com/notifications/unsubscribe-auth/AAU7AEZGZKZSH34HDBHQYNLPTGNKZANCNFSM4HIORIXQ .

-- Geraldine A. Van der Auwera, Ph.D. Associate Director of Outreach and Communications Data Sciences Platform Broad Institute

geoffjentry commented 5 years ago

@orodeh One question I have - I'm interpreting the purpose of the function as a hint to the engine "Hey, you can run this in band if you want", does that sound right? If so, why couldn't that just always be true of a task? (In fact, as far as I'm concerned, that's already the case).

My $0.02 - I already wish the stdlib wasn't as grandiose as it is, but have come to terms with its existence. There's a tipping point somewhere between "easy to grok DSL" and "might as well use a GP language" and IMO we're hovering around that point.

I do quite like the idea of bringing more attention to projects like @biowdl and encouraging people to place their libraries of useful WDL tasks into Dockstore.

orodeh commented 5 years ago

This is exactly what I meant by function --- a hint that the engine can run a task locally. This annotation is needed because, in the general case, the engine can't figure out if a new instance is needed or not. The kinds of tasks we are talking about here involve running an arbitrary evaluation on a bunch of WDL expressions. If you run it in the engine, and the evaluation does not terminate, then the whole engine is stuck.

By the way, a question about a library function for find in arrays was recently posted in the WDL forum. I think this issue is not going away.

DavyCats commented 5 years ago

@orodeh I think runtime attributes would serve this purpose (letting the engine know a task should be run locally) well enough. We could model an attribute after cromwell's backend concept and standardize a Local backend.

patmagee commented 5 years ago

@orodeh I can see the desire to have a task run locally, and I think if we define this as a hint, then it would be fine. My main concerns with that would certain around the performance impact that running a task locally could have. As long as the behaviour was completely opt in able then its not that bad. I personally would favor just starting a new VM entirely even for a small task as we do normally.

As for the find in arrays library function, I have seen this come up once or twice. Like other questions like this, the majority of people try to use WDL as a general purpose programming language and are surprised when there is not that standard functionality. To me, Find in array is not a required function. A lot of these questions may be addressed by proper education around writing WDL files. For this case if they know the exact name of the file, then they likely could have also specified the file separately at some point and this problem would go away

orodeh commented 5 years ago

In light of this discussion, I suggest the following two ideas:

A standard annotation for letting the engine know that a task may be run locally.
A standard library function to convert an arbitrary WDL data structure to JSON. An inverse function to convert properly formatted JSON back to WDL.

Perhaps we can use the existing read_json and write_json stdlib functions for (2). I say this carefully because I thought these actually work on arbitrary WDL, but they currently do not. That may be due to a good reason that I am unaware of.

We could stop here. I think that a topic for future discussion is a limited form of stdlib extension. Functions that do not take lambda expressions, but do perform data structure manipulation.

For example: length to work on optional arrays (#234) adding and removing elements from arrays (#170)

but not: map over an array (#203)

I believe the majority view is that we do not want to support lambda expressions. If so, it would be good to codify.

aednichols commented 5 years ago

For what it's worth @orodeh the write_json issue in Cromwell is known to us and we consider it a deviation from the spec that should be corrected.

aednichols commented 5 years ago

Also TIL that a single literal is valid JSON

Screenshot-2019-5-15 The JSON Validator

patmagee commented 5 years ago

@orodeh I am closing this since it looks like for the time we are decided.

For your first point, #315 should actually provide a mechanism to to let your BE know that the task CAN be run locally. this approach does not enshrine it in the spec, but allows minor optimizations through hints.

Additionally, for your 2nd point, i agree with you that should be supported. IF you would not mind making a PR or clarifying the current spec if this is not the current behaviour.

jtratner commented 5 years ago

Continuing this discussion slightly, I do wish there was a way to communicate to the engine that you're just shuffling files, rather than attempting to re-upload them. (i.e., to split on lanes, etc)

openwdl / wdl

Extending the standard library #312