Add List Comprehension to the SPEC

patmagee commented 3 months ago

A common pattern that has evolved in WDL is the need to convert a list of type A into a list of type B. Most programming languages support list comprehension to accomadate this, but WDL has no built in method for it. Coincidentally we have already implemented a similar pattern with ternary operators (if-then-else) but have not gone as far as list comprehension

Currently, you can fake list comprehension using a scatter

List[Sample] samples

scatter(sample in samples){
  String sample_names = sample.name
}

# List[String] sample_names

While this is technically a form of list comprehension, it has a few drawbacks

Additional boilerplate
Assignment to an extra non list variable may be confusing (we should address this in the future)
Cannot be used within expressions

Support List Comprehension Directly

The simplest strategy we can take is to support list comprehension directly within WDL, potentially following the Python syntax

List[Sample] samples
List[String] sample_names = [sample.name for sample in samples]

One unintended consequence (for better or worse) of the above example is that it adds for as a reserved KW to the specification, without introducing a for construct. I think this is an acceptable tradeoff, but if we wanted to keep with existing semantics we could do something

List[String] sample_names = [sample.name scatter(sample in samples)]
List[String] sample_names = [scatter(sample in samples) sample.name]
# maybe there is a different mode of scatter
List[String] sample_names = scatter(sample in samples) { sample.name }

An additional benefit to list comprehension is the ability to easily add filtering

List[String] sample_names = [sample.name for sample in samples if sample.type == "blood"]
List[String] sample_names = [sample.name scatter(sample in samples) if sample == "blood"]
List[String] sample_names = [sample.name scatter(sample in samples if sample == "blood")]
List[String] sample_names = scatter(sample in samples if sample == "blood" ) { sample.name }

patmagee commented 3 months ago

It does not resemble python in the same way, But I personally kind of like [sample.name scatter(sample in samples if sample == "blood")], because it would potentially allow the following pattern:


# Filtered scatters without reassignment!
scatter(sample in sample if sample == "blood"){
  call foo
  call bar
  call biz
}

jdidion commented 3 months ago

I like adding the filtering syntax to scatter. You could do:

Array[String] samples = ["blood", "tissue", "brain"]
scatter (sample in samples) {
  if (sample == "blood") {
    String sample_type = sample
  }
}

But then you end up with Array[String?] sample_type = ["blood", None, None], whereas

scatter (sample in samples if sample == "blood") {
  String sample_type = sample
}

would (presumably) yield Array[String] sample_type = ["blood"]. It's nice to not have to then select_all.

As for list comprehension, I'm supportive. I'd suggest dropping the parens and just using scatter as an alias for for. Or we could introduce for as a language keyword that would essentially be an alias for scatter.

List[String] sample_names = [sample.name scatter sample in samples if sample.type == "blood"]

or

List[String] sample_names = [sample.name for sample in samples if sample.type == "blood"]

jdidion commented 3 months ago

I think the scatter keyword carries a connotation of the computation being distributed (even though that's not specified as a requirement of scatter - the engine has the option of executing it locally). So it might be a good idea to add the for keyword with the explicit requirement that for expressions are evaluated locally.

patmagee commented 3 months ago

@jdidion although the implication with scatter is that the computations WITHIN the scatter are executed remotely but not the scatter itself

DavyCats commented 3 months ago

If this filtering functionality gets added to the scatter, is there really still a point in adding a separate list comprehension?

scatter(sample in samples if sample.include){String sample_names = sample.name}

and

Array[String] sample_names = [sample.name for sample in samples if sample.include]

would have the same effect.

Don't get me wrong, I like list comprehensions, but in this case it seems like it's just adding more syntax for people to learn. While the same effect can be achieved with already existing (except for the filtering) and equally concise syntax.

Regarding the distinction between scatter and for: I my mind, scatter implies simultaneous/parallel execution (regardless of whether that is done locally or distributed), whilst for implies consecutive execution (ie. the tradition programming for-loop). So every "iteration" of a scatter is independent and disconnected from every other "iteration". While an iteration of a for-loop may be impacted by the previous iterations.

patmagee commented 3 months ago

@DavyCats good question and I would say they are quite distinct, but the utility of the list comprehension is far greater.

Scatters are confined to workflows and introduce an inner scope and any variables set there will be available in the outer scope

List comprehension on the other hand

is concise
usable in tasks
usable within expressions
does not introduce anything within the outer scope

What I just realized is that we already have an implicit iterator foo in bar. So instead of adding new ways to refer to the same operation (ie a for keyword)maybe we just extend that

foo in bar if foo > 3: foo * 2

This can be the base form that fits really well into an existing scatter

scatter(foo in bar if foo > 3: foo * 2) {
  ...
}

Then you can use this in an expression directly

Array[File] files
Array[String] contents = [file in files if size(file, "mb") < 5: read_string(file)]

jdidion commented 3 months ago

@patmagee this could work. I'm not sure about : (which we don't use anywhere else in wdl) vs. {}. E.g.

Array[String] contents = [file in files if size(file, "mb") < 5 { read_string(file) }]

Another thing to consider is an implicit iterator variable, e.g. it in Groovy. We already have input as a keyword, so we could repurpose that as an iterator variable:

Array[String] contents = [read_string(input) in files if size(file, "mb") < 5]

patmagee commented 3 months ago

I wonder if it's better to just bite the bullet and introduce for as a keyword. Having sequential execution has been something that has been requested anyway so it could be a small step towards actual for loops

patmagee commented 3 months ago

I feel like without introducing a more complex concept like map we are pigeonholed into introducing for, only for the sake of simplicity and consistency with user expectation

Array[String] = [ read_string(input) for file in files if size(file, "mb") < 5) ]

I am on the fence of whether we should add a scatter form at all, or keep the List comprehension constrained to the expression. While the fact that we can use it right in a scatter looks nice

# Simple form
scatter(foo in bar if foo > 3){

}

# mutate
scatter(foo * 2 for foo in bar if foo > 3){

}

It inevitably opens the door to a for scatter, that kind of looks weird

for (foo for foo in bar){

}

So it may actually be simpler to just use list comprehension inline. Although there is also the recursion issue again: x in .. b in ..

scatter(foo in [b * 2 for b in bar if b > 3]){

}

jdidion commented 3 months ago

I agree with your last point - not necessary to change scatter syntax if we have list comprehension.

openwdl / wdl

Add List Comprehension to the SPEC #651

Support List Comprehension Directly