Best practices for multiple process inputs

bentsherman commented 1 year ago

Is your feature request related to a problem? Please describe

Howdy folks

I'd like to discuss a common convention I see with nf-core modules, where a process has two separate inputs for e.g. a sample and an index. Here are a few examples I found:

So you have two inputs:

    input:
    tuple val(meta), path(reads)
    tuple val(meta2), path(index)

This convention works fine as long as you have a single index, in which case you can provide the index as a value channel and it will be "broadcast" to every sample, basically an implicit cross product.

But what if you have multiple indices? The process inputs are not really set up to handle this, so you have to do a bit of hacking:

ch_samples = Channel.of( /* ... */ )
ch_indices = Channel.of( /* ... */ )

ch_inputs = ch_samples.combine(ch_indices)
ch_multi = ch_inputs.multiMap { it ->
    samples: it[0..2],
    indices: it[2..4]
}

PROC(ch_multi.samples, ch_multi.indices)

But now you're wondering if multiMap preserves the order of its inputs, and that question leads down a deep rabbit hole. I have now led multiple people through that rabbit hole, and every time it leads me back to the original problem of multiple inputs. It's the reason why I added this note to the docs.

It's not always nf-core modules that are the cause, just "someone else's process that I'm trying to re-use". In any case, I'm hoping that I can broach the subject and spread this best practice to the community. Are people aware of this issue? Have you debated over this convention in the past? If so I would prefer to build on whatever previous discussions were had.

By the way, here's how I think you SHOULD do it:

process PROC {
    input:
    tuple val(meta), path(reads), val(meta2), path(index)

    // ...
}

workflow {
    ch_samples = Channel.of( /* ... */ )
    ch_indices = Channel.of( /* ... */ )

    PROC( ch_samples.combine(ch_indices) )
}

Easy! It works in all cases (one-to-one, many-to-one, many-to-many), and it doesn't require you play fast and loose with your dataflow

Describe the solution you'd like

No response

Describe alternatives you've considered

No response

Additional context

No response

jfy133 commented 1 year ago

There was a loooooooong discussion about this a looooong time ago (one of the first major discussions about the early nf-core guidelines I think).

I think a few people will jump on this later with their arguments about this (e.g. @nvnieuwk and @Emiller88 for example).

However there were two camps, @maxulysse leading the 'best practise' the one you mention, vs the 'separated' one (I think that was led by @drpatelh? ).

If I remember correctly (but I don't guarantee this though), the problem that the 'separated input' channel camp 'won' on was that depending on the module, it could lead to an very long/large number of (possibly optional!) tuple of index files that are reference specific having to be included in that one channel.

I think people didn't like this as while more 'intuitive' (this files belong to this reference), it because very long and unweildly to read and use. Furthermore when you have optional files, you end up having to a lot of channel manipulation anyway to insert the empty lists.

So having it split over multiple lines (i.e. channels) was preferred, as we thought at the time, this could be still dealt with with multiMap correctly... but if there is some dodgy behaviour that says it's not a great idea, we could probably discuss it again.

But I hope someone can correct me if I'm wrong, my memory isn't great atm.

nvnieuwk commented 1 year ago

I really like the flexibility of everything in the same input channel. We just have to figure out a way to keep the input readable for those modules that have a lot of inputs (looking at gatk4/haplotypecaller for example).

Something like this might work? I think it could be a great topic for the upcoming maintainers meeting though :)

input:
    tuple (
        val   meta, 
        path  input, 
        path  input_index,
        path  intervals, 
        path  dragstr_model,
        path  fasta,
        path  fai,
        path  dict,
        path  dbsnp,
        path  dbsnp_tbi
    )

matthdsm commented 1 year ago

I'm also a fan of having everything in one tuple! @bentsherman, do I remember correctly that we are going to be able to have named arguments for modules at some point in the future? That would make everything a whole lot readable IMO.

maxulysse commented 1 year ago

@bentsherman Love your idea, and that was my take on how we should have tackle this.

But we're a community, and we collectively decided to follow @drpatelh's idea on how to deal with it. And I'm actually happy with the current solution. For we, even if it's not perfect, we do have for now a working solution, and even if I like your idea, I wouldn't go for it.

That being said, if we have some more changes in Nextflow, like named input (modules/workflows level), and a nicer way to deal with non existing input (other than passing over empty values), then it would be another deal, and reworking the way we deal with multiple inputs in the modules will be a hill I'm willing to die for.

pinin4fjords commented 1 year ago

I am a reformed user of crazy-long channel tuples (though that was DSL1), and I would not want to encourage my past self to return to that! All kinds of mess happens. Plus I've not had to use the dark arts of multiMap very often.

That said, if joins/combines only happened when the modules were called (so we wouldn't change workflows much, just the module inputs), then I'd be cool with that...

FriederikeHanssen commented 1 year ago

Hey! It looks prettier, but I am relatively terrified of the amount of combine statements we would need to when calling each module and how error prone that might be

bentsherman commented 1 year ago

Thank you all for chiming in. I figured there was some history here, and I understand why you settled on the current approach. While the multiMap solution depends on the fact that Nextflow operators are single-threaded, this implementation detail is very unlikely to change IMO. Also, the nf-core CLI makes it easy to patch modules locally, if e.g. a user wants to take a different approach to process inputs. That being said, maybe we can find some syntax improvements that would satisfy both camps.

Long tuples are a minor aesthetic issue (by the way @nvnieuwk 's suggestion works if you add parens to each element), but optional elements are a larger concern, as they are simply not supported yet.

What I think will sort this all out is better support for record types as we are discussing in https://github.com/nextflow-io/nextflow/issues/2085. Then you wouldn't need to enumerate the tuple elements in the process, just the type. For example:

@ValueObject
class MyRecord {
    Map meta
    Path sample
    Path index1
    Optional<Path> index2
    Optional<Path> index3
}

process TEST {
    input:
    MyRecord in

    """
    echo ${in.meta} ${in.sample} ${in.index1} ${in.index3} ${in.index3}
    """
}

workflow {
    sample = new MySample(
        [id: 'me'],
        file('sample.fastq'),
        file('index1.idx'),
        Optional.empty(),
        Optional.of(file('index3.idx')),
    )
    TEST( sample )
}

Alternatively if we figure out nullable paths (https://github.com/nextflow-io/nextflow/pull/4293) then that would at least solve the problem of optional elements.

I can't offer a simpler alternative to the combination logic because combining is exactly what you are trying to do 😄 Value channels are a shortcut only if you have one index, and each is also a shortcut but I don't believe it is used much by nf-core (and IMO it should be removed anyway).

Would love to hear what you guys think about the above ideas as it pertains to resolving this debate over multiple inputs. Both of them are already on our roadmap, so they will happen eventually

edmundmiller commented 1 year ago

I wish it did a magical join behind the scenes:

process PROC {
    input:
    tuple val(meta), path(reads), path(index)

And

process PROC {
    input:
    tuple val(meta), path(reads)
    tuple val(meta), path(index)

Were equivalent. But I'm not sure whether that might give unintended effects, just like it does right now.

nvnieuwk commented 1 year ago

OOOH I like the record type solution. If we could package these with the modules we can cast the right files to it without having the bloat of one massive input tuple while still having all the benefits. Will it be possible to access these in the workflow scripts if it's defined in the module script?

bentsherman commented 6 months ago

Wanted to circle back on this thread as I think I've found a comprehensive solution to this multiple inputs problem, which goes beyond record types and into DSL3 territory... I have a POC for fetchngs here: https://github.com/nf-core/fetchngs/pull/309

It is way broader than just this thread, but wanted to share it here for any interested folks. The tl;dr, I think we can avoid the issue of multiple input channels (and long record/tuple inputs) by never calling processes directly with channels...

nf-core / modules