[BUG] Unexpected output state when using `toState`

kverstae commented 3 months ago

What happened?

When using useState: ["foo": "output"], I expect the output state to look like this: (<ID>, ["foo": <PATH>])

What I got instead was (<ID>, ["foo": <PATH>, "input": <PATH>]) This is very confusing, since the path in the input-key is actually the input used for that module.

I don't know if this is the expected behavior, but I find it a bit weird that suddenly keys/values from the input state are also present in the output state. This will in most cases not cause any issues, but in our use-case we were joining 2 channels and merging their state like this: first_channel.join(second_channel) | map { it -> [it[0], it[1] + it[2] } Since the state of the second channel suddenly contained an unexpected extra key that also happened to be in the state of the first channel, the value we wanted was suddenly overwritten...

Steps to reproduce

I created a quick repo to showcase the 'issue': https://github.com/kverstae/viash_output_state_showcase

Just run bin/init followed by ./run.sh

Expected behavior

When using toState: ["foo": "output"], I expect the output state to only contain the key "foo" and definitely not keys from the input state.

Relevant log output

No response

Version

OS: Fedora 40
Java Version: 21.0.3
Viash Version: 0.8.6
Nextflow Version: 24.04.3

Possible solution

No response

Confirmation

[X] I have searched the existing issues to make sure this is not a duplicate.
[X] I have provided clear and concise information about the bug.

Additional context

No response

DriesSchaumont commented 2 weeks ago

Hi @kverstae thanks for filing this issue.

I agree that on first use it might seem weird that the output state is created by joining the output from the component into the input state. It is intentional behavour, but it could be documented more clearly. Just for a bit of context, I would like to add that it is possible to pass a closure to toState, which provides maximal flexibility. Here is an example that allows you not to join the state:

toState: {id, result, state -> ["foo": result.output]}

In fact, the hashmap syntax was implemented in order to provide an abbrevated syntax for the closures. For example,toState: ["foo", "output"] is equivalent to

toState: { id, component_result, input_state ->
  def newState = input_state + ["foo": component_result.output]
  return newState
}

Your question comes down to what the behavior is of the abbreviation: setting the state or joining the state. Ideally, the easier and less verbose syntax should be used for the most common occuring scenario (so that it gets the most use). In case of a linear pipeline, where a component only uses the input from a previous component, one only needs to set the state. However, more complex pipelines often require a component from a workflow to use the output from two or more components, in which case you need to store these values somewhere. In our experience, the latter scenario is the more common one, which is why the abbreviated syntax indicates joining. The difference between setting and joining the state was considered and ultimately an opinionated choice was made. I'm going to swap the 'bug' label for 'documentation' here, because we could showcase fromState and toState in a more clear manner.

kverstae commented 2 weeks ago

Thanks for the clear explanation @DriesSchaumont . Things make a lot more sense now!

I agree that this is a sensible default, since I have had many cases where the input for one module was also needed in the next. From the documentation, it just wasn't clear to me what was actually meant by 'state' in the context of toState. I assumed this were just the results of the module, but now I now better :)

viash-io / viash