nextflow-io / nextflow

A DSL for data-driven computational pipelines
http://nextflow.io
Apache License 2.0
2.77k stars 630 forks source link

Join behaviour non-deterministic if key contains a nextflow.util.ArrayBag object #5187

Closed robsyme closed 2 months ago

robsyme commented 3 months ago

Bug report

When the join operator is joining channels where the join key is a Map, the comparison can fail in a nondeterministic manner.

Inside the join operator, we get the key from the buffer object here:

https://github.com/nextflow-io/nextflow/blob/cc6ec3142de0131b7dbeb8fed38b0e6506e86bdc/modules/nextflow/src/main/groovy/nextflow/extension/JoinOp.groovy#L180

This get operation uses .equals() to compare the argument to the keys in the buffer Map. The Nextflow script below produces keys where:

Because the Map get method uses .equals() to compare objects, the join operator fails to join the elements correctly in some cases.

Expected behavior and actual behavior

The example workflow below correctly outputs two elements in the JOINED: channel only about half the time. You may need to run the workflow multiple times to see the nondeterminism.

#!/usr/bin/env nextflow

workflow {
    Channel.of(
        [[id:"id1"], "value1", "item1"],
        [[id:"id1"], "value2", "item2"],
        [[id:"id2"], "value3", "item3"]
    )
    | groupTuple
    | map { meta, channel_value, list ->
            newMeta = meta + [list: list]
            [newMeta, channel_value.flatten()]
        }
    | view { it -> 
        """
        CH1: ${it} 
            key = ${it[0]}
            hashCode == ${it[0].hashCode()}) 
        """.stripIndent()
    }
    | set { ch1 }

    Channel.of(
        [[id:"id1"], "another1", "item1"],
        [[id:"id1"], "another2", "item2"],
        [[id:"id2"], "another3", "item3"]
    )
    | groupTuple
    | map { meta, channel_value, list ->
            newMeta = meta + [list: list]
            [newMeta, channel_value.flatten()]
        }
    | view { it -> 
        """
        CH2: ${it} 
            key = ${it[0]}
            hashCode == ${it[0].hashCode()}) 
        """.stripIndent()
    }
    | set { ch2 }

    ch1
    | join(ch2)
    | view { result -> "JOINED: ${result}" }
}

I would expect that every time I run the workflow, the final "joined" channel would always return two items:

JOINED: [[id:id1, list:[item1, item2]], [value1, value2], [another1, another2]]
JOINED: [[id:id2, list:[item3]], [value3], [another3]]

However, sometimes the channel only returns one item (usually the first):

JOINED: [[id:id1, list:[item1, item2]], [value1, value2], [another1, another2]]

Note that every time that the final channel only returns one item, the value returned by the hashCode() method on the seemingly-identical keys is different.

Program output

If I add in the following snippet into L178 here:

https://github.com/nextflow-io/nextflow/blob/cc6ec3142de0131b7dbeb8fed38b0e6506e86bdc/modules/nextflow/src/main/groovy/nextflow/extension/JoinOp.groovy#L177-L184

for (key in buffer.keySet()) {
    log.info """
    Comparing $key (${key.getClass()}) to ${item0.keys} (${item0.keys.getClass()})
        left.equals(right) = ${key.equals(item0.keys)}
        left == right      = ${key == item0.keys}
    """.stripIndent()
}

The incorrect output looks like:

CH2: [[id:id1, list:[item1, item2]], [another1, another2]] 
    key = [id:id1, list:[item1, item2]]
    hashCode == 2089343703) 

CH1: [[id:id1, list:[item1, item2]], [value1, value2]] 
    key = [id:id1, list:[item1, item2]]
    hashCode == 2089343703) 

CH2: [[id:id2, list:[item3]], [another3]] 
    key = [id:id2, list:[item3]]
    hashCode == 1991786182) 

CH1: [[id:id2, list:[item3]], [value3]] 
    key = [id:id2, list:[item3]]
    hashCode == 934795724) 

Comparing [[id:id1, list:[item1, item2]]] (class java.util.ArrayList) to [[id:id1, list:[item1, item2]]] (class java.util.ArrayList)
    left.equals(right) = true
    left == right      = true

JOINED: [[id:id1, list:[item1, item2]], [value1, value2], [another1, another2]]

Comparing [[id:id1, list:[item1, item2]]] (class java.util.ArrayList) to [[id:id2, list:[item3]]] (class java.util.ArrayList)
    left.equals(right) = false
    left == right      = false

Comparing [[id:id1, list:[item1, item2]]] (class java.util.ArrayList) to [[id:id2, list:[item3]]] (class java.util.ArrayList)
    left.equals(right) = false
    left == right      = false

Comparing [[id:id2, list:[item3]]] (class java.util.ArrayList) to [[id:id2, list:[item3]]] (class java.util.ArrayList)
    left.equals(right) = false
    left == right      = true
robsyme commented 3 months ago

The error occurs because buffer is a HashMap, so we're storing the keys by their hash, and the hash differs between these identical looking objects.

robsyme commented 3 months ago

I think the problem here is that the key in this map entry: list:[item3] i.e. [item3] is an object of the class nextflow.util.ArrayBag which has unexpected hashing properties.

I tried overriding the hashCode method in the ArrayBag class

@Override
int hashCode() {
    target.hashCode()
}

But this did not affect the inability of the JoinOp class to get(item0.keys) correctly 100% of the time.

robsyme commented 3 months ago

An even more minimal example might be:

bag1 = new nextflow.util.ArrayBag('lunch')
bag2 = new nextflow.util.ArrayBag('lunch')

myMap = [:]
myMap[bag1] = 'sandwich'

found = myMap[bag2]
if(!found) {
    log.info "This hash: ${myMap} does not have key ${bag2}!"
} else {
    log.info "found: $found"
}

... which of course prints

This hash: [[lunch]:sandwich] does not have key [lunch]!
robsyme commented 3 months ago

Adding the hashCode and equals methods to ArrayBag will allow the objects to be used interchangeably as keys in LinkedHashMaps which would resolve the problem. The code is even already ready to go:

https://github.com/nextflow-io/nextflow/blob/cc6ec3142de0131b7dbeb8fed38b0e6506e86bdc/modules/nextflow/src/main/groovy/nextflow/util/ArrayBag.groovy#L74-L107

Any reason why we wouldn't simply uncomment those lines?

I've tested and re-adding these methods to the ArrayBag class resolves the issue.

robsyme commented 3 months ago

Actually, just explicitly delegating to the target works as well (implemented in #5189).

It's unclear to me why the @Delegate annotation is not delegating the equals and hashCode methods. I suspect these are methods provided by a higher interface and the interfaces = false argument to the annotation is preventing them being called.

pditommaso commented 3 months ago

TLDR; groovy by-passes the equals and hashCode method when checking the object identity because it implements its own strategy for collection objects

https://issues.apache.org/jira/browse/GROOVY-9003

robsyme commented 3 months ago

Ah! Thanks! Where would I see how equals and hashCode are (implicitly?) implemented for ArrayBag?

pditommaso commented 3 months ago

If I'm not wrong identity for collection objects is implemented via this method

https://github.com/apache/groovy/blob/a16f1a9ce6291ee6330427d8f0f22a392252f90e/src/main/java/org/codehaus/groovy/runtime/typehandling/DefaultTypeTransformation.java#L821