We've recently been seeing instances of task threads hanging while waiting for shuffle results. It seems related to SPARK-26713, and it also matches the difference between our branch and upstream's latest 2.4.x release.
There is a race condition with ShuffledRDD + PipedRDD: the ShuffleBlockFetcherIterator is cleaned up at task completion, which hangs the stdin writer thread and leaks memory.
But IIUC the first fix might have introduced a race condition causing hangs, which was later corrected here (this is what we're getting in this PR): https://github.com/apache/spark/pull/25049
It was easier to revert the commit that introduced the first fix and then cherry-pick the combined back-port from here: https://github.com/apache/spark/pull/25825 (as opposed to just taking the correction).
This is a clean cherry-pick of https://github.com/apache/spark/pull/25825, back-ported to our 2.x branch.