palantir / spark

Palantir Distribution of Apache Spark
Apache License 2.0

[SPARK-26713][CORE][followup] revert the partial fix in ShuffleBlockFetcherIterator #741

Closed rshkv closed 3 years ago

rshkv commented 3 years ago

This is a clean cherry-pick of https://github.com/apache/spark/pull/25825, back-ported to our 2.x branch.

We've recently seen instances of task threads hanging while waiting for shuffle results. It seems related to SPARK-26713. It also matches what differs between our branch and upstream's latest 2.4.x release.

There is even a race condition with ShuffledRDD + PipedRDD: the ShuffleBlockFetcherIterator is cleaned up at task completion, which hangs the stdin writer thread and leaks memory.
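For readers unfamiliar with this kind of hang, here is a rough, language-neutral model of it. This is not Spark's code: `FetchIterator`, `cleanup`, and `writer_hangs_after_cleanup` are hypothetical names, and the sketch assumes only that a consumer thread keeps pulling from an iterator whose backing queue was emptied by cleanup. A timeout stands in for "would block forever" so the example terminates.

```python
import queue
import threading


class FetchIterator:
    """Hypothetical stand-in for a fetch iterator backed by a results queue."""

    def __init__(self, blocks):
        self._results = queue.Queue()
        self._remaining = len(blocks)
        for block in blocks:
            self._results.put(block)

    def cleanup(self):
        # Runs at task completion: drops buffered results but leaves the
        # expected count untouched, so has_next() keeps returning True.
        while not self._results.empty():
            self._results.get_nowait()

    def has_next(self):
        return self._remaining > 0

    def next(self, timeout=None):
        # With timeout=None this call would block indefinitely after cleanup().
        block = self._results.get(timeout=timeout)
        self._remaining -= 1
        return block


def writer_hangs_after_cleanup(wait_seconds=0.2):
    """Return True if the writer thread finds nothing to read after cleanup."""
    it = FetchIterator(["b0", "b1", "b2"])
    consumed = []

    def stdin_writer():
        while it.has_next():
            try:
                consumed.append(it.next(timeout=wait_seconds))
            except queue.Empty:
                # Stand-in for a thread that would otherwise never wake up.
                consumed.append("HUNG")
                return

    it.cleanup()  # the task completes before the writer drains the iterator
    writer = threading.Thread(target=stdin_writer, daemon=True)
    writer.start()
    writer.join()
    return consumed == ["HUNG"]
```

In this model the writer still believes there is more input, but nothing will ever arrive. Without the timeout the thread, and whatever it references, would stay alive indefinitely.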

It was easier to revert the commit that introduced the first fix and then cherry-pick the combined back-port from https://github.com/apache/spark/pull/25825, rather than taking only the correction.

rshkv commented 3 years ago

Thank you, @LorenzoMartini.