Closed by pan3793 6 months ago
Thanks for the feedback. Yes, increasing the compression level affects the compression speed, and adding workers may not be able to compensate for it. In my experiments, the improvement is not linear in the number of threads.
Apache Spark recently exposed the `spark.io.compression.zstd.workers` parameter in https://github.com/apache/spark/pull/44172, and we backported this patch to our internal Spark 3.1.2 with zstd-jni-1.5.4-2. However, after rolling this patch out to a production cluster with `spark.io.compression.zstd.workers` set, I observed some Spark tasks hang. I checked the node's load and memory usage; they were high, but still normal for an offline computing cluster.
I may not be able to provide more information, such as the native thread stacks, because this change was reverted and I did not hit this issue during the small-scale benchmark.
UPDATE
It's not a real hang; the task time just becomes extremely long, perhaps related to the data itself. When I change `zstd.level` from 1 to 5 and `zstd.workers` from 0 to 4, compressing the same data goes from ~10 min to ~5 hours, while for most tasks the same zstd configuration change only increases the shuffle write time by ~2x, e.g. from ~5 min to ~10 min.
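For reference, this is roughly what the configuration change looks like in `spark-defaults.conf` (a sketch: the values mirror the experiment above, not a recommendation; whether they pay off depends on your workload, and `spark.io.compression.zstd.workers` is only available with the patch from apache/spark#44172):

```
# Raise the zstd compression level from the default 1 to 5
spark.io.compression.zstd.level    5

# Use 4 zstd worker threads for multi-threaded compression
# (default 0 = single-threaded; requires the backported patch and a recent zstd-jni)
spark.io.compression.zstd.workers  4
```

The same settings can also be passed per job via `--conf` on `spark-submit`.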