luben / zstd-jni

JNI binding for Zstd

Stuck when parallel compression is enabled #298

Closed: pan3793 closed this issue 6 months ago

pan3793 commented 8 months ago

Apache Spark recently exposed the spark.io.compression.zstd.workers parameter in https://github.com/apache/spark/pull/44172. We backported this patch to our internal Spark 3.1.2, which uses zstd-jni-1.5.4-2.

However, after rolling this patch out to a production cluster with

```
spark.io.compression.codec=zstd
spark.io.compression.zstd.level=5
spark.io.compression.zstd.workers=4
```
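
For reference, here is a minimal sketch of what those settings translate to at the zstd-jni level. The stream class and setters are zstd-jni's public API, but the wrapping is only illustrative; Spark's actual codec code in the PR above may differ in buffering and defaults:

```java
import com.github.luben.zstd.ZstdOutputStreamNoFinalizer;

import java.io.FileOutputStream;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;

public class ZstdWorkersExample {
    public static void main(String[] args) throws Exception {
        try (OutputStream file = new FileOutputStream("data.zst");
             ZstdOutputStreamNoFinalizer zstd = new ZstdOutputStreamNoFinalizer(file)) {
            zstd.setLevel(5);   // spark.io.compression.zstd.level=5
            zstd.setWorkers(4); // spark.io.compression.zstd.workers=4;
                                // 0 (the default) compresses on the calling thread
            zstd.write("payload to compress".getBytes(StandardCharsets.UTF_8));
        }
    }
}
```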

I observed some Spark tasks hang at:

[screenshot]

I checked the node's load and memory; both were high, but still within the normal range for an offline computing cluster.

I may not be able to provide more information, such as the native thread stacks, because this change has been reverted and I did not hit the issue during the small-scale benchmark.


UPDATE

It's not a real hang; the task time just becomes extremely long, possibly related to the data itself. After changing zstd.level from 1 to 5 and zstd.workers from 0 to 4, compressing the same data went from ~10 minutes to ~5 hours. For most tasks, the same configuration change roughly doubled the shuffle write time, e.g. from ~5 minutes to ~10 minutes.

luben commented 8 months ago

Thanks for the feedback. Yes, increasing the compression level slows down compression, and the extra workers may not be able to compensate for it. In my experiments, the improvement was not linear in the number of threads.
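
For anyone who wants to gauge that scaling on their own data, here is a rough micro-benchmark sketch using zstd-jni's ZstdOutputStreamNoFinalizer. The synthetic input and timing loop are illustrative only; real shuffle data will behave differently:

```java
import com.github.luben.zstd.ZstdOutputStreamNoFinalizer;

import java.io.ByteArrayOutputStream;
import java.util.Random;

public class ZstdWorkerScaling {
    public static void main(String[] args) throws Exception {
        // ~64 MiB of synthetic, moderately compressible data; a stand-in
        // for real shuffle output, whose scaling may differ substantially
        byte[] data = new byte[64 << 20];
        new Random(42).nextBytes(data);
        for (int i = 0; i < data.length; i += 4) data[i] = 0; // inject redundancy

        for (int workers : new int[] {0, 1, 2, 4, 8}) {
            ByteArrayOutputStream sink = new ByteArrayOutputStream();
            long start = System.nanoTime();
            try (ZstdOutputStreamNoFinalizer zstd = new ZstdOutputStreamNoFinalizer(sink)) {
                zstd.setLevel(5);         // match the reported configuration
                zstd.setWorkers(workers); // 0 = single-threaded baseline
                zstd.write(data);
            }
            long ms = (System.nanoTime() - start) / 1_000_000;
            System.out.printf("workers=%d -> %d ms, %d bytes%n", workers, ms, sink.size());
        }
    }
}
```

Plotting time against the worker count on representative data should show how far from linear the speed-up is at a given level.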