kwai / blaze

Blazing-fast query execution engine speaks Apache Spark language and has Arrow-DataFusion at its core.
Apache License 2.0
1.3k stars 122 forks

Introduce the zstd codec for native spill #656

Open zuston opened 3 days ago

zuston commented 3 days ago

Is your feature request related to a problem? Please describe.

In the current codebase, the lz4 codec is used for spilling. zstd should be supported as well.

Additional context

I will work on this if the project owners have no objection.

richox commented 2 days ago

i suggest using a property other than spark.io.compression.codec since it is used in broadcast/shuffle where data goes through the network. for local spilling we would like to use a lightweight compression algorithm like lz4/snappy. i prefer a property like blaze.spill.compression.codec, what do you think?
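
To illustrate the suggestion above, here is a minimal Rust sketch of resolving a spill codec from a dedicated property rather than from spark.io.compression.codec. It is only a sketch under assumptions: the `SpillCodec` enum, the `resolve_spill_codec` function, and the lz4 default are hypothetical and are not taken from the Blaze codebase; the property name is the one proposed in this thread.

```rust
use std::collections::HashMap;

/// Codecs mentioned in this thread. This enum is hypothetical.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum SpillCodec {
    Lz4,
    Snappy,
    Zstd,
}

/// Property name proposed in this thread (not confirmed to exist in Blaze).
pub const SPILL_CODEC_KEY: &str = "blaze.spill.compression.codec";

/// Resolve the spill codec from a key/value configuration.
///
/// Only the dedicated spill property is consulted, so the value of
/// spark.io.compression.codec never affects local spilling.
/// Falling back to lz4 when the property is unset is an assumption.
pub fn resolve_spill_codec(conf: &HashMap<String, String>) -> Result<SpillCodec, String> {
    match conf.get(SPILL_CODEC_KEY).map(|v| v.trim().to_ascii_lowercase()) {
        // Property not set: keep the lightweight default.
        None => Ok(SpillCodec::Lz4),
        Some(name) => match name.as_str() {
            "lz4" => Ok(SpillCodec::Lz4),
            "snappy" => Ok(SpillCodec::Snappy),
            "zstd" => Ok(SpillCodec::Zstd),
            // Unknown names are rejected instead of silently ignored.
            other => Err(format!("unsupported spill codec: {}", other)),
        },
    }
}
```

With this shape, a job that sets spark.io.compression.codec to zstd for shuffle would still spill with lz4 unless the spill property is set explicitly.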

richox commented 2 days ago

and have you run any benchmarks with zstd spilling? it will perform worse than lz4/snappy, if i understand correctly.

zuston commented 2 days ago

> i suggest using a property other than spark.io.compression.codec since it is used in broadcast/shuffle where data goes through the network. for local spilling we would like to use a lightweight compression algorithm like lz4/snappy. i prefer a property like blaze.spill.compression.codec, what do you think?

Using a separate option is fine with me.

> and have you run any benchmarks with zstd spilling? it will perform worse than lz4/snappy, if i understand correctly.

Not yet. I'm still reading this part of the code.

zuston commented 2 days ago

And I think we can still reuse the IoCompressionReader/Writer. WDYT? @richox