teragrep / pth_10

Data Processing Language (DPL) translator for Apache Spark
GNU Affero General Public License v3.0

HDFS save / load to support compression #280

Open nwellingk opened 3 months ago

nwellingk commented 3 months ago

Description: HDFS save should accept an optional compression parameter, e.g. compress=true (default false). HDFS load should decompress automatically if the files are compressed.

Use case or motivation behind the feature request: Some frequently used queries, saved to speed up searches, take up a few gigabytes of disk space. Any reduction in disk usage would be beneficial, and compression would not slow down processing much.

51-code commented 3 months ago

Currently hdfs save / load commands support two formats: CSV and Avro.

Spark seems to compress Avro output with snappy by default, so compression may already be in use. Whether that is actually the case should be investigated.

Unlike Avro, CSV is not compressed by default. A compression option is available, and it should be set to lz4, which is the best of the available codecs.

Lz4 is not available for the Avro format, so snappy will be used there.

It should also be investigated whether the compression option in DataFrameWriter has an effect at the HDFS level.

51-code commented 3 months ago

Local testing confirms that the Avro format does indeed already compress the data. Avro is the default format when no format is specified in the command. Interestingly, the file name does not change at all when snappy compression is used.

CSV format does not do any compression currently, but local tests showed that simply using .option("compression", "lz4") in the DataStreamWriter works.

Spark can read the compressed files automatically without any changes to the hdfs load command in both the csv and avro formats.

51-code commented 3 months ago

Change of plans: the parameter is going to be named "codec". It lets the user choose the compression method, or the value "plain" for no compression. The command should throw an exception if an unsupported format + codec combination is used. The default codec will not be plain: it will be lz4 for csv and snappy for avro.
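
A rough sketch of how the codec rules described above could be checked before anything is handed to Spark. This is not the actual pth_10 implementation: the class name CodecResolver, both method names, and the exact set of codecs accepted per format are assumptions made for illustration. Only the defaults (lz4 for csv, snappy for avro), the "plain" value, and the exception on unsupported combinations come from this thread. The mapping of "plain" to "none" (csv) and "uncompressed" (avro) reflects what Spark's compression option is understood to accept and should be verified against the Spark version in use.

```java
import java.util.Locale;
import java.util.Map;
import java.util.Set;

// Hypothetical helper; names and the supported-codec table are illustrative only.
public final class CodecResolver {
    private static final String PLAIN = "plain";

    // Defaults taken from the thread: lz4 for csv, snappy for avro.
    private static final Map<String, String> DEFAULT_CODEC =
            Map.of("csv", "lz4", "avro", "snappy");

    // Assumed minimal set of allowed codecs per format; more could be added.
    private static final Map<String, Set<String>> SUPPORTED = Map.of(
            "csv", Set.of(PLAIN, "lz4"),
            "avro", Set.of(PLAIN, "snappy"));

    private CodecResolver() {
    }

    // Returns the codec to use, or throws if the format + codec combination is unsupported.
    // A null codec means the user did not give the parameter, so the format default applies.
    public static String resolve(String format, String codec) {
        if (format == null) {
            throw new IllegalArgumentException("format must not be null");
        }
        String fmt = format.toLowerCase(Locale.ROOT);
        Set<String> allowed = SUPPORTED.get(fmt);
        if (allowed == null) {
            throw new IllegalArgumentException("Unsupported format: " + format);
        }
        if (codec == null) {
            return DEFAULT_CODEC.get(fmt);
        }
        String c = codec.toLowerCase(Locale.ROOT);
        if (!allowed.contains(c)) {
            throw new IllegalArgumentException(
                    "Codec '" + codec + "' is not supported for format '" + format + "'");
        }
        return c;
    }

    // Translates the resolved codec into a value for the writer's "compression" option.
    // Assumption: "plain" maps to "none" for csv and "uncompressed" for avro.
    public static String toSparkCompression(String format, String codec) {
        String resolved = resolve(format, codec);
        if (!PLAIN.equals(resolved)) {
            return resolved;
        }
        return "avro".equals(format.toLowerCase(Locale.ROOT)) ? "uncompressed" : "none";
    }
}
```

With something like this, the save command could pass CodecResolver.toSparkCompression(format, codec) to .option("compression", ...) on the writer, and a combination such as avro + lz4 would fail early with an IllegalArgumentException instead of surfacing as an error from Spark.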