teragrep / pth_10

Data Processing Language (DPL) translator for Apache Spark
GNU Affero General Public License v3.0

Add hdfs save/load format parameter with CSV option #244

Closed: eemhu closed this issue 4 months ago

eemhu commented 5 months ago

Description

Allow saving to HDFS in CSV format.

Use case or motivation behind the feature request

Can be used to save teragrep results in a standard format.

Related issues

https://github.com/teragrep/pth_10/issues/132

Additional context

JSON could also be added, but CSV seems to be the higher priority for now. Loading should also be supported via hdfs load.

eemhu commented 5 months ago

Internal pull request submitted for hdfs save/load CSV format support.

eemhu commented 5 months ago

Documentation and support should be added for two cases: loading files that were added to HDFS manually with hdfs load, and using files written by hdfs save without going through hdfs load.

eemhu commented 4 months ago

Also add a HEADER=BOOL parameter to load and save so that the CSV header can be configured. Allow providing a schema in hdfs load. If HEADER=FALSE and no schema is provided, load everything into the _raw column.
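
The load rules in the comment above can be sketched in a few lines of plain Python. This is only an illustration of the decision logic, not pth_10 code: the function name `load_csv` and its parameters are hypothetical, and the case where a header and a schema are both given is not specified in this issue, so the sketch assumes the schema names take precedence and the header row is skipped.

```python
import csv
import io
from typing import Dict, List, Optional


def load_csv(text: str, header: bool,
             schema: Optional[List[str]] = None) -> List[Dict[str, str]]:
    """Hypothetical helper mirroring the proposed hdfs load rules."""
    if not header and schema is None:
        # HEADER=FALSE and no schema: keep every line unparsed in _raw.
        return [{"_raw": line} for line in text.splitlines()]

    rows = list(csv.reader(io.StringIO(text)))
    if header:
        if not rows:
            return []
        # Assumption: an explicit schema overrides the names in the header row.
        names = schema if schema is not None else rows[0]
        data = rows[1:]
    else:
        # HEADER=FALSE with a schema: every row is data, names come from the schema.
        names, data = schema, rows
    return [dict(zip(names, row)) for row in data]


if __name__ == "__main__":
    print(load_csv("a,b\n1,2\n", header=True))
    print(load_csv("1,2\n3,4\n", header=False, schema=["x", "y"]))
    print(load_csv("1,2\n3,4\n", header=False))
```

The fallback to a single `_raw` column means a headerless file with no schema is still loadable, at the cost of leaving the parsing to later steps.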

eemhu commented 4 months ago

Internal PR up for review