noirello / pyorc

Python module for Apache ORC file format
Apache License 2.0

Allow whitespace and newline in ORC schema #27

Closed ericsun2 closed 3 years ago

ericsun2 commented 3 years ago

Right now, the ORC schema must be specified in a single line without any whitespace, for example:

schema1 = """struct<col0:int,col1:string,col2:struct<col3:int,col4:string,col5:array<float>>,col6:map<string,int>,col7:bigint,col8:boolean,col9:timestamp>"""

Otherwise the module will throw an error. It would be really great to allow whitespace and newlines, for example:

schema1 = """struct<
  col0:int, 
  col1:string,
  col2:struct<
    col3:int,
    col4:string,
    col5:array<float>>,
  col6:map<string,int>,
  col7:bigint,
  col8:boolean,
  col9:timestamp>"""
noirello commented 3 years ago

You can always remove it manually before passing it to the Writer:

schema1 = "".join(schema1.split())
writer = Writer(output, schema1) 
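
As a quick check of the workaround above, the following sketch confirms that the split/join idiom turns the multi-line schema from the issue into exactly the single-line form. It uses only plain string operations and does not import pyorc; the variable names are illustrative.

```python
# Multi-line schema as written in the issue description.
multi_line_schema = """struct<
  col0:int,
  col1:string,
  col2:struct<
    col3:int,
    col4:string,
    col5:array<float>>,
  col6:map<string,int>,
  col7:bigint,
  col8:boolean,
  col9:timestamp>"""

# Single-line schema from the issue, split across literals only for readability.
single_line_schema = (
    "struct<col0:int,col1:string,col2:struct<col3:int,col4:string,"
    "col5:array<float>>,col6:map<string,int>,col7:bigint,col8:boolean,"
    "col9:timestamp>"
)

# str.split() with no arguments splits on any run of whitespace
# (spaces, tabs, newlines), so joining the pieces drops all of it.
compact_schema = "".join(multi_line_schema.split())

print(compact_schema == single_line_schema)
```

The compact string can then be passed to the Writer as in the two-line snippet above.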

If I add this feature to the module, I'd like to keep it optional. Automatically removing whitespace might cause some problems. So something like this:

schema1 = TypeDescription.from_string(schema1, remove_whitespace=True)
writer = Writer(output, schema1)

Considering that both solutions would require an extra line, I'm not sure the latter would be that beneficial.
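
To illustrate the opt-in idea, here is a minimal sketch of a standalone helper with a `remove_whitespace` flag. The function `prepare_schema` is hypothetical and not part of pyorc; it only mirrors the shape of the proposed option. The last lines show one way unconditional stripping could go wrong, assuming a field name that happens to contain a space.

```python
def prepare_schema(schema_text: str, remove_whitespace: bool = False) -> str:
    """Hypothetical helper (not a pyorc API): optionally strip all whitespace."""
    if remove_whitespace:
        return "".join(schema_text.split())
    # Default: leave the schema text exactly as the caller wrote it.
    return schema_text


pretty = "struct<\n  col0:int,\n  col1:string>"

# Opt-in: whitespace is removed only when explicitly requested.
untouched = prepare_schema(pretty)
compact = prepare_schema(pretty, remove_whitespace=True)

# ASSUMPTION for illustration: a field name containing a space.
# Stripping silently changes the name from "first name" to "firstname".
renamed = prepare_schema("struct<first name:string>", remove_whitespace=True)
print(compact, renamed)
```

Keeping the flag off by default means existing schema strings are never altered without the caller asking for it.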

ericsun2 commented 3 years ago

Ideally, the underlying ORC C++ library should accept whitespace and newlines.

The split+join approach can be insufficient if we want to add column COMMENTs to the schema text (like Avro).

The schema text could come from a Schema Registry Service and already contain whitespace, newlines, and comments. The actual logic to remove whitespace would still need to parse the ORC/Hive schema text first, and then output it without any whitespace.
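
The limitation described above can be sketched as follows. ORC defines no comment syntax for schema strings, so the `--` end-of-line comment marker used here is purely an assumption for illustration, and `strip_comments_and_whitespace` is a hypothetical helper, not a pyorc or ORC function. The sketch shows that plain split+join folds comment text into the schema, whereas removing comments first yields a clean string.

```python
import re


def strip_comments_and_whitespace(schema_text: str) -> str:
    """Hypothetical helper: drop assumed '--' line comments, then all whitespace."""
    # ASSUMPTION: a comment starts with "--" and runs to the end of the line.
    without_comments = re.sub(r"--[^\n]*", "", schema_text)
    return "".join(without_comments.split())


annotated = """struct<
  col0:int,      -- primary key
  col1:string    -- display name
>"""

# Naive split+join keeps the comment words and glues them into the schema.
naive = "".join(annotated.split())

# Removing the assumed comments first leaves only the type description.
cleaned = strip_comments_and_whitespace(annotated)
print(naive)
print(cleaned)
```

A regular expression is enough for this toy comment style; a real schema with quoted names or nested comment markers would need an actual parser, which is the point made above.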

noirello commented 3 years ago

The C++ library doesn't even support special characters, let alone whitespace, when it comes to parsing schema strings. Is there even a standard way to add comments to an ORC schema? I've never seen one so far.

I'd rather not implement something special that doesn't follow any standards.