Closed ericsun2 closed 3 years ago
You can always remove it manually before passing it to the Writer:
schema1 = "".join(schema1.split())
writer = Writer(output, schema1)
If I add this feature to the module, I'd like to keep it optional. Automatically removing whitespace might cause some problems. So something like this:
schema1 = TypeDescription.from_string(schema1, remove_whitespace=True)
writer = Writer(output, schema1)
Considering that both of the solutions would require an extra line, I'm not sure that the latter one would be so beneficial.
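For reference, here is a minimal sketch of the split-and-join workaround wrapped in a helper. The name strip_schema_whitespace is hypothetical and not part of the module. The sketch assumes that no field name in the schema legitimately contains whitespace, since any such whitespace would be removed too.

```python
def strip_schema_whitespace(schema: str) -> str:
    """Remove every whitespace character (spaces, tabs, newlines) from a schema string.

    Hypothetical helper for illustration only; it is not part of the module.
    """
    # str.split() with no arguments splits on any run of whitespace,
    # so joining the pieces back together drops all of it.
    return "".join(schema.split())


# A schema written across several lines for readability.
pretty_schema = """
struct<
    col0: int,
    col1: string,
    col2: struct<col3: int, col4: array<float>>,
    col6: map<string, int>
>
"""

compact_schema = strip_schema_whitespace(pretty_schema)
print(compact_schema)
# prints: struct<col0:int,col1:string,col2:struct<col3:int,col4:array<float>>,col6:map<string,int>>
```

The compact string can then be passed on as before, for example Writer(output, compact_schema).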
Ideally, the underlying ORC C++ library should accept whitespace and newlines.
The split+join approach can be insufficient if we want to add a column COMMENT to the schema text (like Avro).
The schema text could come from a Schema Registry Service and already contain whitespace, newlines, and comments. The actual logic to remove whitespace would still need to parse the ORC/Hive schema text first and then output it without any whitespace.
The C++ library doesn't even support special characters, let alone whitespace, when it comes to parsing schema strings. Is there even a standard way to add comments to an ORC schema? I've never seen one so far.
I'd rather not implement something special that doesn't follow any standards.
Right now, the ORC schema must be specified in a single line without any whitespace, for example:
schema1 = """struct<col0:int,col1:string,col2:struct<col3:int,col4:string,col5:array<float>>,col6:map<string,int>,col7:bigint,col8:boolean,col9:timestamp>"""
Otherwise the module will throw an error. It would be really great to allow whitespace and newlines, for example: