kestra-io / kestra

Infinitely scalable, event-driven, language-agnostic orchestration and scheduling platform to manage millions of workflows declaratively in code.
https://kestra.io
Apache License 2.0

`ParquetWriter` should infer schema from ION file in internal storage #1783

Open anna-geller opened 1 year ago

anna-geller commented 1 year ago

The ParquetWriter is currently too difficult to use. When a file is already stored as ION in internal storage, Kestra should infer the schema from it rather than require a schema specification (i.e., the schema should be optional).


tchiotludo commented 1 year ago

In fact, I think that Parquet is only a new encoding format based on the Avro specification, but it seems that other encodings are possible, see this and this.

I'm just not sure which other encodings are used. Do you know what encoding is used when writing Parquet with Python pandas?

anna-geller commented 1 year ago

AFAIK, UTF-8 encoding and Apache Arrow are used behind the scenes.

anna-geller commented 1 year ago

The main issue is that the `schema` property is required in this task, and there is no information about what is expected there or how to use it. I didn't know how to use it because, e.g., when writing a pandas DataFrame to a Parquet file, you don't even have to think about the schema: it is inferred from the DataFrame.
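
To make the request concrete, here is a rough sketch of what schema inference could look like. It is not Kestra's ParquetWriter code; it assumes the ION records have already been deserialized into plain Python dicts, and the names `infer_avro_type` and `infer_schema`, as well as the type mapping, are hypothetical. It only handles flat records with primitive values and produces an Avro-style record schema.

```python
# Hypothetical mapping from Python value types to Avro primitive type names.
_AVRO_TYPES = {
    bool: "boolean",
    int: "long",
    float: "double",
    str: "string",
    bytes: "bytes",
}


def infer_avro_type(value):
    """Return an Avro primitive type name for a single value."""
    if value is None:
        return "null"
    # Exact type lookup, so True/False map to "boolean" rather than "long".
    # Anything unrecognized falls back to "string" (an assumption of this sketch).
    return _AVRO_TYPES.get(type(value), "string")


def infer_schema(records, name="InferredRecord"):
    """Infer an Avro-style record schema from a list of flat dict records."""
    field_types = {}  # field name -> list of observed type names, in order seen
    for record in records:
        for key, value in record.items():
            observed = field_types.setdefault(key, [])
            type_name = infer_avro_type(value)
            if type_name not in observed:
                observed.append(type_name)

    # A field that is absent from at least one record is treated as nullable.
    for key, observed in field_types.items():
        if "null" not in observed and any(key not in r for r in records):
            observed.append("null")

    fields = []
    for key, observed in field_types.items():
        if "null" in observed:
            # Put "null" first in the union, as is conventional for optional fields.
            union = ["null"] + [t for t in observed if t != "null"]
            avro_type = union if len(union) > 1 else "null"
        else:
            avro_type = observed[0] if len(observed) == 1 else observed
        fields.append({"name": key, "type": avro_type})

    return {"type": "record", "name": name, "fields": fields}


# Example records, standing in for rows read from an ION file.
records = [
    {"id": 1, "name": "a", "price": 1.5},
    {"id": 2, "name": None, "price": 2.0, "active": True},
]
schema = infer_schema(records)
```

With something like this as the default, the `schema` property would only be needed when the inferred types are not what the user wants.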