iterative / datachain

AI-data warehouse to enrich, transform and analyze data from cloud storages
https://docs.datachain.ai
Apache License 2.0

Unify `from_json` and `parse_tabular` implementations #545

Open dtulga opened 2 days ago

dtulga commented 2 days ago

The goal of this issue is to unify the existing `from_json` and `from_jsonl` implementations with those of `parse_tabular`, `from_csv`, and `from_parquet`, consolidating dynamic model generation and schema inference across these import functions. Current functionality (such as jmespath support) should be preserved, so the implementations likely cannot be identical, but they should share similar dynamic model generation, schema inference, and so on. Ideally, this would also remove the dependency on `datamodel-code-generator`.
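To make the idea concrete, here is a minimal sketch of what a shared "infer a schema, then generate a model" step could look like for JSON Lines input. It is not datachain's code: `infer_schema`, `build_model`, and `load_jsonl` are hypothetical names, and stdlib dataclasses stand in for whatever model class the real implementation would use. It also assumes every JSON key is a valid Python identifier and does not cover jmespath.

```python
import json
from dataclasses import field, fields, make_dataclass
from typing import Any, Optional


def infer_schema(records: list[dict[str, Any]]) -> dict[str, Any]:
    """Hypothetical helper: map each key to a type observed across all records."""
    observed: dict[str, set[type]] = {}
    nullable: set[str] = set()
    for record in records:
        for key, value in record.items():
            types = observed.setdefault(key, set())
            if value is None:
                nullable.add(key)
            else:
                types.add(type(value))

    schema: dict[str, Any] = {}
    for key, types in observed.items():
        # A key missing from any record is treated as nullable.
        if any(key not in record for record in records):
            nullable.add(key)
        if len(types) == 1:
            py_type: Any = next(iter(types))
        elif types == {int, float}:
            py_type = float  # widen mixed ints and floats to float
        else:
            py_type = Any  # conflicting types, or only nulls were seen
        schema[key] = Optional[py_type] if key in nullable else py_type
    return schema


def build_model(name: str, schema: dict[str, Any]) -> type:
    """Hypothetical helper: generate a dataclass from an inferred schema."""
    # Every field defaults to None so records with missing keys still load.
    specs = [(key, py_type, field(default=None)) for key, py_type in schema.items()]
    return make_dataclass(name, specs)


def load_jsonl(text: str, model_name: str = "Row") -> tuple[type, list[Any]]:
    """Parse JSON Lines text, infer a schema, and instantiate the model per row."""
    records = [json.loads(line) for line in text.splitlines() if line.strip()]
    model = build_model(model_name, infer_schema(records))
    return model, [model(**record) for record in records]


model, rows = load_jsonl('{"id": 1, "score": 0.5}\n{"id": 2, "score": 3, "tag": "x"}\n')
print([f.name for f in fields(model)])
print(rows[1])
```

The point of the split is that a CSV or Parquet reader could feed its own records into the same `infer_schema` and `build_model` steps, leaving only the parsing format-specific.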

dtulga commented 22 hours ago

This documentation page may be helpful in the future, as it covers pyarrow's support for reading JSON: https://arrow.apache.org/docs/python/generated/pyarrow.json.read_json.html

shcheklein commented 22 hours ago

thanks @dtulga !