opendatahub-io-contrib / datamesh-platform


Explore PyIceberg for Data Writes #25

Closed caldeirav closed 1 week ago

caldeirav commented 1 month ago

PyIceberg is a Python implementation for accessing Iceberg tables without the need for a JVM. Using this library would let us write data tables directly from PyArrow dataframes. PyIceberg currently has native support for REST, SQL, Hive, Glue and DynamoDB-based catalogs, so the suggestion is to use the current Hive catalog for testing. This could provide an alternative to Spark (simpler, but less mature).

https://py.iceberg.apache.org/
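
A minimal sketch of that write path, assuming PyIceberg >= 0.6 (where write support landed) and a Hive metastore plus MinIO; the endpoints, credentials, and table name below are placeholders:

```python
import pyarrow as pa
from pyiceberg.catalog import load_catalog
from pyiceberg.exceptions import NoSuchTableError

# Load the existing Hive catalog. The metastore URI and MinIO settings are
# placeholders; substitute the deployment's real endpoints and credentials.
catalog = load_catalog(
    "hive",
    **{
        "uri": "thrift://hive-metastore:9083",
        "s3.endpoint": "http://minio:9000",
        "s3.access-key-id": "admin",
        "s3.secret-access-key": "password",
    },
)

# An in-memory PyArrow table to write.
df = pa.table({
    "id": pa.array([1, 2, 3], type=pa.int64()),
    "name": pa.array(["a", "b", "c"]),
})

# Create the Iceberg table from the Arrow schema on first run, then append
# the dataframe; no Spark and no JVM involved.
try:
    table = catalog.load_table("demo.events")
except NoSuchTableError:
    table = catalog.create_table("demo.events", schema=df.schema)

table.append(df)
```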

caldeirav commented 1 month ago

Also check recipes from the maintainers of Iceberg at https://tabular.io/apache-iceberg-cookbook/pyiceberg-get-started-api/

We are particularly interested in working with pandas: https://tabular.io/apache-iceberg-cookbook/pyiceberg-pandas/ as well as the recently added write support: https://tabular.io/apache-iceberg-cookbook/pyiceberg-writes/

Example repo: https://github.com/tabular-io/docker-spark-iceberg
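
As a rough sketch of the pandas round trip those two recipes describe (the catalog URI, table name, and filter here are illustrative, not from the recipes):

```python
import pandas as pd
import pyarrow as pa
from pyiceberg.catalog import load_catalog

catalog = load_catalog("hive", uri="thrift://hive-metastore:9083")
table = catalog.load_table("demo.events")  # hypothetical table

# Read: push the filter and projection down into the scan, then materialize
# the result as a pandas DataFrame.
pdf = table.scan(
    row_filter="id > 1",
    selected_fields=("id", "name"),
).to_pandas()

# Write: pandas rows go back in through Arrow, since append() takes a
# pyarrow.Table.
new_rows = pd.DataFrame({"id": [4], "name": ["d"]})
table.append(pa.Table.from_pandas(new_rows, preserve_index=False))
```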

CC @jpaulrajredhat @avinashsingh77

jpaulrajredhat commented 1 month ago

@caldeirav I have already tried both PySpark and PyIceberg. As of now, PyIceberg supports limited write functionality but broader query functionality. PySpark is more efficient for data ingestion, for both real-time streaming and batch processing. So we can use PySpark to ingest large volumes of data and PyIceberg for queries, to expose the data product as an API.

The example Git repo that I sent you is based on PySpark; it writes data to Iceberg on MinIO using the Hive metastore and DataFrames. You don't need to write any SQL: PySpark writes to Iceberg through its DataFrame API. PyIceberg does the same, but with limited write features. As of now the PyIceberg community is very small and the project is less mature.
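
For comparison, the DataFrame-based PySpark write described above looks roughly like this; the catalog name, metastore URI, and warehouse path are assumptions to be matched to the cluster, and the Iceberg Spark runtime jar must already be on the classpath:

```python
from pyspark.sql import SparkSession

# Spark session wired to a Hive-backed Iceberg catalog (names and URIs are
# placeholders for the deployment's actual configuration).
spark = (
    SparkSession.builder.appName("iceberg-ingest")
    .config("spark.sql.catalog.hive_cat", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.hive_cat.type", "hive")
    .config("spark.sql.catalog.hive_cat.uri", "thrift://hive-metastore:9083")
    .config("spark.sql.catalog.hive_cat.warehouse", "s3a://warehouse/")
    .getOrCreate()
)

# No SQL needed: build a DataFrame and write it straight to the Iceberg table.
df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "name"])
df.writeTo("hive_cat.demo.events").createOrReplace()

# Subsequent batches can be appended the same way.
df.writeTo("hive_cat.demo.events").append()
```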

jpaulrajredhat commented 1 month ago

@caldeirav Added notebook examples for both PySpark, which runs on a single-node Spark cluster, and PyIceberg, which is plain Python.

You can find both examples on this notebook deployment: https://pyspark-standalone-datamesh-demo.apps.rosa-8grhg.ssnp.p1.openshiftapps.com

- PySpark ingesting data into MinIO through Hive/Iceberg -> spark-iceberg.ipynb
- PyIceberg querying the data from Iceberg using the same Hive catalog -> pyiceberg-query.ipynb
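
As a sketch of the query side (the pyiceberg-query.ipynb pattern), PyIceberg can scan the same Hive-catalog table that Spark ingested into and hand the result to an API layer without Spark; the table and field names here are illustrative:

```python
from pyiceberg.catalog import load_catalog

# Same Hive catalog the Spark ingestion notebook writes through.
catalog = load_catalog("hive", uri="thrift://hive-metastore:9083")
table = catalog.load_table("demo.events")

# Option 1: materialize a filtered projection as an Arrow table for the API.
arrow_table = table.scan(row_filter="id > 1").to_arrow()

# Option 2: register the scan as an in-memory DuckDB table and query with SQL.
con = table.scan().to_duckdb(table_name="events")
print(con.execute("SELECT count(*) FROM events").fetchall())
```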