Closed caldeirav closed 1 week ago
Also check recipes from the maintainers of Iceberg at https://tabular.io/apache-iceberg-cookbook/pyiceberg-get-started-api/
We are particularly interested in working with pandas: https://tabular.io/apache-iceberg-cookbook/pyiceberg-pandas/ as well as the recently added support for writes: https://tabular.io/apache-iceberg-cookbook/pyiceberg-writes/
Example repo: https://github.com/tabular-io/docker-spark-iceberg
CC @jpaulrajredhat @avinashsingh77
@caldeirav I have already tried both PySpark and PyIceberg. For now, PyIceberg supports only limited write functionality but has broader query support. PySpark is the more efficient option for data ingestion, for both real-time streaming and batch processing. So we can use PySpark to ingest large volumes of data and PyIceberg to query it and expose data products as an API.
The example Git repo that I sent you is based on PySpark, which writes data to Iceberg on MinIO using the Hive Metastore and DataFrames. You don't need to write SQL; PySpark uses DataFrames to write data to Iceberg. PyIceberg does the same, but with limited write features. For now the PyIceberg community is very small and the project is less mature.
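A minimal sketch of that DataFrame-based write path, not taken from the example repo. It assumes a Spark catalog (here called `demo`) backed by the Hive Metastore with MinIO as the object store; the helper names, endpoints, warehouse path and table name are all placeholders, and the Iceberg Spark runtime and AWS bundle jars are assumed to be on the classpath already.

```python
def iceberg_spark_conf(catalog, metastore_uri, warehouse, s3_endpoint):
    """Build Spark settings for an Iceberg catalog backed by a Hive Metastore.

    All argument values are deployment-specific placeholders.
    """
    prefix = f"spark.sql.catalog.{catalog}"
    return {
        "spark.sql.extensions": (
            "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions"
        ),
        prefix: "org.apache.iceberg.spark.SparkCatalog",
        f"{prefix}.type": "hive",
        f"{prefix}.uri": metastore_uri,
        f"{prefix}.warehouse": warehouse,
        # S3FileIO talks to MinIO through its S3-compatible endpoint.
        f"{prefix}.io-impl": "org.apache.iceberg.aws.s3.S3FileIO",
        f"{prefix}.s3.endpoint": s3_endpoint,
        f"{prefix}.s3.path-style-access": "true",
    }


def write_rows(rows, columns, table, conf):
    """Write rows to an Iceberg table through the DataFrame API (no SQL)."""
    # Imported lazily so the config helper above works without pyspark installed.
    from pyspark.sql import SparkSession

    builder = SparkSession.builder.appName("iceberg-ingest")
    for key, value in conf.items():
        builder = builder.config(key, value)
    spark = builder.getOrCreate()

    df = spark.createDataFrame(rows, columns)
    # DataFrameWriterV2: creates the table, or replaces it if it already exists.
    # The target namespace is assumed to exist in the catalog.
    df.writeTo(table).using("iceberg").createOrReplace()
```

For example, `write_rows([(1, "a")], ["id", "label"], "demo.db.events", iceberg_spark_conf("demo", "thrift://hive:9083", "s3a://warehouse/", "http://minio:9000"))` would write a one-row table, given a reachable metastore and MinIO. Swapping `createOrReplace()` for `append()` adds rows to an existing table instead.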
@caldeirav I added notebook examples for both: PySpark, which runs on a single-node Spark cluster, and PyIceberg, which is plain Python.
You can find both examples in this notebook deployment: https://pyspark-standalone-datamesh-demo.apps.rosa-8grhg.ssnp.p1.openshiftapps.com PySpark ingesting data into MinIO through hive-iceberg --> spark-iceberg.ipynb. PyIceberg querying the data from Iceberg using the same Hive catalog --> pyiceberg-query.ipynb
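For readers without access to the notebooks, the PyIceberg query side might look roughly like the sketch below. It is not the notebook's code: the helper names are hypothetical, and the metastore URI, MinIO endpoint, credentials and table name are placeholders to be supplied by the caller.

```python
def hive_catalog_properties(metastore_uri, s3_endpoint, access_key, secret_key):
    """Properties for a PyIceberg Hive catalog whose data files live in MinIO."""
    return {
        "type": "hive",
        "uri": metastore_uri,
        "s3.endpoint": s3_endpoint,
        "s3.access-key-id": access_key,
        "s3.secret-access-key": secret_key,
    }


def query_to_pandas(properties, table_name, row_filter=None, columns=("*",), limit=None):
    """Scan an Iceberg table and return the result as a pandas DataFrame."""
    # Imported lazily so the properties helper works without pyiceberg installed.
    from pyiceberg.catalog import load_catalog

    catalog = load_catalog("hive", **properties)
    table = catalog.load_table(table_name)

    scan_args = {"selected_fields": tuple(columns), "limit": limit}
    if row_filter is not None:
        # A filter string such as "amount > 100"; omitted means "all rows".
        scan_args["row_filter"] = row_filter
    return table.scan(**scan_args).to_pandas()
```

A call such as `query_to_pandas(props, "db.events", row_filter="id > 10", limit=100)` returns a DataFrame that an API layer could then serialize, which is the "query to expose the data product" role described above.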
PyIceberg is a Python implementation for accessing Iceberg tables without the need for a JVM. Using this library lets us write data to tables directly from PyArrow tables. PyIceberg currently has native support for REST, SQL, Hive, Glue and DynamoDB-based catalogs, so the suggestion is to use the current Hive catalog for testing. This could provide an alternative to Spark (simpler but less mature).
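As a rough illustration of the JVM-free write path, the sketch below appends a PyArrow table through a PyIceberg catalog. The helper names are hypothetical, and it assumes a PyIceberg release recent enough to provide `create_table_if_not_exists`, to accept a PyArrow schema at table creation, and to support commits on the Hive catalog; older releases may need a different approach.

```python
def open_hive_catalog(metastore_uri, **extra_properties):
    """Connect to the Hive catalog; extra properties carry e.g. S3/MinIO settings."""
    # Imported lazily so append_arrow below can be exercised without pyiceberg.
    from pyiceberg.catalog import load_catalog

    return load_catalog("hive", **{"type": "hive", "uri": metastore_uri, **extra_properties})


def append_arrow(catalog, identifier, arrow_table):
    """Append a PyArrow table to an Iceberg table, creating it on first use."""
    # The Iceberg schema is derived from the PyArrow schema (version-dependent).
    table = catalog.create_table_if_not_exists(identifier, schema=arrow_table.schema)
    # append() adds new data files in a new snapshot; existing rows are kept.
    table.append(arrow_table)
    return table
```

With `pyarrow` installed, something like `append_arrow(open_hive_catalog("thrift://hive:9083"), "db.events", pa.Table.from_pylist([{"id": 1}]))` would be the whole ingestion step, assuming the `db` namespace already exists.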
https://py.iceberg.apache.org/