martindurant / daskberg

dask client for iceberg (super-alpha)
BSD 3-Clause "New" or "Revised" License

Python client for Iceberg

Installation

For now, you need to either clone the repo or install directly from GitHub:

pip install git+https://github.com/martindurant/daskberg

This software is not released and very poorly tested. With luck, you might find it useful.

Quickstart

Let's say you have access to an Iceberg dataset at a known path. The "root" path of an Iceberg dataset is the one above the "metadata/" and "data/" directories. In this quickstart, I will demo with the test data included in this repo, so assume you are in the root directory of the repo.
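
Before pointing the client at a path, the layout described above can be checked with the standard library. This is a minimal sketch, not part of daskberg: `looks_like_iceberg_root` is a hypothetical helper, and it assumes a local filesystem with the conventional "metadata/" and "data/" directories.

```python
from pathlib import Path


def looks_like_iceberg_root(path):
    """Return True if `path` contains both "metadata" and "data" directories.

    Hypothetical helper for illustration only; not part of daskberg.
    """
    root = Path(path)
    return (root / "metadata").is_dir() and (root / "data").is_dir()
```

For the quickstart, you would call it with "./test-data/my_table/" from the root directory of the repo.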

In [13]: import daskberg.ice

In [14]: ORIG_DIR = "/Users/mdurant/temp/warehouse/db/my_table"
In [15]: ice = daskberg.ice.IcebergDataset("./test-data/my_table/", ORIG_DIR)

In [16]: ice.version  # latest version file found
Out[16]: 5

In [17]: ice.schema
Out[17]:
[{'id': 1, 'name': 'name', 'required': False, 'type': 'string'},
 {'id': 2, 'name': 'age', 'required': False, 'type': 'int'},
 {'id': 3, 'name': 'email', 'required': False, 'type': 'string'}]

In [18]: len(ice.snapshots)
Out[18]: 3

In [19]: ice.read()
Out[19]:
Dask DataFrame Structure:
                 name    age   email
npartitions=5
               object  Int32  object
                  ...    ...     ...
...               ...    ...     ...
                  ...    ...     ...
                  ...    ...     ...
Dask Name: read-parquet, 1 graph layer

In [20]: ice.read().compute()
Out[20]:
    name  age              email
0    Bob   20               None
0   John   56  email@email.email
0  Fiona   25               None
0  Roger   25               None
0   Alex   36               None

In [21]: ice.open_snapshot(-1)

In [22]: ice.read().compute()
Out[22]:
    name  age
0    Bob   20
0  Fiona   25
0  Roger   25
0   Alex   36
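
The session above describes `ice.version` as the latest version file found. How daskberg locates that file is not shown here. The following is a sketch that assumes the naming used by Iceberg's file-system (Hadoop) tables, where metadata files are called `v<N>.metadata.json`. `latest_version` is a hypothetical helper, not daskberg's API.

```python
import re
from pathlib import Path

# Assumed naming convention: v1.metadata.json, v2.metadata.json, ...
_VERSION_FILE = re.compile(r"^v(\d+)\.metadata\.json$")


def latest_version(root):
    """Return the highest N among <root>/metadata/v<N>.metadata.json, or None.

    Hypothetical helper for illustration only; not part of daskberg.
    """
    versions = []
    for entry in (Path(root) / "metadata").iterdir():
        match = _VERSION_FILE.match(entry.name)
        if match:
            # Compare as integers so that v10 sorts after v9.
            versions.append(int(match.group(1)))
    return max(versions, default=None)
```

Other files in "metadata/", such as manifests, do not match the pattern and are ignored.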

Some notes:

The data were created with PySpark SQL, following a Dremio community tutorial.

What works

Testing was mostly done with fastparquet, which newly supports schema evolution.
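
The quickstart shows schema evolution in action: the older snapshot has no email column, while the current schema does. The idea can be sketched in plain Python, assuming rows are dicts and that a column absent from an older file is read as null. `conform_rows` is a hypothetical helper and does not reflect how fastparquet or daskberg implement this. Iceberg itself tracks columns by field id, and the sketch matches by name only for brevity.

```python
def conform_rows(rows, schema):
    """Project each row onto the schema's column names, filling gaps with None.

    Hypothetical helper for illustration only.
    """
    names = [field["name"] for field in schema]
    return [{name: row.get(name) for name in names} for row in rows]


# Schema as shown by ice.schema in the quickstart.
schema = [
    {"id": 1, "name": "name", "required": False, "type": "string"},
    {"id": 2, "name": "age", "required": False, "type": "int"},
    {"id": 3, "name": "email", "required": False, "type": "string"},
]

# Rows as they might look in a file written before "email" was added.
old_rows = [{"name": "Bob", "age": 20}, {"name": "Fiona", "age": 25}]

conformed = conform_rows(old_rows, schema)
```

Each entry of `conformed` has all three columns, with email set to None for rows that predate it.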

Missing