Open ion-elgreco opened 9 months ago
I will accept this, but I don't expect to implement this anytime soon because we currently rely on the deltalake
package to do the writing, which cannot integrate nicely with our streaming engine.
> I will accept this, but I don't expect to implement this anytime soon because we currently rely on the deltalake package to do the writing, which cannot integrate nicely with our streaming engine.
That's fine, maybe in the future when Delta-RS can support ADBC it would be easier to build on top of.
@stinodego do you think it's possible with the current code base to return a RecordBatch Generator as output of a streaming engine?
If so, I would happily give it a try (I am touching more and more Rust nowadays, so maybe I can build it myself). I will probably need some pointers on which code paths to look at, since the Pola-rs codebase is so large :D
Actually, this is probably going to be a difficult one to start with... I'll start with some smaller ones first.
Upvote! This feature is getting more relevant as more and more Azure users migrate to OneLake, which uses the Delta table format.
A path forward is to do something similar to what deltalake does: instead of using the pyarrow engine, we write the parquet files with Polars and then just create a write transaction with the private APIs of deltalake. But for this to work, Polars first needs to add all these things:
Without these things it wouldn't be possible : )
I would also like to upvote this
This will be very nice to have.
In fact, `write_delta` might currently have a problem writing large datasets, as I am getting this error:
```
Traceback (most recent call last):
  File "<string>", line 197, in <module>
  File "<string>", line 193, in main
  File "/opt/miniconda/lib/python3.10/runpy.py", line 289, in run_path
    return _run_module_code(code, init_globals, run_name,
  File "/opt/miniconda/lib/python3.10/runpy.py", line 96, in _run_module_code
    _run_code(code, mod_globals, init_globals,
  File "/opt/miniconda/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "preprocess/main.py", line 580, in <module>
    run(args)
  File "preprocess/main.py", line 520, in run
    final_data_stream.write_delta(
  File "/opt/miniconda/lib/python3.10/site-packages/polars/dataframe/frame.py", line 3708, in write_delta
    data = self.to_arrow()
  File "/opt/miniconda/lib/python3.10/site-packages/polars/dataframe/frame.py", line 1224, in to_arrow
    record_batches = self._df.to_arrow()
pyo3_runtime.PanicException: polars' maximum length reached. Consider installing 'polars-u64-idx'.
```
I get this error despite having polars-u64-idx installed, and despite my dataset having fewer than 2^32 rows. When I batch manually it works, but of course that defeats the whole point.
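For reference, the manual batching workaround can be sketched like this. `DataFrame.slice` and `write_delta(mode=...)` are real Polars APIs, but the chunk size and helper names are illustrative:

```python
def chunk_offsets(total_rows: int, chunk_rows: int) -> list[tuple[int, int]]:
    """Compute (offset, length) pairs that cover total_rows in chunk_rows steps."""
    return [(o, min(chunk_rows, total_rows - o)) for o in range(0, total_rows, chunk_rows)]

def write_in_chunks(df, target: str, chunk_rows: int = 1_000_000) -> None:
    """Write a large polars DataFrame to a Delta table chunk by chunk.

    `df` is a polars DataFrame; requires polars and deltalake to be installed.
    """
    for i, (offset, length) in enumerate(chunk_offsets(df.height, chunk_rows)):
        # The first chunk creates the table; later chunks append to it.
        mode = "error" if i == 0 else "append"
        df.slice(offset, length).write_delta(target, mode=mode)
```

This sidesteps the panic by never materializing the whole frame as a single Arrow table, at the cost of one Delta commit per chunk.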
Problem description
This seems to be possible since parquet is the storage format, but perhaps there are some caveats: creating commits in the transaction log may need to be done in Rust.