Open ion-elgreco opened 9 months ago
I will accept this, but I don't expect to implement this anytime soon because we currently rely on the deltalake
package to do the writing, which cannot integrate nicely with our streaming engine.
> I will accept this, but I don't expect to implement this anytime soon because we currently rely on the deltalake package to do the writing, which cannot integrate nicely with our streaming engine.
That's fine, maybe in the future when Delta-RS can support ADBC it would be easier to build on top of.
@stinodego do you think it's possible with the current code base to return a RecordBatch Generator as output of a streaming engine?
If so, I would happily give it a try (I am touching more and more Rust nowadays, so maybe I can build it myself). I will probably need some pointers on which code paths to look at, since the Pola-rs codebase is so large :D
Actually, this is probably going to be a difficult one to start with... I'll start with some smaller ones first.
Upvote! This feature is getting more relevant as more and more Azure users migrate to OneLake, which uses the Delta table format.
A path forward is to do something similar to what deltalake does: instead of using the pyarrow engine, we write the parquet files with Polars and then just create a write transaction with the private APIs of deltalake. But for this to work, Polars first needs to add all these things:
Without these things it wouldn't be possible : )
I would also like to upvote this
This will be very nice to have.
In fact, `write_delta` might currently have a problem writing large datasets, as I am getting this error:
```
Traceback (most recent call last):
  File "<string>", line 197, in <module>
  File "<string>", line 193, in main
  File "/opt/miniconda/lib/python3.10/runpy.py", line 289, in run_path
    return _run_module_code(code, init_globals, run_name,
  File "/opt/miniconda/lib/python3.10/runpy.py", line 96, in _run_module_code
    _run_code(code, mod_globals, init_globals,
  File "/opt/miniconda/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "preprocess/main.py", line 580, in <module>
    run(args)
  File "preprocess/main.py", line 520, in run
    final_data_stream.write_delta(
  File "/opt/miniconda/lib/python3.10/site-packages/polars/dataframe/frame.py", line 3708, in write_delta
    data = self.to_arrow()
  File "/opt/miniconda/lib/python3.10/site-packages/polars/dataframe/frame.py", line 1224, in to_arrow
    record_batches = self._df.to_arrow()
pyo3_runtime.PanicException: polars' maximum length reached. Consider installing 'polars-u64-idx'.
```
I get this error despite having polars-u64-idx installed, and despite my dataset having fewer than 2^32 rows. When I batch manually it works, but of course that defeats the whole point.
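For reference, the manual batching workaround can be sketched like this. `DataFrame.slice` and `write_delta(mode=...)` are real Polars APIs, but the chunk size and helper names are illustrative:

```python
def chunk_offsets(total_rows: int, chunk_rows: int) -> list[tuple[int, int]]:
    """Compute (offset, length) pairs that cover total_rows in chunk_rows steps."""
    return [(o, min(chunk_rows, total_rows - o)) for o in range(0, total_rows, chunk_rows)]

def write_in_chunks(df, target: str, chunk_rows: int = 1_000_000) -> None:
    """Write a large polars DataFrame to a Delta table chunk by chunk.

    `df` is a polars DataFrame; requires polars and deltalake to be installed.
    """
    for i, (offset, length) in enumerate(chunk_offsets(df.height, chunk_rows)):
        # The first chunk creates the table; later chunks append to it.
        mode = "error" if i == 0 else "append"
        df.slice(offset, length).write_delta(target, mode=mode)
```

This sidesteps the panic by never materializing the whole frame as a single Arrow table, at the cost of one Delta commit per chunk.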
Problem description
This seems to be possible since parquet is the storage format, but perhaps there are some caveats: creating commits in the transaction log may need to be done in Rust.