Open danielgafni opened 1 year ago
@chitralverma Could you take a look at this one?
Sure, but I'm on leave at the moment, so it may take some time.
My guess is that this is caused by the parallel readers.
Interesting. Just to clarify, the same test passes with Parquet.
I tried adding a 0.5s sleep, but it didn't really help (see CI in https://github.com/danielgafni/dagster-polars/pull/10)
Hey @chitralverma, any idea what's going on here? Even a 0.5s sleep doesn't help. Is that normal?
P.S. Please excuse me if this ping is annoying; my guess was that you might have forgotten about this issue because of your leave.
Hi @danielgafni, I tried reproducing this but couldn't. Can you please post a minimal reproducible example with just the Polars code?
Also, were you able to reproduce this without Polars at all, using delta-rs directly?
Hey, actually I can't reproduce it without the hypothesis.given decorator.
This test should be logically equivalent to the one I initially provided, but it passes with no errors:
import shutil

import polars as pl
import polars.testing as pl_testing
from _pytest.tmpdir import TempPathFactory
from polars.testing.parametric import dataframes

# Strategy that generates random DataFrames, skipping the excluded dtypes.
strategy = dataframes(
    excluded_dtypes=[
        pl.Categorical,
        pl.Duration,
        pl.Time,
        pl.UInt8,
        pl.UInt16,
        pl.UInt32,
        pl.UInt64,
        pl.Datetime("ns", None),
    ],
    min_size=5,
    allow_infinities=False,
)


def test_polars_delta_io(tmp_path_factory: TempPathFactory):
    for i in range(500):
        # Fresh directory per iteration, so every write starts from an empty table.
        tmp_path = tmp_path_factory.mktemp("data")
        df = strategy.example()
        assert isinstance(df, pl.DataFrame)
        df.write_delta(str(tmp_path))
        pl_testing.assert_frame_equal(df, pl.read_delta(str(tmp_path)))
        shutil.rmtree(str(tmp_path))
As far as I know, hypothesis doesn't parallelize test execution, so this is surprising to me.
It seems like the issue is on the hypothesis side, though?
Still happening in my CI even after I added a 0.1s sleep between tests. It's rare, but it happens.
Polars version checks
[X] I have checked that this issue has not already been reported.
[X] I have confirmed this bug exists on the latest version of Polars.
Issue description
Sometimes the rows get shuffled when writing a DataFrame and reading it with deltalake:
Link to CI logs
Reproducible example
Expected behavior
This should not happen
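When a round trip fails, it can help to confirm that the mismatch really is only a reordering and not lost or altered rows. The sketch below is one way to check that in plain Python. The helper name `classify_mismatch` is hypothetical (it is not part of Polars or deltalake), and it assumes each row is a tuple of hashable values, such as the tuples returned by `DataFrame.rows()`.

```python
from collections import Counter


def classify_mismatch(expected_rows, actual_rows):
    """Compare two row sequences and say how they differ.

    Hypothetical diagnostic helper. Assumes every row is a sequence of
    hashable values (e.g. the tuples from pl.DataFrame.rows()).
    """
    expected = [tuple(row) for row in expected_rows]
    actual = [tuple(row) for row in actual_rows]

    # Same rows in the same order.
    if expected == actual:
        return "equal"

    # Same multiset of rows (duplicates counted), but in a different order.
    if Counter(expected) == Counter(actual):
        return "reordered"

    # Rows were added, dropped, or changed.
    return "different"


written = [(1, "a"), (2, "b"), (3, "c")]
read_back = [(3, "c"), (1, "a"), (2, "b")]
print(classify_mismatch(written, read_back))
```

If the result is "reordered", the data itself survived the write and read, and only the row order changed. In that case, `polars.testing.assert_frame_equal` also accepts `check_row_order=False` for tests that do not need to assert ordering.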
Installed versions