pola-rs / polars

Dataframes powered by a multithreaded, vectorized query engine, written in Rust
https://docs.pola.rs

Add tests for writing-then-reading randomly-generated dataframes #16121

Open DeflateAwning opened 1 month ago

DeflateAwning commented 1 month ago

Description

Related to Issue #16109 (very broken parquet files).

Can we please add "unit" tests (or rather, integration tests) like the one below for every reader/writer pair (e.g., read/write_parquet, read/write_ndjson, etc.)? Ideally each would run at least 10 times with at least 10 different randomly generated dataframes, and with a few different schemas (including datetimes, etc.).

A test like this would have caught the non-deterministic failures in write_parquet; it is trivial to implement and very useful for verifying that the entire write-then-read path works correctly.

import tempfile
import polars as pl

# Round-trip the same randomly generated dataframe through Parquet several times.
with tempfile.NamedTemporaryFile() as f:
    for n in range(10):
        print(f"Run #{n + 1}: ", end="")

        # A random string column plus a row-index column.
        df = pl.DataFrame({
            "a": pl.Series(["123", "abc", "xyz"]).sample(50_000, with_replacement=True)
        }).with_row_index()

        # Write to the temporary file, then read it back and compare.
        df.write_parquet(f.name)
        f.seek(0)

        assert df.equals(pl.read_parquet(f.name))
        print("OK")
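
To cover the other readers/writers mentioned above, the same check could be parametrized over formats. A minimal sketch using in-memory buffers (the roundtrip helper and the choice of formats are illustrative, not an existing polars utility; Parquet and IPC are used here because they preserve dtypes exactly, so df.equals() is a fair comparison):

import io
import polars as pl


def roundtrip(df: pl.DataFrame, write, read) -> pl.DataFrame:
    # Write the frame into an in-memory buffer, then read it back.
    buf = io.BytesIO()
    write(df, buf)
    buf.seek(0)
    return read(buf)


df = pl.DataFrame({
    "a": pl.Series(["123", "abc", "xyz"]).sample(50_000, with_replacement=True)
}).with_row_index()

formats = {
    "parquet": (pl.DataFrame.write_parquet, pl.read_parquet),
    "ipc": (pl.DataFrame.write_ipc, pl.read_ipc),
}

for name, (write, read) in formats.items():
    assert df.equals(roundtrip(df, write, read)), f"round-trip failed for {name}"
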
ritchie46 commented 1 month ago

Yes, I think we need a Hypothesis test for this one: creating different data types, nested types, and file formats, and seeing whether we can round-trip them.

Pinging @stinodego as he is currently working on this.
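
A minimal sketch of what such a Hypothesis-based round-trip test could look like, assuming the dataframes strategy from polars.testing.parametric and assert_frame_equal from polars.testing (the strategy arguments vary between polars versions, and some dtypes may need to be excluded for formats that do not preserve them):

import io

from hypothesis import given, settings

import polars as pl
from polars.testing import assert_frame_equal
from polars.testing.parametric import dataframes


@given(df=dataframes(max_cols=5, max_size=100))
@settings(max_examples=50)
def test_parquet_roundtrip(df: pl.DataFrame) -> None:
    # Write the generated frame to an in-memory Parquet buffer and read it back.
    buf = io.BytesIO()
    df.write_parquet(buf)
    buf.seek(0)
    result = pl.read_parquet(buf)
    assert_frame_equal(result, df)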

stinodego commented 1 month ago

I'll add these when https://github.com/pola-rs/polars/pull/16062 is merged.

DeflateAwning commented 1 month ago

Looks like that one's merged now! Curious whether there has been any progress on this in the meantime?