ibis-project / ibis

the portable Python dataframe library
https://ibis-project.org
Apache License 2.0

bug - Regression in `.sample()` #10294

Open rapatel0 opened 1 week ago

rapatel0 commented 1 week ago

What happened?

It appears that `.sample()` is returning the whole table and ignoring the `fraction` parameter.

I've also tested with a to_parquet('test_data.parquet') call and that saves a full copy of the dataset unsampled.

ipython test:

In [4]: ibis.__version__
Out[4]: '9.5.0'

In [5]: df = con.read_parquet("s3://sdm-threat-mlflow/data_more.parquet")

In [6]: df.count().execute()
Out[6]: 53214

In [7]: df.sample(0.1).count().execute()
Out[7]: 53214
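
A quick sanity check that the result above is not just sampling noise. This is a minimal sketch that assumes each row is kept independently with probability `fraction` (a binomial model); the helper name `sample_count_zscore` is hypothetical and is not part of ibis.

```python
import math


def sample_count_zscore(total_rows: int, sampled_rows: int, fraction: float) -> float:
    """Standard deviations between the observed count and total_rows * fraction.

    Assumes independent per-row inclusion with probability `fraction`
    (binomial model). Illustration only, not ibis internals.
    """
    expected = total_rows * fraction
    std_dev = math.sqrt(total_rows * fraction * (1.0 - fraction))
    return (sampled_rows - expected) / std_dev


# Numbers from the session above: 53214 rows, and sample(0.1) also returned 53214.
z = sample_count_zscore(53214, 53214, 0.1)
print(f"z-score: {z:.1f}")
```

Under that assumption the expected count is about 5321 rows with a standard deviation of roughly 69, so a count equal to the full table is hundreds of standard deviations away and cannot be explained by random variation.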

What version of ibis are you using?

9.5.0

What backend(s) are you using, if any?

Duckdb with s3fs filesystem.

The environment is set up with this pre-script (the S3 credentials are stored in environment variables):

import s3fs
import ibis
import numpy as np
import pandas as pd

fs = s3fs.S3FileSystem(anon=False)

con = ibis.duckdb.connect(":memory:")
con.register_filesystem(fs)

print("available variables: ")
print("`fs` - S3FS Object initialized from environmental variables")
print("`con` - Ibis duckdb connection with s3fs initialized")

Relevant log output

Command returns with no errors or log output


akanz1 commented 1 week ago

I can't reproduce this with the following. Do you notice anything you did differently?

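import fsspec
import ibis

# `config` is the commenter's own settings module; its import is not shown in the issue.

if __name__ == "__main__":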
    credentials = config.DevConfig.get_s3_credentials()
    fs = fsspec.filesystem(
        "s3",
        key=credentials.access_key,
        secret=credentials.secret_access_key,
    )
    con = ibis.duckdb.connect()
    con.register_filesystem(fs)

    df = con.read_parquet("s3://my-bucket/parquet-files/some.parquet")

    count = df.count().execute()
    sampled_count = df.sample(0.1).count().execute()

    print(f"{count=}")
    print(f"{sampled_count=}")
    con.disconnect()

Outputs:

count=1000
sampled_count=115

Note: The sampled count varies between executions because ibis includes each row with a probability of `fraction`. See https://ibis-project.org/reference/expression-tables.html#ibis.expr.types.relations.Table.sample
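
To illustrate why the sampled count is not exactly 10% of the rows, here is a minimal plain-Python sketch of per-row sampling as described in the note. It only simulates the idea of including each row independently with probability `fraction`. It is not how ibis or DuckDB implement `sample()`, and the function name and seed are arbitrary choices for this example.

```python
import random


def bernoulli_sample_count(total_rows: int, fraction: float, rng: random.Random) -> int:
    """Count rows kept when each row is included independently with probability `fraction`."""
    return sum(1 for _ in range(total_rows) if rng.random() < fraction)


rng = random.Random(0)  # arbitrary seed, only so the sketch is repeatable
counts = [bernoulli_sample_count(1000, 0.1, rng) for _ in range(5)]
print(counts)  # five counts scattered around 100
```

With 1000 rows and a fraction of 0.1, counts typically land within a few tens of 100, so a result like 115 is unremarkable, whereas a count equal to the full table is not.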

fyi: also works using s3fs instead of fsspec directly

    fs = s3fs.S3FileSystem(anon=False, key=credentials.access_key, secret=credentials.secret_access_key)

rapatel0 commented 1 day ago

I had the fsspec variables set as environment variables. Also, I may have been using a MinIO bucket, but nothing else seems different. The file was still accessible, so I'm not sure that this has anything to do with the issue.

I'll test again today in a fresh install and see if I can reproduce again.