Open rapatel0 opened 1 week ago
Can't reproduce with the following. Do you notice anything you did differently?
import fsspec
import ibis

# `config` is the poster's own settings module that supplies the S3
# credentials; its import is not shown in the thread.

if __name__ == "__main__":
    credentials = config.DevConfig.get_s3_credentials()
    fs = fsspec.filesystem(
        "s3",
        key=credentials.access_key,
        secret=credentials.secret_access_key,
    )
    con = ibis.duckdb.connect()
    con.register_filesystem(fs)
    df = con.read_parquet("s3://my-bucket/parquet-files/some.parquet")
    # Full row count versus the count after sampling roughly 10% of rows.
    count = df.count().execute()
    sampled_count = df.sample(0.1).count().execute()
    print(f"{count=}")
    print(f"{sampled_count=}")
    con.disconnect()
Outputs:
Note: Sample count varies between executions as ibis includes each row with a probability of 'fraction'. https://ibis-project.org/reference/expression-tables.html#ibis.expr.types.relations.Table.sample
count=1000
sampled_count=115
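To illustrate why the sampled count moves around between runs, here is a small pure-Python sketch of per-row (Bernoulli) sampling as described in the note above. It does not call ibis or DuckDB; `bernoulli_sample_count` is a hypothetical helper written only for this illustration, and the seeds exist just to make the runs repeatable.

```python
import random
from typing import Optional


def bernoulli_sample_count(n_rows: int, fraction: float, seed: Optional[int] = None) -> int:
    """Count rows kept when each row is kept independently with probability `fraction`."""
    rng = random.Random(seed)
    # rng.random() returns a float in [0.0, 1.0), so a row is kept
    # with probability `fraction`.
    return sum(1 for _ in range(n_rows) if rng.random() < fraction)


# Five simulated "executions" over a 1000-row table with fraction=0.1.
counts = [bernoulli_sample_count(1000, 0.1, seed=s) for s in range(5)]
print(counts)
```

With 1000 rows and a fraction of 0.1, the counts should land near 100 but rarely on it exactly, so a result like 115 is consistent with sampling being applied.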
fyi: also works using s3fs instead of fsspec directly
fs = s3fs.S3FileSystem(anon=False, key=credentials.access_key, secret=credentials.secret_access_key)
I had the fsspec variables set as environment variables. Also, I may have been using a MinIO bucket, but nothing else seems different. The file was still accessible, so I'm not sure this has anything to do with the issue.
I'll test again today in a fresh install and see if I can reproduce again.
What happened?
It appears that ibis.sample is returning the whole dataframe and ignoring the fraction parameter.
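A quick way to tell "the fraction was ignored" apart from normal sampling noise is to compare the sampled count against the binomial spread expected for per-row sampling. This is only a sketch: `looks_sampled` and its 5-sigma threshold are assumptions made for this example and are not part of ibis.

```python
import math


def looks_sampled(total: int, sampled: int, fraction: float, n_sigma: float = 5.0) -> bool:
    """Return True if `sampled` is plausible for Bernoulli sampling of `total` rows."""
    mean = total * fraction
    # Standard deviation of a binomial(total, fraction) count.
    std = math.sqrt(total * fraction * (1.0 - fraction))
    return abs(sampled - mean) <= n_sigma * std


print(looks_sampled(1000, 115, 0.1))   # True: 115 is well within the expected spread around 100
print(looks_sampled(1000, 1000, 0.1))  # False: the full table came back
```

Feeding it the values of `df.count().execute()` and `df.sample(0.1).count().execute()` would flag the behaviour reported here, where the sampled count equals the total.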
I've also tested with a to_parquet('test_data.parquet') call, and that saves a full, unsampled copy of the dataset.
ipython test:
What version of ibis are you using?
9.5.0
What backend(s) are you using, if any?
DuckDB with the s3fs filesystem.
The environment is set up with this pre-script (env vars store the S3 variables):
Relevant log output