ibis-project / ibis

the portable Python dataframe library
https://ibis-project.org
Apache License 2.0
4.3k stars 537 forks source link

feat(snowflake): implement Table.sample #9071

Closed IndexSeek closed 1 week ago

IndexSeek commented 2 weeks ago

This PR aims to add support for SAMPLE/TABLESAMPLE for Snowflake to build on the work done in #7377.

As the SAMPLE/TABLESAMPLE clause seems to be written the same way between Snowflake and DuckDB in SQL, I was able reuse code from: https://github.com/ibis-project/ibis/blob/b48f451de77bd9af832cea10d7ddf8c44097c45a/ibis/backends/duckdb/compiler.py#L118-L128

Snowflake Docs: SAMPLE/TABLESAMPLE

Code Snippet I used to validate it was working:

import ibis

con = ibis.snowflake.connect(create_object_udfs=False)

t = con.table("WOWAH_DATA_RAW", database=("WOWAH", "PUBLIC"))

print(t.count().execute())

for expression in (
    t.sample(fraction=0.01),
    t.sample(fraction=0.01, seed=42),
    t.sample(fraction=0.01, method="block", seed=42),
):
    print(ibis.to_sql(expression, dialect="snowflake"))
    print(expression.count().execute())

Results:

10826734
SELECT
  *
FROM "WOWAH"."PUBLIC"."WOWAH_DATA_RAW" AS "t0" TABLESAMPLE bernoulli (1.0)
107897
SELECT
  *
FROM "WOWAH"."PUBLIC"."WOWAH_DATA_RAW" AS "t0" TABLESAMPLE bernoulli (1.0) SEED (42)
107953
SELECT
  *
FROM "WOWAH"."PUBLIC"."WOWAH_DATA_RAW" AS "t0" TABLESAMPLE system (1.0) SEED (42)
107953
IndexSeek commented 2 weeks ago

A few thoughts/questions I have on this:

Should we wait to move this logic to the SQLGlotCompiler?

Should we break apart the first sample test? It made it past the first assertion but failed on df = t.sample(0.1, method="block").execute() due to this error: SAMPLE clause on views only supports row wise sampling without seed. I marked that to be skipped in the interim, but thinking we would want to cover the first scenario at least or test it without a view.

CC: @cromano8 as we were discussing this behavior.

cromano8 commented 2 weeks ago

Definitely going to check this out tomorrow/wednesday

IndexSeek commented 1 week ago

I am marking this as a draft for the time being; I haven't been able to dig back in yet.

I'm currently working on setting things up to properly test locally against this backend rather than relying on the cloud-ci-runs. This should make future PRs a little smoother for me as well.

cpcloud commented 1 week ago

@IndexSeek I can help here, there's only two kinds of failures and I don't think there's anything else to do here:

  1. XPASS: these are tests that fail when they start succeeding. Have xfails in strict mode so that when tests that were previously failing start succeeding we know.
  2. Transient BigQuery errors: nothing to do here for the moment except restart the test, which I can do.
cpcloud commented 1 week ago

Snowflake sample tests are passing:

cloud in 🌐 falcon in …/ibis on ξ‚  feat/snowflake-sample is πŸ“¦ v9.0.0 via 🐍 v3.12.2 via ❄️   impure (ibis-3.12.2-env) took 20s
❯ pytest 'ibis/backends/tests/test_generic.py' -k sample -m snowflake -vv
======================================================================================== test session starts =========================================================================================
platform linux -- Python 3.12.2, pytest-8.2.0, pluggy-1.5.0 -- /nix/store/cdmvsa912j1xj2d6kfsv0b63ffnylkvv-python3-3.12.2-env/bin/python3.12
cachedir: .pytest_cache
hypothesis profile 'dev' -> deadline=None, max_examples=50, suppress_health_check=[HealthCheck.too_slow], database=DirectoryBasedExampleDatabase(PosixPath('/home/cloud/ibis/.hypothesis/examples'))
benchmark: 4.0.0 (defaults: timer=time.perf_counter disable_gc=False min_rounds=5 min_time=0.000005 max_time=1.0 calibration_precision=10 warmup=False warmup_iterations=100000)
Using --randomly-seed=1209062198
rootdir: /home/cloud/ibis
configfile: pyproject.toml
plugins: hypothesis-6.100.2, benchmark-4.0.0, pytest_httpserver-1.0.10, clarity-1.0.1, randomly-3.15.0, repeat-0.9.3, timeout-2.3.1, xdist-3.6.1, cov-5.0.0, anyio-4.3.0, mock-3.14.0, snapshot-0.9.0
collected 3660 items / 3656 deselected / 4 selected

ibis/backends/tests/test_generic.py::test_sample[snowflake-block] XFAIL (SAMPLE clause on views only supports row wise sampling without seed.) [ 25%]
ibis/backends/tests/test_generic.py::test_sample_memtable[snowflake] PASSED [ 50%]
ibis/backends/tests/test_generic.py::test_sample[snowflake-row] PASSED   [ 75%]
ibis/backends/tests/test_generic.py::test_sample_with_seed[snowflake] PASSED [100%]
IndexSeek commented 1 week ago

That's awesome; thank you so much for helping me get this across the finish line!

  1. XPASS: these are tests that fail when they start succeeding. Have xfails in strict mode so that when tests that were previously failing start succeeding we know.

Ahh, that makes sense! I was digging around to see if I had forgotten to remove it/add it from/to an iterable somewhere.