pola-rs / polars

Dataframes powered by a multithreaded, vectorized query engine, written in Rust
https://docs.pola.rs

Projection pushdown not working for AnonymousScan when filtering on calculated column #17131

Open datapythonista opened 4 months ago

datapythonista commented 4 months ago

Reproducible example

#[test]
fn scan_anonymous_fn_with_options() -> PolarsResult<()> {
    let function = Arc::new(|scan_opts: AnonymousScanArgs| {
        assert_eq!(scan_opts.n_rows, Some(3));
        assert_ne!(scan_opts.with_columns, None);
        Ok(fruits_cars())
    });

    let args = ScanArgsAnonymous {
        schema: Some(Arc::new(fruits_cars().schema())),
        ..ScanArgsAnonymous::default()
    };

    let df = LazyFrame::anonymous_scan(function, args)?
        .select([col("A"), col("fruits")])
        .fetch(3)?;

    assert_eq!(df.shape(), (3, 2));
    Ok(())
}

Log output

No response

Issue description

Not sure if I'm missing something, but looks like projection pushdown is not working for AnonymousScan.

Besides the provided failing test in #17130, I tried implementing a struct with the AnonymousScan trait, and specifying:

fn allows_projection_pushdown(&self) -> bool {
    true
}

When Polars calls the scan function, I'd expect AnonymousScanArgs.with_columns to contain the columns needed by the select, but I receive None instead.

AnonymousScanArgs.n_rows seems to be correctly receiving the value from .fetch(), so the problem seems specific to with_columns.

Expected behavior

I'd expect the AnonymousScanArgs passed to my scan function in my AnonymousScan trait implementation to contain the projection with the required columns only, not None.

Installed versions

cargo test --all-features

ritchie46 commented 4 months ago

It seems to work. You must implement the trait and use that trait object as the anonymous scan.

#[test]
fn scan_anonymous_fn_with_options() -> PolarsResult<()> {
    struct MyScan {}

    impl AnonymousScan for MyScan {
        fn as_any(&self) -> &dyn Any {
            self
        }

        fn allows_projection_pushdown(&self) -> bool {
            true
        }

        fn scan(&self, scan_opts: AnonymousScanArgs) -> PolarsResult<DataFrame> {
            assert_ne!(scan_opts.with_columns, None);
            assert_ne!(scan_opts.n_rows, None);
            let out = fruits_cars().select(scan_opts.with_columns.unwrap().as_ref())?;
            Ok(out.slice(0, scan_opts.n_rows.unwrap()))
        }
    }

    let function = Arc::new(MyScan {});

    let args = ScanArgsAnonymous {
        schema: Some(Arc::new(fruits_cars().schema())),
        ..ScanArgsAnonymous::default()
    };

    let q = LazyFrame::anonymous_scan(function, args)?
        .select([col("A"), col("fruits")])
        .limit(3);

    let df = q.collect()?;

    assert_eq!(df.shape(), (3, 2));
    Ok(())
}

Note that fetch doesn't do a slice pushdown. Fetch does not lead to correct queries; it just limits a scan to produce only n rows (though anonymous scans don't have to respect it). Fetch is only intended for debugging purposes.
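
Since both n_rows and with_columns can arrive as None, a scan may prefer to treat them as optional hints rather than unwrap them. The following is a minimal, std-only sketch of that idea. ScanHints, Table and scan_with_hints are hypothetical stand-ins invented for illustration; they are not Polars types, and the real field types of AnonymousScanArgs may differ.

```rust
/// Hypothetical stand-in for the two `AnonymousScanArgs` fields discussed
/// in this thread. This is NOT the Polars struct.
struct ScanHints {
    n_rows: Option<usize>,
    with_columns: Option<Vec<String>>,
}

/// Toy column-oriented table: a list of (column name, values) pairs.
type Table = Vec<(String, Vec<i64>)>;

/// Apply the hints when they are present and fall back to the full
/// source when they are `None`, instead of panicking on `unwrap()`.
fn scan_with_hints(source: &Table, hints: &ScanHints) -> Table {
    source
        .iter()
        // Projection hint: keep only the requested columns, or all of them.
        .filter(|(name, _)| match &hints.with_columns {
            Some(cols) => cols.contains(name),
            None => true,
        })
        // Row-limit hint: truncate each column, or keep every row.
        .map(|(name, values)| {
            let n = hints.n_rows.unwrap_or(values.len()).min(values.len());
            (name.clone(), values[..n].to_vec())
        })
        .collect()
}
```

With no hints the whole table comes back unchanged; with both hints set, only the requested columns and the first n rows are returned.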

datapythonista commented 4 months ago

Interesting. It seems to be a bit more complex than I thought. This seems to fail only when using with_columns, and I can reproduce it in my project with 0.40, but not on main. I assumed too quickly that projection pushdown was always failing for AnonymousScan, sorry about that. I'll check again in my project once 0.41 is working, as I think it's now fixed even when using with_columns.

I'll open a PR with your test, I think it should be useful to have it in the test suite. Thanks for the help with this!

ritchie46 commented 4 months ago

0.41.2 is released. Can we close this one?

datapythonista commented 4 months ago

I had another look, seems like the problem is filtering by a calculated column. The test we have now in the test suite fails with this pipeline:

    let q = LazyFrame::anonymous_scan(function, args)?
        .with_column((col("A") * lit(2)).alias("A2"))
        .filter(col("A2").lt(lit(6))) // <- ADDED THIS LINE
        .select([col("A2"), col("fruits")])
        .limit(3);
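
To make the expected projection for this pipeline concrete, here is a small std-only sketch that walks a plan backwards and collects the source columns the scan would need. It is a toy model under stated assumptions, not how Polars implements projection pushdown: Node, required_scan_columns and issue_pipeline are hypothetical names, and expressions are reduced to the column names they reference.

```rust
use std::collections::BTreeSet;

/// Tiny stand-in for a logical plan node. These are NOT Polars types.
enum Node {
    /// `.select([...])` with plain column references.
    Select(Vec<&'static str>),
    /// `.filter(...)`: the columns referenced by the predicate.
    Filter(Vec<&'static str>),
    /// `.with_column(expr.alias(name))`: `name` is computed from `inputs`.
    WithColumn { name: &'static str, inputs: Vec<&'static str> },
}

/// Walk the plan from the last node back to the scan and return the
/// source columns needed. `None` means "no projection, read everything".
fn required_scan_columns(plan: &[Node]) -> Option<BTreeSet<&'static str>> {
    let mut needed: Option<BTreeSet<&'static str>> = None;
    for node in plan.iter().rev() {
        match node {
            Node::Select(cols) => {
                // Only the outermost select narrows the set in this model.
                if needed.is_none() {
                    needed = Some(cols.iter().copied().collect());
                }
            }
            Node::Filter(cols) => {
                // The predicate's columns must survive up to the filter.
                if let Some(set) = needed.as_mut() {
                    set.extend(cols.iter().copied());
                }
            }
            Node::WithColumn { name, inputs } => {
                // A needed computed column is replaced by its inputs.
                if let Some(set) = needed.as_mut() {
                    if set.remove(name) {
                        set.extend(inputs.iter().copied());
                    }
                }
            }
        }
    }
    needed
}

/// The pipeline from the comment above, in execution order.
fn issue_pipeline() -> Vec<Node> {
    vec![
        Node::WithColumn { name: "A2", inputs: vec!["A"] },
        Node::Filter(vec!["A2"]),
        Node::Select(vec!["A2", "fruits"]),
    ]
}
```

In this model, required_scan_columns(&issue_pipeline()) yields the set containing "A" and "fruits", which is the projection one would expect the scan to receive in with_columns rather than None.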