pola-rs / polars

Dataframes powered by a multithreaded, vectorized query engine, written in Rust
https://docs.pola.rs

Implement/Expose iter_rows in Rust #10811

Open rollo-b2c2 opened 1 year ago

rollo-b2c2 commented 1 year ago

Problem description

The only way I can see to iterate over rows in Rust is this little monstrosity.

        let row_wise = (0..dataframe.height())
            .map(|x| dataframe.get_row(x).unwrap());

It would make a lot of sense to have `iter_rows` in Rust. I get that it's not how Arrow likes to be accessed, but there are a lot of use cases for it. Let's say you're running an aggregation on a server that outputs a small (under 100 rows) table: being able to send row-wise results makes it easier to integrate Polars into existing APIs that might not be columnar.
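For illustration, the row-wise view being asked for can be sketched in plain Rust with no Polars dependency. `SmallTable`, `Row`, and `iter_rows` below are hypothetical names invented for this sketch, not Polars types or methods. The idea is to zip the column iterators together so each column is walked once, instead of indexing every column again for every row.

```rust
/// A hypothetical two-column aggregation result stored column-wise,
/// standing in for a small DataFrame. This is not a Polars type.
pub struct SmallTable {
    pub symbol: Vec<String>,
    pub total: Vec<f64>,
}

/// One row-wise record, the shape a non-columnar API might expect.
#[derive(Debug, Clone, PartialEq)]
pub struct Row {
    pub symbol: String,
    pub total: f64,
}

impl SmallTable {
    /// Iterate rows by zipping the column iterators together.
    /// If the columns differ in length, iteration stops at the shorter one.
    pub fn iter_rows(&self) -> impl Iterator<Item = Row> + '_ {
        self.symbol
            .iter()
            .zip(self.total.iter())
            .map(|(symbol, total)| Row {
                symbol: symbol.clone(),
                total: *total,
            })
    }
}
```

With a real DataFrame the same shape would presumably come from zipping typed column iterators, but the exact calls depend on the Polars version in use.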

ritchie46 commented 1 year ago

We don't expose that functionality, as that is not the way polars should be used. We have never needed it ourselves internally.

Even on small data, the idiomatic way to use polars is to use the lazy API and create your queries with the DSL.

rollo-b2c2 commented 1 year ago

I've used Polars to aggregate some data. This returns a small dataframe. There's an existing UI API that wants rows of data, not columns.

What is the idiomatic way of giving the data back to the user?

rollo-b2c2 commented 1 year ago

> not the way polars should be used

Are you saying Polars should not be used to aggregate data which is sent to a UI in a row-wise format? 😕 I'd like to add Polars to my stack, but I'm not designing a project around it; I'm integrating it into one.

rollo-b2c2 commented 1 year ago

https://stackoverflow.com/questions/72440403/iterate-over-rows-polars-rust

I'm not the only person who'd find this useful. (Everyone using this project would be aware that it's slower than saving to Arrow, but the world doesn't run on Arrow.)

trueb2 commented 1 year ago

I find it useful to keep a small crate of functions that hide some of the verbose APIs. The API hasn't changed much in the last year in my experience.

For example, when I need to iterate over rows to do some math and generate multiple columns, I use the `map`/`map_many` interface and maybe flatten a struct column. Sometimes it's not just the interface but the math itself that requires row-wise iteration, such as digital filtering. User-defined functions still allow efficient access to the data in row order.

I don't think digital filtering belongs in Polars, and Polars doesn't prevent implementing it efficiently. Ergonomic row-wise iteration seems to me to be in the same category, because Polars intentionally doesn't implement it efficiently, even though there is a need for row-wise iteration.

Example filtering or map_many math

```rust
/// Apply a zero-phase low-pass filter to each of the named columns.
pub fn lowpass(
    df: LazyFrame,
    cutoff: f64,
    hz: f64,
    cols: &[&str],
) -> AResult<LazyFrame> {
    let df = df.with_columns(
        &cols
            .iter()
            .map(|c| {
                col(c)
                    .cast(DataType::Float32)
                    .map(pl_lowpass_filt_fn(cutoff, hz), Default::default())
                    .alias(c) // keep the input column name
            })
            .collect_vec(),
    );
    Ok(df)
}

/// Build the closure that filters a whole column in row order.
pub fn pl_lowpass_filt_fn(
    cutoff: f64,
    hz: f64,
) -> impl Fn(Series) -> Result<Option<Series>, PolarsError> {
    move |s: Series| -> Result<Option<Series>, PolarsError> {
        // Widen to f64 for the filter math, then narrow back to f32.
        let f = s.f32()?.into_no_null_iter().map(|f| f as f64);
        let sos = butter_filter_lowpass(Some(4), cutoff, hz);
        let bp = zero_phase_filter(&sos, f).map(|f| f as f32).collect_vec();
        Ok(Some(Float32Chunked::from_vec(s.name(), bp).into()))
    }
}
```

```rust
pub fn compute_hrv_from_beats_and_sounds(
    df: LazyFrame,
    time: &str,
    sounds: &str,
    beats: &str,
    hz: f64,
    min_bpm: f64,
    max_bpm: f64,
    hr_window_s: f32,
    // ... other params ...
) -> super::AResult<LazyFrame> {
    let df = df
        // Compute every HRV measure in one pass, packed into a struct column.
        .with_column(
            col(time)
                .map_many(
                    pl_compute_hrv_fn(hz, min_bpm, max_bpm, hr_window_s as f64),
                    &[col(sounds), col(beats)],
                    Default::default(),
                )
                .alias("hrv"),
        )
        // Flatten the struct fields back out into ordinary columns.
        .with_columns(&[
            col("hrv")
                .map(
                    |s| {
                        let s = s.struct_()?.field_by_name("hr")?;
                        Ok(Some(s))
                    },
                    Default::default(),
                )
                .alias("hr"),
            col("hrv")
                .map(
                    |s| {
                        let s = s.struct_()?.field_by_name("nni")?;
                        Ok(Some(s))
                    },
                    Default::default(),
                )
                .alias("nni"),
            // ...
        ]);
    // ...
}

pub fn pl_compute_hrv_fn(
    hz: f64,
    min_bpm: f64,
    max_bpm: f64,
    hr_window_s: f64,
) -> impl Fn(&mut [Series]) -> Result<Option<Series>, PolarsError> {
    move |s: &mut [Series]| -> Result<Option<Series>, PolarsError> {
        // Compute various HRV measures as ...
        let t = s[0].datetime()?.into_iter();
        let sound_pks = s[1].f32()?.into_iter();
        let beat_pks = s[2].f32()?.into_iter();
        // ...math...
        // The collect target type is assumed to be Float32Chunked here.
        let mut hrv_rmssd = hrv_rmssd.collect::<Float32Chunked>();
        hrv_rmssd.rename("hrv_rmssd");
        let struct_ca = StructChunked::new(
            "hrv",
            &[
                std::mem::take(&mut s[0]),
                hr.into(),
                nni.into(),
                hrv_rmssd.into(),
                // ... various other chunked arrays ...
            ],
        )?;
        Ok(Some(struct_ca.into_series()))
    }
}
```
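To make concrete why filtering needs the data in row order, here is a minimal plain-Rust sketch of a much simpler filter than the Butterworth zero-phase one used above: a single-pole IIR low-pass (exponential smoothing). `single_pole_lowpass` and `alpha` are hypothetical names for this sketch and are not part of Polars. Each output depends on the previous output, so the samples have to be consumed front to back.

```rust
/// Single-pole IIR low-pass (exponential smoothing):
///   y[n] = alpha * x[n] + (1 - alpha) * y[n - 1]
/// `alpha` is expected to lie in (0, 1]; larger values smooth less.
/// Hypothetical helper for illustration only.
pub fn single_pole_lowpass(samples: &[f64], alpha: f64) -> Vec<f64> {
    let mut out = Vec::with_capacity(samples.len());
    // Filter state: the previous output, absent before the first sample.
    let mut prev: Option<f64> = None;
    for &x in samples {
        let y = match prev {
            Some(p) => alpha * x + (1.0 - alpha) * p,
            // Seed the state with the first sample.
            None => x,
        };
        out.push(y);
        prev = Some(y);
    }
    out
}
```

A function of this shape, taking a slice of a column's values and returning a new vector, is the kind of body that could sit inside a `map` closure like `pl_lowpass_filt_fn`.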