nmandery / h3ronpy

A data science toolkit for the H3 geospatial grid

Question: Lazy coordinates_to_cells #42

Open highway900 opened 7 months ago

highway900 commented 7 months ago

I am a new Polars user and am curious how to use the coordinates_to_cells function in a lazy context.

If I do what I think needs to be done, I get the error TypeError: 'Expr' object is not iterable. I can achieve my goal the eager way, but I am hoping I can do this with the lazy API.

import polars as pl
from h3ronpy.polars.vector import coordinates_to_cells

# Sample Polars DataFrame with latitude and longitude
data = {
    "x": [-74.0060, -118.2437, -87.6298],  # 'x' for longitude
    "y": [40.7128, 34.0522, 41.8781],  # 'y' for latitude
}

res = 8
df = (
    pl.DataFrame(data)
    .lazy()
    .with_columns(
        coordinates_to_cells(pl.col("x"), pl.col("y"), resarray=res)
        .h3.cells_to_string()
        .alias(f"h3_{res}")
    )
)
nmandery commented 7 months ago

What's required for this is being able to call coordinates_to_cells directly on a Polars expression (Expr). We already provide Polars expressions in https://github.com/nmandery/h3ronpy/blob/ca891fa5dfa8e1ea4dd7006d15d31bd294a45a2a/python/h3ronpy/polars/__init__.py#L57C7-L57C7 , but not for this functionality. The problem is that this function requires at least two series as input, and I do not see how this can be achieved using https://pola-rs.github.io/polars/py-polars/html/reference/api/polars.api.register_expr_namespace.html#polars.api.register_expr_namespace . Polars expressions seem to operate on only a single series. Please correct me if that is not the case; I am not up to date with the most recent versions of Polars.

What could be done is implementing an extension of LazyFrame (https://pola-rs.github.io/polars/py-polars/html/reference/api/polars.api.register_lazyframe_namespace.html), but I am not sure how useful this would be. It would only allow calling the function directly on LazyFrames, not from within expressions.

highway900 commented 7 months ago

Thanks for looking at this. I was mostly looking at using a LazyFrame and not explicitly using expressions. I will have a poke around with register_lazyframe_namespace. I think you answered my query, though: this currently isn't possible, so it's not just my lack of experience with Polars being the problem :)

BielStela commented 3 weeks ago

Hi ^^. In order to take multiple args, it could be implemented using the Polars plugin system, like in https://marcogorelli.github.io/polars-plugins-tutorial/lost_in_space/. Something along the lines of

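use h3o::{LatLng, Resolution};
use polars::prelude::*;
use pyo3_polars::derive::polars_expr;
use serde::Deserialize;

// Keyword arguments passed in from Python (assumed shape of H3Kwargs).
#[derive(Deserialize)]
struct H3Kwargs {
    resolution: u8,
}

#[polars_expr(output_type = UInt64)]
fn coordinates_to_cells(inputs: &[Series], kwargs: H3Kwargs) -> PolarsResult<Series> {
    let lats = inputs[0].f64()?;
    let lons = inputs[1].f64()?;
    let resolution = Resolution::try_from(kwargs.resolution).unwrap();

    // Keep one output row per input row: null or invalid coordinates become null.
    let cells = lats.iter().zip(lons.iter()).map(|(lat, lon)| match (lat, lon) {
        (Some(lat), Some(lon)) => LatLng::new(lat, lon)
            .ok()
            .map(|coord| u64::from(coord.to_cell(resolution))),
        _ => None,
    });

    Ok(UInt64Chunked::from_iter_options("cells", cells).into_series())
}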

and then register the function in pythonland with:

from pathlib import Path

import polars as pl
from polars.plugins import register_plugin_function
from polars.type_aliases import IntoExpr


def coordinates_to_cells(lat: IntoExpr, lon: IntoExpr, *, resolution: int) -> pl.Expr:
    # Dispatch to the compiled Rust plugin function of the same name.
    return register_plugin_function(
        plugin_path=Path(__file__).parent,
        args=[lat, lon],
        kwargs={"resolution": resolution},
        function_name="coordinates_to_cells",
        is_elementwise=True,
    )

This would allow us to operate on the LazyFrame example like this:

In [7]: df.collect()
Out[7]:
shape: (3, 2)
┌───────────┬─────────┐
│ x         ┆ y       │
│ ---       ┆ ---     │
│ f64       ┆ f64     │
╞═══════════╪═════════╡
│ -74.006   ┆ 40.7128 │
│ -118.2437 ┆ 34.0522 │
│ -87.6298  ┆ 41.8781 │
└───────────┴─────────┘

In [8]: df.select(cells=coordinates_to_cells("x", "y", resolution=8)).collect()
Out[8]:
shape: (3, 1)
┌────────────────────┐
│ cells              │
│ ---                │
│ u64                │
╞════════════════════╡
│ 616717907826573311 │
│ 616483633261182975 │
│ 616736054719807487 │
└────────────────────┘

However, this needs a custom plugin in Rust land, which needs to be built as a Polars plugin :/

I guess it could be done using the existing h3ronpy function coordinates_to_cells and doing some black magic with pl.Expr to extract the series from the multi-column expression, like

df.select(cell = pl.col("x", "y").h3.coordinates_to_cells(resolution=8))

But the documentation falls short, and I did not find anything similar in the wild.