pysal / pointpats

Planar Point Pattern Analysis in PySAL
https://pysal.org/pointpats/
BSD 3-Clause "New" or "Revised" License
80 stars 26 forks source link

geopandas-centric api #135

Open knaaptime opened 4 months ago

knaaptime commented 4 months ago

pointpats predates geopandas and was originally designed around (n,2) arrays of coordinates. It hasnt been updated much over time like the rest of the pysal stack, but today it's much more common to work in geodataframes rather than numpy arrays (even though you can reformat into the structure pointpats expects fairly easily). It might be nice to have something that consumes geodataframes/series and outputs the same.

e.g. instead of (in addition to?) the current weighted_mean_center which takes and returns arrays, i end up using something like

def attribute_weighted_center(gdf, column):
    if column not in gdf.columns:
        raise ValueError(
            f"`attribute` {column} was passed but is not present in the dataframe"
        )
    pt = Point(
        weighted_mean_center(
            df.centroid.get_coordinates()[["x", "y"]].values,
            df[column].fillna(0).values,
        )
    )
    return gpd.GeoSeries([pt], crs=gdf.crs)

would that be of general use? If so,

ljwolf commented 4 months ago

Yeah, this is tricky... I don't mind functions having multiple inputs using something like libpysal.graph._utils._validate_geometry_inputs() or something--I think that makes a lot of sense.

But, output type should be consistent imho... what about an "as_geom=None" option that set to True if the input is a geometry array/GeoSeries?

martinfleis commented 4 months ago

Yes please. It feels really weird when teaching pointpats that it, out of nowhere, expects arrays. The same applies to mgwr and spreg.

I don't like the from_gdf API very much. I'd prefer functions to be smart about the input and consume anything, be it array of coords or any array-like of shapely points.

Fully agree with @ljwolf that the output should be consistent, so I don't really understand the proposal there, which makes it controllable but inconsistent (dependent on the input type). as_geometry keyword or alike could be fine but I would not try to make some smart decisions about the return type. Make it a simple bool, initially defaulting to coords with deprecation and settling with Point later.

Another question is whether we want to return Point or GeoSeries, where I tend to prefer the former.

ljwolf commented 4 months ago

😅 I meant "output type should be consistent with input type"... I think if we have a function that accepts one of many possible input types, constructs a new object, and returns it, that output type should match the input type.

I think it's generally ok to have options that return the output type of a function, hence making the output type "inconsistent" across runs. In numpy/scipy code, this is often a return_ argument, like numpy.unique(array, return_counts=True). If we were to implement it as return_geometry=False, then we're in a relatively odd situation where the output type and input type are forced out of sync while we're deprecating, and it's not quite clear that we want to fully deprecate sending arrays to get back an array?

Hence, I thought return_geometry=None allows us to do the right thing more often than not, unless forced to do something specific. We're only supporting two types of input, and we have the code to do this in libpysal.graph already.

Maybe I have just been using R too much...

sjsrey commented 4 months ago

I think this is an important discussion, that perhaps should move to lib so that we ensure consistency across packages - if that makes sense? That would help with the surprises @martinfleis hits in pointpats - those surprises reflect the history of the packages - much of the early pysal packages were born pre pandas/geopandas, so it was numpy/numeric centric.

I'm also old enough to remember debates about whether things that we take for granted today (pandas, matplotlib, geopandas) where going to become standards or not. So the philosophy has been conservative (which is probably an understatement when it comes to changing apis ;-.> )

knaaptime commented 4 months ago

Hence, I thought return_geometry=None allows us to do the right thing more often than not, unless forced to do something specific. We're only supporting two types of input, and we have the code to do this in libpysal.graph already.

i find the return_ pattern pretty familiar, so i think i'd second Levi's proposal. In that case, we basically pass of to the prep function graph uses

Another question is whether we want to return Point or GeoSeries, where I tend to prefer the former.

the Point is probably more sensible (though the first thing im likely to do is wrap it in a geoseries and explore.. :) )

martinfleis commented 4 months ago

@ljwolf can you think of any example of a reduction (like weighted_mean_center would be), that returns different objects depending on the input? I know R is doing that but in Python, I can think only of numpy ufuncs which return a scalar in case of a reduction and always the same type of scalar, while for array-like output it usually preserves the type. I think that the logic return 2x1 array of coords if array is an input, otherwise return shapely.Point very strange an unintuitive within our ecosystem.

then we're in a relatively odd situation where the output type and input type are forced out of sync while we're deprecating

I am not sure where this comes from but given this is not a ufunc I don't see this an issue at all. Or to be more precise, I don't see the reason why input and output types shall be in sync for stuff like centrography. If I ignore what happens in R, I don't see a precedent, though that may just be my ignorance.

I think we should pick this discussion during the next PySAL dev call, to be more efficient.

ljwolf commented 4 months ago

strange and unintuitive within our ecosystem

I don't think it's that foreign, and we do something similar already in pointpats.geometry. But, fine to chat about this at the developer meeting. I would certainly like to avoid duplicating methods with from_gdf, and hope for a solution that avoids emitting numpy when shapely is provided.

I don't see a reason why input and output types shall be in sync for stuff like centrography.

I think it makes sense, if provided shapely-based geometries, to emit shapely-based geometries, especially if the change is motivated by, as @knaaptime puts it:

might be nice to have something that consumes geodataframes/series and outputs the same.