thinkingmachines / geomancer

Automated feature engineering for geospatial data
MIT License
214 stars 16 forks source link

Add functionality for multiple `on` parameters for spells #67

Open ardieorden opened 5 years ago

ardieorden commented 5 years ago

Right now, a user can only input one on parameter to any of the Spells.

dist_primary = DistanceToNearest(
    on="primary",
    source_table="geospatial.ph_osm.gis_osm_roads_free_1",
    feature_name="dist_primary")

The on parameter is equivalent to a WHERE clause in SQL.

WHERE fclass = 'primary'

There might be cases wherein multiple on parameters are needed for feature engineering and it'd be useful to have that option available out of the box.

dist_primary_or_secondary = DistanceToNearest(
    on=["primary", "secondary"],
    source_table="geospatial.ph_osm.gis_osm_roads_free_1",
    feature_name="dist_primary_or_secondary")
WHERE fclass = 'primary' OR fclass = 'secondary'
ljvmiranda921 commented 5 years ago

@ardieorden if we have OR and AND in our SQL statements, how would you imagine its usage?

ardieorden commented 5 years ago

Right now, on refers solely to the fclass. In this context, I don't think we need to worry about the AND statement since the fclass is mutually exclusive.

If you make a query on Geofabrik data with a WHERE clause that says

WHERE fclass = 'primary' AND fclass = 'secondary'

then you wouldn't get any results because a road that has a primary fclass can't also have a secondary fclass at the same time.


But if on can refer to other columns aside from fclass and if it can accept logical and comparison operators, then I would think it would be nice to be able to do something like this:

df = DistanceToNearest(
    on= ["fclass", "is", "not", "primary", "or", "fclass", "is", "secondary", "and", " distance", "less than", "1000"]
    source_table="geospatial.ph_osm.gis_osm_roads_free_1",
    feature_name="dist_primary_or_secondary")
WHERE fclass != 'primary' OR fclass = 'secondary' AND distance < 1000

The logical and comparison operators can be reserved keywords for the on argument. Anything that's not a reserved keyword is assumed to be a field name.

I'm not sure how usable this feature will be for our current feature engineering work and it seems quite complicated. I'm ok with just implementing the OR logical operator for the on argument for now. What do you think @ljvmiranda921 ?

ljvmiranda921 commented 5 years ago

Parameter on can now use columns aside from fclass via the {column}:{filter} format. So I was thinking of something like:

bridge:T AND fclass:primary

Your idea of passing iterables is fine. To incorporate the operators I think we can follow something similar to SQLAlchemy, an or(*args) and and(*args) method: https://stackoverflow.com/questions/7942547/using-or-in-sqlalchemy.

import geomancer.filters as f

DistanceToNearest(
    on = f.and("bridge:T", "primary"),  # default is fclass if no filter
    **kwargs
)

Can also do chaining:

f.or("secondary", f.and("bridge:T", "primary"))
ardieorden commented 5 years ago

Cool! I like your suggestion much better.

ardieorden commented 5 years ago

@pberba Here's the issue.