pelias / wof-admin-lookup

Who's on First Admin Lookup for the Pelias Geocoder
https://pelias.io
MIT License

RFC: optionally perform multiple PIP lookups per doc #300

Open missinglink opened 3 years ago

missinglink commented 3 years ago

Hi @orangejulius, @Joxit, @blackmad

I had a thought last night that we can fairly easily improve recall for queries where the user enters the name of a nearby parent instead of the parent assigned by PIP, i.e. they get the neighbourhood wrong.

We already have the postal cities mapping, which works well when a postcode is present. The postal cities mapping adds aliases to the parent field, so a record can have multiple 'locality' values, for instance.

We can extend this further by performing multiple point-in-polygon lookups per document and recording each of the additionally matched parents as an alias.

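I threw this PR together quickly, so it's not exactly what I would recommend merging, but I wanted to solicit feedback on the general idea, which is:

  • use the doc.getCentroid() for the primary parent info
  • if there is a 'meta' property specified with additional points, use results from those lookups for aliases of the parent.

The wof-admin-lookup module would not be responsible for determining which additional points to use; we can update the importers to use this functionality as required, varying the number of points based on geometry type and layer.

  • I think we can begin by adding two additional points to the polyline importer, so we PIP the start and end points of a street in addition to the midpoint
  • We may also want to apply this logic to some of the "lower level" WOF records, such as neighbourhoods. We could provide four additional points at the corners of the bbox, or even extend this to the 8 compass directions.
  • Finally, we may also want to apply this to points using a similar method. This would greatly improve the 'postal cities' and 'realestate cities' issues at the cost of significantly more PIP work.

Below is a pretty picture I drew to illustrate how this might work for point, linestring and polygon geometry types. In each case the poorly drawn pin is the centroid we're currently using, and the crosshairs highlighted in yellow represent additional points we might look up for aliases.

[image: IMG_20200909_095004_2] https://user-images.githubusercontent.com/738069/92572330-948b8d80-f284-11ea-808e-bfa53158c6fb.jpg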
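To make the idea concrete, here is a rough sketch of the lookup flow. It is not the code in this PR: `lookup` is a hypothetical stand-in for the PIP service (it just splits the world at longitude 0 so the example runs on its own), and `multiPip` and the result shape are assumptions made for illustration.

```javascript
// Hypothetical stand-in for a PIP service: maps a point to its parents by layer.
// A real lookup would query Who's on First polygons; this one uses a
// longitude threshold so the example is self-contained.
function lookup(point) {
  return point.lon < 0
    ? { locality: [{ id: 1, name: 'West Town' }] }
    : { locality: [{ id: 2, name: 'East Town' }] };
}

// Primary parents come from the centroid; parents found at the additional
// points are appended as aliases, skipping any id already recorded.
function multiPip(centroid, additionalPoints, lookupFn) {
  const parents = {};
  const seen = new Set();

  for (const point of [centroid, ...additionalPoints]) {
    const result = lookupFn(point);
    for (const [layer, matches] of Object.entries(result)) {
      for (const match of matches) {
        const key = `${layer}:${match.id}`;
        if (seen.has(key)) { continue; }
        seen.add(key);
        (parents[layer] = parents[layer] || []).push(match);
      }
    }
  }
  return parents;
}

const parents = multiPip(
  { lat: 0, lon: -0.001 },                           // centroid
  [{ lat: 0, lon: -0.002 }, { lat: 0, lon: 0.001 }], // additional points
  lookup
);
console.log(parents.locality.map((p) => p.name)); // → [ 'West Town', 'East Town' ]
```

Because the centroid is always looked up first, the first entry per layer stays the primary parent and everything after it is an alias.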

blackmad commented 3 years ago

From my perspective, I'm most interested in applying this to points, since it's fully qualified street addresses where we generally see postalcities/realestatecities issues. Two things I worry about for points are 1) we might end up adding a nontrivial number of new tokens, and 2) tuning the radius/sampling seems tough. But I'm open to helping out with this approach.

For lines, it seems helpful for the long roads + interpolation use case and may be the best solution, though I think some distance-based sampling would probably help.

For polygons, I'm struggling to figure out what kinds of queries this would help? It seems like the main use case there is searching for a neighborhood near the interface between two cities?
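For the line case, distance-based sampling might look something like the sketch below. The function names, the `{ lat, lon }` point shape and the spacing value are all assumptions for illustration, not anything the importers do today. It walks the polyline and emits a point roughly every `spacing` metres, always keeping the first and last vertex.

```javascript
const EARTH_RADIUS_M = 6371008.8; // mean Earth radius in metres
const toRad = (deg) => (deg * Math.PI) / 180;

// Great-circle distance in metres between two { lat, lon } points (haversine).
function haversine(a, b) {
  const dLat = toRad(b.lat - a.lat);
  const dLon = toRad(b.lon - a.lon);
  const h = Math.sin(dLat / 2) ** 2 +
    Math.cos(toRad(a.lat)) * Math.cos(toRad(b.lat)) * Math.sin(dLon / 2) ** 2;
  return 2 * EARTH_RADIUS_M * Math.asin(Math.sqrt(h));
}

// Returns points spaced roughly `spacing` metres apart along the line,
// always including the first and last vertex. Assumes spacing > 0.
function sampleLine(line, spacing) {
  const samples = [line[0]];
  let untilNext = spacing; // distance left before the next sample is due

  for (let i = 1; i < line.length; i++) {
    const a = line[i - 1];
    const b = line[i];
    const segment = haversine(a, b);
    let travelled = 0;

    while (segment - travelled >= untilNext) {
      travelled += untilNext;
      const t = travelled / segment; // linear interpolation; fine for short segments
      samples.push({
        lat: a.lat + (b.lat - a.lat) * t,
        lon: a.lon + (b.lon - a.lon) * t
      });
      untilNext = spacing;
    }
    untilNext -= segment - travelled; // carry the remainder into the next segment
  }

  const last = line[line.length - 1];
  const tail = samples[samples.length - 1];
  if (tail.lat !== last.lat || tail.lon !== last.lon) { samples.push(last); }
  return samples;
}

// A street of roughly 1.1 km along the equator, sampled every 500 m.
const street = [{ lat: 0, lon: 0 }, { lat: 0, lon: 0.01 }];
const samples = sampleLine(street, 500);
console.log(samples.length); // → 4 (start, two interior samples, end)
```

The spacing then becomes the single tuning knob: a larger value means fewer PIP lookups and fewer new tokens per street.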

orangejulius commented 3 years ago

Interesting. This definitely makes the most sense IMO for roads (lines), since we have the current problem that we can only assign a single neighbourhood, borough, city, etc to the road where clearly it might equally belong to multiple. There would still be issues with display, but at least we can make search function better.

Somewhat related, and more complicated to implement, would be a PIP step at query time for interpolated results. This would prevent a street where the centroid is in City A from forcing all interpolated results to belong to that city, even if a large portion of the road (and therefore the addresses on the road) belong to nearby City B.

blackmad commented 3 years ago

Do we currently have a way in the index to add on secondary admin tokens that we won't accidentally use to construct labels/addresses?

orangejulius commented 3 years ago

Yes, that already exists just fine, using the array-like nature of all Elasticsearch fields (just the same as we use for other aliases).

Code can call

doc.addParent(...)

multiple times and only the first will be used for display.
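As a rough illustration of that behaviour, the sketch below uses a simplified stand-in class rather than the real pelias-model document. `SketchDoc`, `displayParent` and the three-argument `addParent` are assumptions made for the example, and the `locality` / `locality_id` array layout is assumed as well.

```javascript
// Hypothetical, simplified stand-in for a Pelias document; not the pelias-model API.
class SketchDoc {
  constructor() {
    this.parent = {};
  }

  // Each call appends to the arrays for that layer, so later calls become aliases.
  addParent(layer, name, id) {
    (this.parent[layer] = this.parent[layer] || []).push(name);
    (this.parent[`${layer}_id`] = this.parent[`${layer}_id`] || []).push(id);
    return this;
  }

  // Display uses only the first value; the remaining values stay searchable.
  displayParent(layer) {
    return (this.parent[layer] || [])[0];
  }
}

const doc = new SketchDoc()
  .addParent('locality', 'West Town', '1')  // primary, from the centroid
  .addParent('locality', 'East Town', '2'); // alias, from an additional point

console.log(doc.displayParent('locality')); // → West Town
console.log(doc.parent.locality);           // → [ 'West Town', 'East Town' ]
```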

Joxit commented 3 years ago

Interesting. A few years ago (https://github.com/whosonfirst-data/whosonfirst-data/issues/1094#issue-308392038), I had some problems with addresses assigned to the neighboring locality. I think this might fix this kind of issue (if it still exists). In my case we fixed this with a data update.

What distance from the original point should we use? A delta in degrees? Metres?

This sounds promising!
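On the units question, one option is to specify the radius in metres and convert it to degree deltas per point, since a degree of longitude shrinks with latitude. The sketch below is only an illustration: `compassPoints` is a hypothetical helper, and 111320 metres per degree is a spherical approximation. It generates the 8 compass-direction points around a centre.

```javascript
const METRES_PER_DEGREE_LAT = 111320; // approximate; assumes a spherical Earth

// Returns the 8 compass-direction points `radius` metres from `centre`.
// Not suitable very close to the poles, where cos(lat) approaches zero.
function compassPoints(centre, radius) {
  const dLat = radius / METRES_PER_DEGREE_LAT;
  // A degree of longitude shrinks with the cosine of the latitude.
  const dLon = radius / (METRES_PER_DEGREE_LAT * Math.cos((centre.lat * Math.PI) / 180));

  const points = [];
  for (let i = 0; i < 8; i++) {
    const bearing = (i * Math.PI) / 4; // 0 = north, then clockwise in 45 degree steps
    points.push({
      lat: centre.lat + dLat * Math.cos(bearing),
      lon: centre.lon + dLon * Math.sin(bearing)
    });
  }
  return points;
}

// At 60 degrees north, the same radius needs twice the longitude delta
// that it needs as a latitude delta.
const ring = compassPoints({ lat: 60, lon: 10 }, 111.32);
console.log(ring.length); // → 8
```

A fixed delta in degrees would be simpler, but it would cover a much narrower east-west distance at high latitudes than near the equator.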