pelias / csv-importer

Import arbitrary data in CSV format to Pelias
MIT License

Including city name in forward geocoding text search not working as expected. #107

Open gagandeepsingh1105 opened 5 months ago

gagandeepsingh1105 commented 5 months ago

Hi there,

I am an engineer at the Public Health Agency of Canada. We have a use case for which we are looking to deploy an instance of the Pelias Geocoder. We have custom input data (a CSV file) containing Canadian locations only, and we want to use Pelias forward geocoding to convert text addresses to longitudes and latitudes. For this reason we are trying to deploy csv-importer. Below is a snapshot of the input data that we have ingested into our Elasticsearch instance: image

When using forward geocoding, if we supply the street number, street name and province, the API returns a response with confidence=1 and source=custom:

API request: https://geocoder.alpha.phac.gc.ca/v1/search?text="283 prince philip dr nl"&sources=custom image

But if we also include the city name in the input text, the confidence drops to 0.6 and the match type changes to fallback. As you may have noticed, we do have a column named 'city' in our input data, but csv-importer does not seem to read it, and the result falls back to the whosonfirst data source.

We have tried a couple of things on our end to resolve this issue:

1) In the pelias.json configuration file, we added a "docs" key to map the columns in the CSV file to those in the Pelias schema, but got the following error:

image

Snapshot of pelias.json file:

"csv": {
  "datapath": "/data/csv-importer-files",
  "files": ["NLFD_test_changed.csv"],
  "docs": [
    { "name": "LAT", "type": "number", "required": true },
    { "name": "LON", "type": "number", "required": true },
    { "name": "SOURCE", "type": "number", "required": true },
    { "name": "LAYER", "type": "number", "required": true },
    { "name": "NUMBER", "type": "string", "required": false, "es_field": "address.number" },
    { "name": "STREET", "type": "string", "required": false, "es_field": "address.street" },
    { "name": "CITY", "type": "string", "required": false, "es_field": "address.city" },
    { "name": "NAME", "type": "string", "required": false, "es_field": "address.name" },
    { "name": "MAIL_PROV_ABVN", "type": "string", "required": false, "es_field": "address.region" },
    { "name": "POSTALCODE", "type": "string", "required": false, "es_field": "address.postalcode" }
  ],
  "download": []
}

2) We also tried giving the column mapping in a separate file, but that did not work either and produced the same error:

image

Snapshot of pelias.json file:

{
  "imports": {
    "csv": {
      "datapath": "/data",
      "files": [
        "canada-locations.csv"
      ],
      "mappings": "/code/csv_mapping.json"
    }
  }
}

We then defined the column mappings in a separate file:

{
  "mappings": {
    "id": "id",
    "latitude": "latitude",
    "longitude": "longitude",
    "number": "house_number",
    "street": "street",
    "city": "city",
    "region": "region",
    "province": "province",
    "country": "country",
    "postalcode": "postalcode",
    "category": "category",
    "name": "name",
    "layer": "address"
  }
}

Steps to Reproduce

1) Deploy an instance of the Pelias Geocoder with csv-importer running.
2) Make the configuration changes described above in the pelias.json file.
3) Try the following API calls:
https://geocoder.alpha.phac.gc.ca/v1/search?text="283 prince philip dr nl"&sources=custom
https://geocoder.alpha.phac.gc.ca/v1/search?text="283 prince philip dr st john's nl"&sources=custom

Expected behavior

Including the city name in the search text should also give confidence=1 and source=custom.
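To make the comparison easier to repeat, here is a minimal Python sketch that builds the two requests from the steps above and pulls out the fields under discussion. The helper names `build_search_url` and `summarize` are hypothetical, the sample response is made up to mirror the fallback result described in this report, and the literal double quotes from the original URLs are left out of the search text.

```python
from urllib.parse import urlencode

# Endpoint taken from the report above.
BASE_URL = "https://geocoder.alpha.phac.gc.ca/v1/search"


def build_search_url(text, sources="custom"):
    """Return a URL-encoded /v1/search request for the given free text."""
    return BASE_URL + "?" + urlencode({"text": text, "sources": sources})


def summarize(response):
    """Extract source, layer, confidence and match_type from each feature."""
    return [
        {
            "source": feature["properties"].get("source"),
            "layer": feature["properties"].get("layer"),
            "confidence": feature["properties"].get("confidence"),
            "match_type": feature["properties"].get("match_type"),
        }
        for feature in response.get("features", [])
    ]


# Hypothetical response body, shaped like the fallback result described above.
sample_response = {
    "features": [
        {
            "properties": {
                "source": "whosonfirst",
                "layer": "locality",
                "confidence": 0.6,
                "match_type": "fallback",
            }
        }
    ]
}

url_without_city = build_search_url("283 prince philip dr nl")
url_with_city = build_search_url("283 prince philip dr st john's nl")
summary = summarize(sample_response)
```

Fetching both URLs and passing the decoded JSON bodies to `summarize` would show the two results side by side.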

Environment

We are currently running an instance of the Pelias Geocoder on a Kubernetes cluster on Google Cloud Platform.

Please let us know if you require any additional information to debug this issue. Thanks in advance.

missinglink commented 5 months ago

Hi @gagandeepsingh1105, the 'administrative hierarchy' (ie. the city/province/country) of each record in Pelias is sourced exclusively from the WhosOnFirst dataset through point-in-polygon lookups at index time.

missinglink commented 5 months ago

I believe this is a duplicate of https://github.com/pelias/csv-importer/issues/74

missinglink commented 5 months ago

I'm not against adding this option to custom builds. The issue is that currently all administrative regions are composed of a source, id and term (with an optional abbreviation).

We could use 'custom' as the source, but each admin region would need to have a unique id in order to correctly generate the _gid field.

An autoincrement value could work here but would have the disadvantage that two places in the same area would have differing parent IDs.
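The trade-off can be sketched in Python. The `source:layer:id` form is how Pelias gids are written; everything else here is an assumption for illustration only. In particular, `stable_admin_id` is a hypothetical helper that derives an id from the normalized name, shown only as one possible alternative to autoincrement, not as something csv-importer does.

```python
import hashlib


def make_gid(source, layer, record_id):
    """Join the three parts in the source:layer:id form."""
    return f"{source}:{layer}:{record_id}"


def stable_admin_id(layer, name):
    """Hypothetical helper: derive an id from the normalized name, so the
    same name maps to the same id regardless of CSV row order."""
    key = f"{layer}|{name.strip().lower()}"
    return hashlib.sha1(key.encode("utf-8")).hexdigest()[:12]


# Two CSV rows that lie in the same city.
rows = ["St. John's", "St. John's"]

# Autoincrement: each row gets a different parent id.
auto_gids = [
    make_gid("custom", "locality", i) for i, _ in enumerate(rows, start=1)
]

# Name-derived: both rows share one parent id.
stable_gids = [
    make_gid("custom", "locality", stable_admin_id("locality", name))
    for name in rows
]
```

A name-derived id has its own drawback: two different places that share a name in the same layer would collide unless more context (for example the region) is mixed into the key.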

missinglink commented 5 months ago

It's possible to have multiple associated 'parents' for a single layer, so for example a record can have multiple 'region' records associated.

The issue would be that we only return one (ie. the first one), so it would either need to be decided (or configurable) whether the record from the CSV file was returned, or the WOF one, in the case where both data sources returned a match.
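A configurable preference could look roughly like the sketch below. This is purely illustrative: `pick_parent` and the shape of the candidate records are hypothetical, and the id for the custom region is made up.

```python
def pick_parent(candidates, preferred_source="custom"):
    """Return the candidate from preferred_source if there is one,
    otherwise fall back to the first candidate (or None if empty)."""
    for candidate in candidates:
        if candidate["source"] == preferred_source:
            return candidate
    return candidates[0] if candidates else None


# Hypothetical 'region' parents attached to a single record.
regions = [
    {"source": "whosonfirst", "id": "85682123", "name": "Newfoundland and Labrador"},
    {"source": "custom", "id": "nl", "name": "NL"},
]

prefer_csv = pick_parent(regions, preferred_source="custom")
prefer_wof = pick_parent(regions, preferred_source="whosonfirst")
```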

the-epeecurean commented 5 months ago

Hello,

I am a developer on the original poster's team. I think this is an issue of how WOF is passed back as the first record returned, or how readily it is searched for a 'fallback' match, if a locality name is present despite a focus on a more granular location.

I performed the same two searches in the original post excluding the "sources=custom" filter from the API call and encountered the same behaviour. A search for "283 Prince Philip dr NL" (https://geocoder.alpha.phac.gc.ca/api/search?text="283%20prince%20philip%20dr%20NL") resulted in a match from the custom source with confidence 1.0.

However, a search for "283 Prince Philip dr St. John's NL" results in a match from WOF, and seemingly ignores a filter on the address layer type: https://geocoder.alpha.phac.gc.ca/api/search?text=%22283%20prince%20philip%20dr%20st%20john%27s%20nl%22 OR https://geocoder.alpha.phac.gc.ca/api/search?text=%22283%20prince%20philip%20dr%20st%20john%27s%20nl%22&layers=address

We'd like to use the custom data source in performing batch forward geocoding, and it is useful to pass an 'address, city, province' search term where the inclusion of the city helps refine the search. As identified in the original issue, this does not appear to be what is happening due to the inclusion of the city name.

We understand that WOF is the exclusive source for administrative hierarchy in Pelias, but the inclusion of the place name shouldn't cue the fallback behaviour when an accurate match to the desired layer granularity (street address) is available. In this scenario a street address supplemented by a city name should refine the area for a search, but it seems that it prompts a fallback match instead. It also seems to ignore a layer search filter in the API call when the city name is included, triggering the returned fallback result from WOF.

Thank you for your help!

missinglink commented 5 months ago

The debug query param displays a bunch more info: https://geocoder.alpha.phac.gc.ca/api/search?text=%22283%20prince%20philip%20dr%20st%20john%27s%20nl%22&layers=address&debug=1

You can see that the Placeholder service ran, it found a matching locality:

{
  "controller:placeholder": [
    {
      "id": 890456615,
      "name": "St. John's",
      "placetype": "locality",
      "population": 99182,
      "lineage": [
        {
          "country": {
            "id": 85633041,
            "name": "Canada",
            "abbr": "CAN",
            "languageDefaulted": false
          },
          "county": {
            "id": 1158869009,
            "name": "Division No. 1",
            "languageDefaulted": false
          },
          "locality": {
            "id": 890456615,
            "name": "St. John's",
            "languageDefaulted": false
          },
          "region": {
            "id": 85682123,
            "name": "Newfoundland and Labrador",
            "abbr": "NL",
            "languageDefaulted": false
          }
        }
      ],
      "geom": {
        "bbox": "-52.72931,47.54494,-52.68931,47.58494",
        "lat": 47.56494,
        "lon": -52.70931
      },
      "languageDefaulted": false
    }
  ]
}

Then when the Elasticsearch query is run, the ID of the locality matched above is added as a Filter condition (ie. mandatory condition):

{
  "filter": {
    "bool": {
      "minimum_should_match": 1,
      "should": [
        {
          "terms": {
            "parent.locality_id": [
              "890456615"
            ]
          }
        }
      ],
      "must": [
        {
          "terms": {
            "layer": [
              "address"
            ]
          }
        }
      ]
    }
  }
}

Of course this results in 0 hits:

{
  "controller:search": {
    "queryType": {
      "address_search_using_ids": {
        "es_took": 36,
        "response_time": 42,
        "retries": 0,
        "es_hits": 0,
        "es_result_count": 0
      }
    }
  }
}
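The effect of that filter can be reproduced on plain dictionaries. The sketch below is not the Pelias query code; `matches_filter` is a hypothetical stand-in for the bool filter shown above, and both documents are invented. The first assumes the custom address was indexed without a locality parent, the second shows what a matching document would need.

```python
def matches_filter(doc, locality_ids, layers):
    """Mimic the bool filter: the layer clause is mandatory ('must') and,
    with minimum_should_match=1, the parent.locality_id clause is too."""
    layer_ok = doc.get("layer") in layers
    parent_ids = doc.get("parent", {}).get("locality_id", [])
    locality_ok = any(pid in locality_ids for pid in parent_ids)
    return layer_ok and locality_ok


# Hypothetical custom address with no locality parent assigned.
custom_address = {
    "source": "custom",
    "layer": "address",
    "parent": {"region_id": ["85682123"]},
}

# Hypothetical address that did receive the locality parent.
address_with_locality = {
    "source": "custom",
    "layer": "address",
    "parent": {"region_id": ["85682123"], "locality_id": ["890456615"]},
}

hits = [
    doc
    for doc in [custom_address]
    if matches_filter(doc, locality_ids=["890456615"], layers=["address"])
]
```

With only `custom_address` in the index, `hits` is empty, which corresponds to the `es_hits: 0` in the debug output.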

At this point there are zero matches. I forget the exact workflow here, but I believe it falls back to a legacy search method which was more lenient.

I don't like that the request specifies only address layers but returns other layers. This is likely a bug, but one which doesn't often occur outside of custom installations such as this.

missinglink commented 5 months ago

The geometry of 890456615 (St. John's) is of type Point, which explains why the address wasn't associated via the PIP service (the address must lie inside the boundary).
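A small ray-casting sketch in Python shows why a Point geometry can never become a parent. This is not the PIP service's implementation. The address coordinate is hypothetical, a Point is modelled as an empty ring, and the rectangle is simply the bbox from the debug output used as a stand-in polygon, not a real boundary.

```python
def point_in_polygon(lon, lat, ring):
    """Ray-casting test: count how many ring edges a horizontal ray
    from (lon, lat) crosses; an odd count means the point is inside."""
    inside = False
    n = len(ring)
    for i in range(n):
        x1, y1 = ring[i]
        x2, y2 = ring[(i + 1) % n]
        # Only edges that straddle the ray's latitude can be crossed.
        if (y1 > lat) != (y2 > lat):
            x_cross = x1 + (lat - y1) * (x2 - x1) / (y2 - y1)
            if lon < x_cross:
                inside = not inside
    return inside


# Hypothetical address coordinate (lon, lat).
address = (-52.70, 47.57)

# A Point geometry has no area, so there is no ring to test against.
point_geometry_ring = []

# Stand-in polygon: the bbox from the debug output above.
bbox_ring = [
    (-52.72931, 47.54494),
    (-52.68931, 47.54494),
    (-52.68931, 47.58494),
    (-52.72931, 47.58494),
]

in_point = point_in_polygon(address[0], address[1], point_geometry_ring)
in_polygon = point_in_polygon(address[0], address[1], bbox_ring)
```

Here `in_point` is `False` and `in_polygon` is `True`: without a polygon there is nothing for an address to fall inside, however close it is to the locality's centroid.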

missinglink commented 5 months ago

Maybe for your use case you can disable the Placeholder service, or possibly don't add any data to it? I haven't tested it, but it might prevent the filter condition being added to the Elasticsearch query, which sounds like what you want.
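One way to try this is to drop the Placeholder entry from the API's service configuration. The Python sketch below edits a pelias.json-style dictionary. It assumes the API reads service definitions from `api.services` and treats a service with no entry as disabled; like the suggestion itself, this is untested. The URLs in the sample config are placeholders, and `without_placeholder` is a hypothetical helper.

```python
import copy


def without_placeholder(config):
    """Return a copy of a pelias.json-style dict with the
    api.services.placeholder entry removed (assumed to disable it)."""
    result = copy.deepcopy(config)
    result.get("api", {}).get("services", {}).pop("placeholder", None)
    return result


# Hypothetical configuration; the URLs are illustrative only.
config = {
    "api": {
        "services": {
            "placeholder": {"url": "http://placeholder:4100"},
            "pip": {"url": "http://pip:4200"},
        }
    }
}

trimmed = without_placeholder(config)
```

The original dictionary is left untouched, so the change can be compared against the current config before writing it back to disk.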

missinglink commented 5 months ago

@the-epeecurean are there better open geo data for that region?

The only one I can find is points only. Does the CA govt publish something better than this? https://opendata.gov.nl.ca/public/opendata/page/?page-id=datasetdetails&id=265

the-epeecurean commented 5 months ago

@missinglink There are ... Statistics Canada publishes a hierarchy of delineated boundaries. I've just been evaluating some cherry-picked WOF 'fallback' results we've been seeing in testing.

Here's a link to an open REST point for the collected Cartographic Boundary files published by Statistics Canada: https://geo.statcan.gc.ca/geo_wa/rest/services/2021/Cartographic_boundary_files/MapServer

And a reference to descriptions of the Cartographic Boundary files made available (at the bottom under "1. Spatial information products"): https://www150.statcan.gc.ca/n1/pub/92-196-x/92-196-x2021001-eng.htm

A polygon for the example cited in the Issue above (St. John's NL) appears at the CSD (census subdivision) and CMA (census metropolitan area) levels. However, some smaller localities (within a larger CMA, e.g., Halifax, NS) show up as polygons in the DPL (designated place) boundary file.

If there is any way that we could help in facilitating this spatial information being included in WOF, please let us know. It would help our usecase greatly to see a broader capture of localities in Canada represented as polygons.

nvkelso commented 5 months ago

Adding an issue upstream in Who's On First to help facilitate this work:

tl;dr the new 2021 cartographic boundary files from Stats Canada look great and we'd love to import them!