pelias / docker

Run the Pelias geocoder in docker containers, including example projects.
MIT License

OpenAddress data missing in local Pelias build #290

Closed JosephKuchar closed 2 years ago

JosephKuchar commented 2 years ago

Describe the bug

I've recently completed a Canada-specific local build of Pelias. To do this I collected all of the Canadian OpenAddresses data and stored it in DATA_DIR/openaddresses, and the pelias import oa step seems to have run successfully (at least, it creates read streams for all the sources and doesn't produce any errors). However, Pelias is not finding data that I know to be in OpenAddresses. For example, if I query "100 Ilsley Avenue, Dartmouth, NS", Pelias doesn't return it, even though this address is in the provincial-level Nova Scotia data in OpenAddresses:

299165,15c910dcb88ee5d9,100,Ilsley Ave,,Dartmouth,Halifax County,,,,44.699492,-63.587249

Interestingly, the same street address can be found when no city is specified.

I've tested other addresses that I've pulled sort of at random out of the openaddresses csvs, and Pelias returns fallback-type or interpolation matches from OSM instead of giving the exact matches that it theoretically should have available.

This is a sample of the output from running the data import step:

info: [wof-pip-service:master] starting with layers neighbourhood,borough,locality,localadmin,county,macrocounty,macroregion,region,dependency,country,empire,continent
info: [openaddresses] Importing 165 files.
info: [openaddresses] Creating read stream for: /data/openaddresses/nb_city_of_moncton.csv
info: [openaddresses] Creating read stream for: /data/openaddresses/on_northumberland.csv
info: [openaddresses] Creating read stream for: /data/openaddresses/on_oshawa.csv
info: [wof-pip-service:master] borough worker loaded 40 features in 1.876 seconds
info: [wof-pip-service:master] localadmin worker loaded 0 features in 1.821 seconds
info: [wof-pip-service:master] macrocounty worker loaded 0 features in 1.813 seconds
info: [wof-pip-service:master] macroregion worker loaded 0 features in 1.792 seconds
info: [wof-pip-service:master] dependency worker loaded 0 features in 1.847 seconds
info: [wof-pip-service:master] empire worker loaded 0 features in 1.842 seconds
info: [wof-pip-service:master] continent worker loaded 0 features in 1.784 seconds
info: [wof-pip-service:master] country worker loaded 1 features in 1.996 seconds
info: [wof-pip-service:master] region worker loaded 13 features in 2.208 seconds
info: [wof-pip-service:master] neighbourhood worker loaded 1825 features in 2.498 seconds
info: [openaddresses] Creating read stream for: /data/openaddresses/ab_city_of_red_deer.csv
info: [wof-pip-service:master] county worker loaded 359 features in 4.014 seconds
info: [openaddresses] Creating read stream for: /data/openaddresses/bc_city_of_new_westminster.csv
info: [openaddresses] Creating read stream for: /data/openaddresses/on_dufferin.csv
info: [openaddresses] Creating read stream for: /data/openaddresses/nb_city_of_st_john.csv
...
info: [openaddresses] Creating read stream for: /data/openaddresses/on_greater_sudbury.csv
info: [openaddresses] Creating read stream for: /data/openaddresses/on_township_of_durham.csv
info: [openaddresses] Creating read stream for: /data/openaddresses/ab_calgary.csv
info: [wof-admin-lookup] Shutting down admin lookup service
info: [admin-lookup:worker] continent worker process exiting, stats: {"calls":0,"hits":0,"misses":0}
info: [admin-lookup:worker] macroregion worker process exiting, stats: {"calls":0,"hits":0,"misses":0}
info: [admin-lookup:worker] macrocounty worker process exiting, stats: {"calls":0,"hits":0,"misses":0}
info: [admin-lookup:worker] borough worker process exiting, stats: {"calls":0,"hits":0,"misses":0}
info: [admin-lookup:worker] empire worker process exiting, stats: {"calls":0,"hits":0,"misses":0}
info: [admin-lookup:worker] region worker process exiting, stats: {"calls":0,"hits":0,"misses":0}
info: [admin-lookup:worker] dependency worker process exiting, stats: {"calls":0,"hits":0,"misses":0}
info: [admin-lookup:worker] country worker process exiting, stats: {"calls":0,"hits":0,"misses":0}
info: [admin-lookup:worker] county worker process exiting, stats: {"calls":0,"hits":0,"misses":0}
info: [admin-lookup:worker] localadmin worker process exiting, stats: {"calls":0,"hits":0,"misses":0}
info: [admin-lookup:worker] neighbourhood worker process exiting, stats: {"calls":0,"hits":0,"misses":0}
info: [admin-lookup:worker] locality worker process exiting, stats: {"calls":0,"hits":0,"misses":0}
info: [openaddresses] Total time taken: 129.326s

I appreciate any help!

missinglink commented 2 years ago

Please post your pelias.json

missinglink commented 2 years ago

The admin info from OA is discarded; we assign a consistent hierarchy with GIDs using point-in-polygon lookups during import (pip-service).

The final log lines you posted show that the PIP service failed to assign any admin info.

This could be for several reasons, I'd need to see your config to confirm.

It's also worth spot-checking that the lat and lon values from OA are in the correct order; they've had bugs with that in the past.
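
To illustrate what a point-in-polygon lookup does and why coordinate order matters, here is a minimal ray-casting sketch in Python. It is not the pip-service implementation; the function name `point_in_polygon` and the rectangle around Dartmouth, NS are made up for illustration, and only the coordinates come from the OA row quoted in the issue.

```python
def point_in_polygon(lon, lat, polygon):
    """Ray-casting test; polygon is a list of (lon, lat) vertices."""
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Does a horizontal ray going east from the point cross this edge?
        if (y1 > lat) != (y2 > lat):
            x_cross = x1 + (lat - y1) * (x2 - x1) / (y2 - y1)
            if lon < x_cross:
                inside = not inside
    return inside

# Hypothetical rectangle for illustration, NOT a real admin polygon.
dartmouth_box = [(-63.7, 44.6), (-63.4, 44.6), (-63.4, 44.8), (-63.7, 44.8)]

lat, lon = 44.699492, -63.587249  # from the OA row quoted in the issue
correct = point_in_polygon(lon, lat, dartmouth_box)
swapped = point_in_polygon(lat, lon, dartmouth_box)  # lat/lon reversed
print(correct, swapped)
```

With the coordinates in the right order the point falls inside the rectangle; with them reversed it falls outside every polygon, which is the failure mode the spot check is meant to catch.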

missinglink commented 2 years ago

You may also find the compare app useful for debugging and sharing queries: https://pelias.github.io/compare/#/v1/autocomplete?text=100+Ilsley+Avenue%2C+NS%2C+Canada

JosephKuchar commented 2 years ago

Thanks for the quick response! I've pasted the contents of the config file below (I cut out most of the lines of OA data, since there are about 150 of them). I modified the Portland project to construct this one. I notice that one thing I neglected to change is the focus point; is that relevant here?

{
  "logger": {
    "level": "info",
    "timestamp": false
  },
  "esclient": {
    "apiVersion": "7.5",
    "hosts": [
      { "host": "elasticsearch" }
    ]
  },
  "elasticsearch": {
    "settings": {
      "index": {
        "refresh_interval": "10s",
        "number_of_replicas": "0",
        "number_of_shards": "1"
      }
    }
  },
  "acceptance-tests": {
    "endpoints": {
      "docker": "http://api:4000/v1/"
    }
  },
  "api": {
    "services": {
      "placeholder": { "url": "http://placeholder:4100" },
      "pip": { "url": "http://pip:4200" },
      "interpolation": { "url": "http://interpolation:4300" },
      "libpostal": { "url": "http://libpostal:4400" }
    },
    "defaultParameters": {
      "focus.point.lat": 45.52,
      "focus.point.lon": -122.67
    }
  },
  "imports": {
    "adminLookup": {
      "enabled": true
    },
    "blacklist": {
      "files": [
        "/data/blacklist/osm.txt"
      ]
    },
    "csv": {
      "datapath": "/data/csv",
      "files": [],
      "download": [
        "https://raw.githubusercontent.com/pelias/csv-importer/master/data/example.csv"
      ]
    },
    "geonames": {
      "datapath": "/data/geonames",
      "countryCode": "CA"
    },
    "openstreetmap": {
      "download": [
        { "sourceURL": "https://download.geofabrik.de/north-america/canada-latest.osm.pbf" }
      ],
      "leveldbpath": "/tmp",
      "datapath": "/data/openstreetmap",
      "import": [{
        "filename": "canada-latest.osm.pbf"
      }]
    },
    "openaddresses": {
      "datapath": "/data/openaddresses",
      "files": [
        "nb_city_of_moncton.csv",
        "on_northumberland.csv",
        "on_oshawa.csv",
        ...
        "ab_calgary.csv"
      ]
    },
    "polyline": {
      "datapath": "/data/polylines",
      "files": [ "extract.0sv" ]
    },
    "whosonfirst": {
      "datapath": "/data/whosonfirst",
      "importPostalcodes": true,
      "countryCode": "CA",
      "importPlace": [
        "85633041"
      ]
    }
  }
}

missinglink commented 2 years ago

That all looks fine, specifically imports.adminLookup.enabled=true.

You should definitely delete the api.defaultParameters section you copied from Portland completely, although that's tangential.

There's something weird going on here...

So in the log you posted I would expect to see a line saying locality worker loaded... with a decent number, I'm assuming that this line was present in the original log but truncated when you removed all the sources for brevity.

The main issue here is indicated by the {"calls":0,"hits":0,"misses":0} lines; these indicate that none of the admin polygons loaded spatially intersected with any of the OA rows.
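
As a rough way to spot this condition in a long import log, the worker exit lines can be parsed and checked for any non-zero `calls` count. This is only a sketch: the regular expression is written against the log lines quoted in this thread, and the helper names `parse_worker_stats` and `lookups_were_attempted` are made up for illustration.

```python
import json
import re

# Pattern based on the log lines quoted in this thread; other versions may differ.
STATS_RE = re.compile(
    r"\[admin-lookup:worker\] (\w+) worker process exiting, stats: (\{.*\})"
)

def parse_worker_stats(log_text):
    """Return {layer: stats_dict} for each admin-lookup worker exit line."""
    stats = {}
    for line in log_text.splitlines():
        match = STATS_RE.search(line)
        if match:
            stats[match.group(1)] = json.loads(match.group(2))
    return stats

def lookups_were_attempted(stats):
    """True if at least one worker recorded a lookup call."""
    return any(s["calls"] > 0 for s in stats.values())

sample = """\
info: [admin-lookup:worker] region worker process exiting, stats: {"calls":0,"hits":0,"misses":0}
info: [admin-lookup:worker] locality worker process exiting, stats: {"calls":0,"hits":0,"misses":0}
"""
stats = parse_worker_stats(sample)
print(lookups_were_attempted(stats))
```

If this prints `False` for a whole import, no row received any admin assignment, which matches the log posted above.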

This is very unusual. My first intuition was that the admin polygons aren't being loaded correctly, but I can see that you have region worker loaded 13 features, so at the very least we'd expect to see provinces assigned.

So yeah, like I said before, it could be that the lat/lon values are funky in the OA data. There are two ways you can confirm this. First, have a look at the GeoJSON response you're getting back: the Point geometry dimension order is [lon, lat]; check this is correct.

The other way to check is to locate your data directory on the host (identified by the DATA_DIR env var), go into the OA directory, and post the top ten lines of one of the OA files here. Again, I'm just sanity-checking that the lat/lon columns are defined the right way around.
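
The same sanity check can be scripted instead of done by eye. The sketch below assumes the CSV has named latitude and longitude columns (the names are passed in, since they vary between files) and uses a rough bounding box for Canada that is an approximation chosen for this example; `suspicious_rows` is a made-up helper name.

```python
import csv
import io

# Rough bounding box for Canada (an approximation, for sanity checks only).
CANADA = {"lat": (41.0, 84.0), "lon": (-142.0, -52.0)}

def suspicious_rows(csv_file, lat_col, lon_col, limit=10):
    """Return up to `limit` rows whose coordinates fall outside CANADA."""
    bad = []
    for row in csv.DictReader(csv_file):
        lat, lon = float(row[lat_col]), float(row[lon_col])
        lat_ok = CANADA["lat"][0] <= lat <= CANADA["lat"][1]
        lon_ok = CANADA["lon"][0] <= lon <= CANADA["lon"][1]
        if not (lat_ok and lon_ok):
            bad.append(row)
            if len(bad) >= limit:
                break
    return bad

# Illustrative data: the first row uses the coordinates from the issue,
# the second has them deliberately swapped.
sample = io.StringIO(
    "number,street,lat,lon\n"
    "100,Ilsley Ave,44.699492,-63.587249\n"
    "100,Ilsley Ave,-63.587249,44.699492\n"
)
bad = suspicious_rows(sample, "lat", "lon")
print(len(bad))
```

For a real file, pass an open file handle instead of the `io.StringIO` sample. Any row reported here is worth inspecting before blaming the admin polygons.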

missinglink commented 2 years ago

One other thing which I just noticed is how you're defining your OA sources.

In the Portland project they look like this "us/or/portland_metro.csv" but in yours they look like this "nb_city_of_moncton.csv", is that correct?

Shouldn't they look more like "ca/nb/city_of_moncton.csv"?

JosephKuchar commented 2 years ago

You're right, there is a locality worker loaded line that I truncated:

...
info: [wof-pip-service:master] locality worker loaded 6172 features in 7.636 seconds
info: [wof-pip-service:master] PIP Service Loading Completed!!!
info: [openaddresses] Creating read stream for: /data/openaddresses/bc_city_of_courtenay.csv
...

The paths I specified are correct, I placed all the OA data into a single directory. Does Pelias expect a certain file structure? If so I can make it, but as is they're all in the same folder.

Maybe the path has been specified incorrectly? I just noticed that in the pelias configuration file I specify data/openaddresses, but the DATA_DIR is pelias-test/data/ - so is it interpreting it as data/data/openaddresses? But the same is true of every other data source, and they all seem to have been correctly imported.

The CSVs for open address data look fine to me, here's an excerpt below. Lat and lon are properly defined.

,hash,number,street,unit,city,district,region,postcode,id,lat,lon
0,7a715522c0e3c266,34,Armitage Crescent,N,Ajax,,,,,43.8806185,-79.0361826
1,904f3c3a5e3dc4e4,36,Armitage Crescent,N,Ajax,,,,,43.8806576,-79.0360722
2,f52fbb7802e7b967,40,Armitage Crescent,N,Ajax,,,,,43.880759,-79.0358346

JosephKuchar commented 2 years ago

Well, I tried specifying the directory as openaddresses instead of data/openaddresses, and that resulted in a directory not found error, so I don't think it's a path problem. I also tested putting one of the files into the standard open addresses format (ca/ab/calgary.csv), that didn't do anything either. It seems like the files are being read, but not being processed.

missinglink commented 2 years ago

The CSVs for open address data look fine to me,

What's that additional column on the left with no column header?

missinglink commented 2 years ago

What is the output of pelias elastic stats?

missinglink commented 2 years ago

Maybe the path has been specified incorrectly?

I suspect your paths are correct due to the log line info: [openaddresses] Importing 165 files.

JosephKuchar commented 2 years ago

The output from pelias elastic stats is

{
  "took" : 2188,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 10000,
      "relation" : "gte"
    },
    "max_score" : null,
    "hits" : [ ]
  },
  "aggregations" : {
    "sources" : {
      "doc_count_error_upper_bound" : 0,
      "sum_other_doc_count" : 0,
      "buckets" : [

        {
          "key" : "openstreetmap",
          "doc_count" : 14999792,
          "layers" : {
            "doc_count_error_upper_bound" : 0,
            "sum_other_doc_count" : 0,
            "buckets" : [
              {
                "key" : "street",
                "doc_count" : 9760995
              },
              {
                "key" : "address",
                "doc_count" : 4211212
              },
              {
                "key" : "venue",
                "doc_count" : 1027585
              }
            ]
          }
        },
        {
          "key" : "whosonfirst",
          "doc_count" : 836626,
          "layers" : {
            "doc_count_error_upper_bound" : 0,
            "sum_other_doc_count" : 0,
            "buckets" : [
              {
                "key" : "postalcode",
                "doc_count" : 809729
              },
              {
                "key" : "locality",
                "doc_count" : 23276
              },
              {
                "key" : "neighbourhood",
                "doc_count" : 2984
              },
              {
                "key" : "county",
                "doc_count" : 359
              },
              {
                "key" : "macrohood",
                "doc_count" : 119
              },
              {
                "key" : "localadmin",
                "doc_count" : 105
              },
              {
                "key" : "borough",
                "doc_count" : 40
              },
              {
                "key" : "region",
                "doc_count" : 13
              },
              {
                "key" : "country",
                "doc_count" : 1
              }
            ]
          }
        },
        {
          "key" : "pelias",
          "doc_count" : 3,
          "layers" : {
            "doc_count_error_upper_bound" : 0,
            "sum_other_doc_count" : 0,
            "buckets" : [
              {
                "key" : "address",
                "doc_count" : 1
              },
              {
                "key" : "example_layer",
                "doc_count" : 1
              },
              {
                "key" : "with_custom_data",
                "doc_count" : 1
              }
            ]
          }
        }
      ]
    }
  }
}

JosephKuchar commented 2 years ago

As for the leading column in the CSVs, that seems to be an artefact from using geopandas and pandas to convert the geojsons to CSVs and forgetting to set index=False when I wrote out the csvs. I'll try again after removing that column.

JosephKuchar commented 2 years ago

Update: Removing the pandas default index column didn't change anything. I notice in the elastic output that openaddresses isn't listed at all.
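
For anyone repeating this check, the source buckets can be pulled out of the stats response programmatically. This is a sketch that assumes the response has the shape posted above (`aggregations.sources.buckets`); `indexed_sources` is a made-up helper name, and the embedded JSON is a trimmed copy of that response with the layer buckets left out.

```python
import json

def indexed_sources(stats_json):
    """Map each source name to its document count."""
    buckets = json.loads(stats_json)["aggregations"]["sources"]["buckets"]
    return {b["key"]: b["doc_count"] for b in buckets}

# Trimmed copy of the response posted above (layer buckets omitted).
response = """
{"aggregations": {"sources": {"buckets": [
  {"key": "openstreetmap", "doc_count": 14999792},
  {"key": "whosonfirst", "doc_count": 836626},
  {"key": "pelias", "doc_count": 3}
]}}}
"""
sources = indexed_sources(response)
expected = ["openstreetmap", "whosonfirst", "openaddresses"]
missing = [name for name in expected if name not in sources]
print(missing)
```

A source that appears in `missing` had no documents indexed at all, which points at the import step rather than at query handling.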

missinglink commented 2 years ago

It's quite frustrating trying to debug in a GitHub issue. Can you please clean up the code and open a draft PR to add a new Canada project? From there I can build your config and we can comment on the PR thread.

JosephKuchar commented 2 years ago

Thanks for the help! I've actually solved it. This issue is closed, but I wanted to comment because this might apply to someone else in the future. I turned on the debugging option which gave me more info, and tested with just one file at a time. I saw that it was reading in the file, but skipping over all the lines, and giving these messages:

verbose: [openaddresses] number of invalid records skipped: 384170
info: [wof-admin-lookup] Shutting down admin lookup service
info: [wof-admin-lookup] Ensure your input file is valid before retrying

I looked into the code in the pelias/openaddresses repo and saw that all the column names it references are capitalised, which is also my recollection of OpenAddresses CSVs. However, the CSVs that I converted from GeoJSON didn't satisfy this. I changed the column names to be capitalised, and now it seems to have run properly:

verbose: [openaddresses] number of invalid records skipped: 0
info: [wof-admin-lookup] Shutting down admin lookup service
info: [admin-lookup:worker] region worker process exiting, stats: {"calls":0,"hits":0,"misses":0}
info: [admin-lookup:worker] localadmin worker process exiting, stats: {"calls":1,"hits":0,"misses":1}
info: [admin-lookup:worker] borough worker process exiting, stats: {"calls":384170,"hits":0,"misses":384170}
info: [admin-lookup:worker] dependency worker process exiting, stats: {"calls":0,"hits":0,"misses":0}
info: [admin-lookup:worker] locality worker process exiting, stats: {"calls":384170,"hits":384169,"misses":1}
info: [admin-lookup:worker] continent worker process exiting, stats: {"calls":0,"hits":0,"misses":0}
info: [admin-lookup:worker] macrocounty worker process exiting, stats: {"calls":0,"hits":0,"misses":0}
info: [admin-lookup:worker] country worker process exiting, stats: {"calls":0,"hits":0,"misses":0}
info: [admin-lookup:worker] empire worker process exiting, stats: {"calls":0,"hits":0,"misses":0}
info: [admin-lookup:worker] neighbourhood worker process exiting, stats: {"calls":384170,"hits":382498,"misses":1672}
info: [admin-lookup:worker] macroregion worker process exiting, stats: {"calls":0,"hits":0,"misses":0}
info: [admin-lookup:worker] county worker process exiting, stats: {"calls":1,"hits":1,"misses":0}
info: [dbclient-openaddresses]  paused=false, transient=0, current_length=0, indexed=384170, batch_ok=769, batch_retries=0, failed_records=0, address=384170, persec=2467
info: [dbclient-openaddresses]  paused=false, transient=0, current_length=0, indexed=384170, batch_ok=769, batch_retries=0, failed_records=0, address=384170, persec=2467
info: [openaddresses] Total time taken: 79.107s
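
For reference, the header fix described above can be sketched with the standard library. This assumes "capitalised" means fully upper-cased column names, as in the classic OpenAddresses CSV header (LON, LAT, NUMBER, STREET, ...), and it also drops the unnamed index column that pandas writes by default; `normalise_oa_csv` is a made-up helper name, not part of Pelias.

```python
import csv
import io

def normalise_oa_csv(src, dst):
    """Copy src to dst, upper-casing headers and dropping unnamed columns."""
    reader = csv.reader(src)
    header = next(reader)
    # Keep only columns that have a name (pandas writes its index unnamed).
    keep = [i for i, name in enumerate(header) if name.strip()]
    writer = csv.writer(dst, lineterminator="\n")
    writer.writerow([header[i].strip().upper() for i in keep])
    for row in reader:
        writer.writerow([row[i] for i in keep])

# Header and first row taken from the excerpt posted earlier in this thread.
src = io.StringIO(
    ",hash,number,street,unit,city,district,region,postcode,id,lat,lon\n"
    "0,7a715522c0e3c266,34,Armitage Crescent,N,Ajax,,,,,43.8806185,-79.0361826\n"
)
dst = io.StringIO()
normalise_oa_csv(src, dst)
print(dst.getvalue())
```

For real files, open the source and destination with `newline=""` as the `csv` module documentation recommends. When writing from pandas directly, passing `index=False` to `to_csv` avoids the extra column in the first place.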

Thanks for your help!

missinglink commented 2 years ago

agh cool, glad you solved it ;)