That's interesting; the schema hasn't changed recently, and which fields get populated shouldn't depend on the volume of data.
What I suspect is that you're importing different versions of WOF data between the builds, and these differences account for the changes.
It's also possible that the code has changed between builds, but I looked and couldn't see anything that seemed related (we're working on something right now, but it isn't merged yet).
Finally, it could be that your configurations are different: maybe you're running different versions of the docker containers or using different settings in pelias.json?
When you say performance, are you referring to latency (CPU performance) or result quality?
In the future, can you please paste your JSON blobs as pretty-printed JSON? We're volunteering our time and it's very difficult to read a massive blob of text.
Thanks @missinglink for your answer. Now that I think about it, the versions are definitely different between the two builds, and the planet build is more recent. You also mentioned other reasons which could be contributing to this; I'm going to carefully try each one and see whether it's to blame.
Sorry for the ambiguity; by performance, I meant result quality.
I am so sorry for pasting those JSONs like that. I actually pasted the pretty-printed JSONs inside backticks, but when I post (or hit preview to see it before posting), the editor ignores all the new lines. I gotta google this and see if there's a way to keep the new lines there!
Update: Well, I learned something new about GitHub's markup :D All the JSONs are pretty printed now. Sorry again for all the trouble you went through reading those scary lines of JSON. Never gonna happen again!
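In case anyone else hits this: the trick is a fenced code block (triple backticks, with an optional language hint), which preserves the newlines and adds syntax highlighting:

````markdown
```json
{
  "pretty": "printed"
}
```
````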
@missinglink I further investigated this issue, thanks to your helpful reply, and I believe I might have found a bug. I will try to describe my test plan in full detail so that we can figure this out. To make sure that the version of Pelias was not causing this problem, I created an EC2 instance and pulled the latest images directly from Pelias's docker repo. Then I did the following:
1- Created two Amazon Elasticsearch Service (AES) instances and modified the pelias.json file inside the planet and north-america projects to point to these two instances (one for each; a sketch of that change follows after step 2). Let's call them AES-planet and AES-na.
2- Noticed the configuration for whosonfirst in the planet project looks like this:
"whosonfirst": {
"datapath": "/data/whosonfirst",
"importVenues": false,
"importPostalcodes": true
}
and the same configuration for north america looks like this:
"whosonfirst": {
"datapath": "/data/whosonfirst",
"importPostalcodes": true,
"importPlace": "102191575"
}
So far, this makes sense: for north america we're only going to download a portion of the whole WOF data, hence the importPlace setting. I also checked the codebase, wondering about importVenues, which is false for planet but not specified for north america, but then I figured out that if not specified, the default value is false.
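Going back to step 1: the only pelias.json change in each project was pointing esclient at the corresponding AES endpoint, roughly like this (the hostname below is a placeholder, and the https/443 settings are just what AES endpoints typically expect):

```json
"esclient": {
  "hosts": [
    {
      "protocol": "https",
      "host": "<your-aes-domain>.us-east-1.es.amazonaws.com",
      "port": 443
    }
  ]
}
```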
Now, things start to get interesting.
3- Downloaded WOF for planet using `pelias download wof` and then imported it into AES-planet. 3,572,815 documents were indexed. Then, concerned about the same address mentioned in this issue (OF WASHINGTON DC 11901 BRADDOCK RD FAIRFAX,VA 22030), I accessed the Kibana interface and filtered the data as below:
`source: whosonfirst, parent.region: virginia`
and then searched for the word Fairfax and got 0 results!
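For clarity, that Kibana filter plus free-text search corresponds roughly to this Elasticsearch query (a sketch: using `name.default` for the free-text part is my assumption based on the Pelias schema):

```json
{
  "query": {
    "bool": {
      "filter": [
        { "term": { "source": "whosonfirst" } },
        { "term": { "parent.region": "virginia" } }
      ],
      "must": [
        { "match": { "name.default": "fairfax" } }
      ]
    }
  }
}
```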
4- Did the same steps for north america and got 1,371,851 documents indexed in AES-na. This time, I accessed the Kibana interface for AES-na, added the same filters, searched for Fairfax, and boom: about 500 documents were returned.
Here are some differences I noticed in the files downloaded for WOF by the north america project vs. the planet project. The north america logs reference whosonfirst-sqlite downloads:
```
[whosonfirst-sqlite-download] https://dist.whosonfirst.org/sqlite/whosonfirst-data-latest.db.bz2
[whosonfirst-sqlite-decompress] /data/whosonfirst/sqlite/whosonfirst-data-latest.db.bz2
[whosonfirst-sqlite-download] https://dist.whosonfirst.org/sqlite/whosonfirst-data-postalcode-vi-latest.db.bz2
[whosonfirst-sqlite-download] https://dist.whosonfirst.org/sqlite/whosonfirst-data-postalcode-pa-latest.db.bz2
[whosonfirst-sqlite-download] https://dist.whosonfirst.org/sqlite/whosonfirst-data-postalcode-sx-latest.db.bz2
[whosonfirst-sqlite-download] https://dist.whosonfirst.org/sqlite/whosonfirst-data-postalcode-bl-latest.db.bz2
```
Meanwhile, in the logs for the planet project's attempt at downloading WOF, I saw no reference to whosonfirst-sqlite. Here are the first few lines of those logs:
```
Downloading whosonfirst-data-ocean-latest.tar.bz2 bundle
Downloading whosonfirst-data-marinearea-latest.tar.bz2 bundle
Downloading whosonfirst-data-continent-latest.tar.bz2 bundle
Downloading whosonfirst-data-empire-latest.tar.bz2 bundle
done downloading whosonfirst-data-ocean-latest.tar.bz2 bundle
Downloading whosonfirst-data-country-latest.tar.bz2 bundle
done downloading whosonfirst-data-empire-latest.tar.bz2 bundle
Downloading whosonfirst-data-dependency-latest.tar.bz2 bundle
done downloading whosonfirst-data-marinearea-latest.tar.bz2 bundle
Downloading whosonfirst-data-disputed-latest.tar.bz2 bundle
done downloading whosonfirst-data-disputed-latest.tar.bz2 bundle
Downloading whosonfirst-data-macroregion-latest.tar.bz2 bundle
done downloading whosonfirst-data-dependency-latest.tar.bz2 bundle
Downloading whosonfirst-data-region-latest.tar.bz2 bundle
done downloading whosonfirst-data-continent-latest.tar.bz2 bundle
Downloading whosonfirst-data-macrocounty-latest.tar.bz2 bundle
done downloading whosonfirst-data-macroregion-latest.tar.bz2 bundle
Downloading whosonfirst-data-county-latest.tar.bz2 bundle
done downloading whosonfirst-data-macrocounty-latest.tar.bz2 bundle
Downloading whosonfirst-data-macrocounty-latest.tar.bz2 bundle
done downloading whosonfirst-data-macrocounty-latest.tar.bz2 bundle
Downloading whosonfirst-data-localadmin-latest.tar.bz2 bundle
done downloading whosonfirst-data-country-latest.tar.bz2 bundle
Downloading whosonfirst-data-locality-latest.tar.bz2 bundle
```
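If it helps narrow things down, my hedged guess is that the two projects exercise different download paths: SQLite databases when importPlace is set, and tar.bz2 bundles otherwise. If the importer supports forcing the SQLite path via a flag in pelias.json, something like this might be worth trying for planet too (the flag name `sqlite` is my assumption from skimming the pelias/whosonfirst importer, not something I've verified):

```json
"whosonfirst": {
  "datapath": "/data/whosonfirst",
  "importVenues": false,
  "importPostalcodes": true,
  "sqlite": true
}
```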
I hope these details help figure out what is going on. It could simply be me forgetting to do a step for the planet build, or it could be an existing bug.
I also want to add that the performance (accuracy in geocoding addresses) of our north america build for the same 200 addresses is around 90%, which is amazing, but due to the problem mentioned here, our planet build is at around 60%.
Hi folks, just came across this old issue. It was caused by corrupt data hosted by the old Who's on First data download service. Since https://github.com/pelias/whosonfirst/pull/487 back in April, Geocode Earth has been building and hosting this data (and it is corruption-free! :) ), and all importers should use that data.
Hey team!
For testing purposes, we decided to build a north america version of Pelias to be able to geocode US addresses only, and we got really, really good performance. But we have a planet build as well, and when we ran the same addresses through it, the performance was not good at all. Not even close to what we got from the north america build.
We were curious to figure out what could cause this degradation in our planet build, so we dove into querying the Elasticsearch index directly. We queried the same source_ids through Kibana on both Elasticsearch instances and noticed that the north america document has more fields in its schema than its planet counterpart. The fields missing from the planet build are: `parent.county`, `parent.county_a`, `parent.county_id`, `parent.locality`, `parent.locality_a`, and `parent.locality_id`.
Because these fields are missing from the planet index, the same address that can be geocoded in our north america build returns a less accurate result (only to the city level) in our planet build.
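To illustrate, the parent block on the north america document looks roughly like this, while the planet document stops at the region level (the values and IDs below are placeholders, not copied from our index):

```json
"parent": {
  "region": ["Virginia"],
  "region_a": ["VA"],
  "county": ["Fairfax"],
  "county_id": ["<wof-id>"],
  "locality": ["Fairfax"],
  "locality_id": ["<wof-id>"]
}
```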
I am wondering: why would the same build process make the schemas of the two builds so different? Another thing we tried was to query the exact address against your API provided through geocode.earth; quite interestingly, it returned the same response that we got from our own planet build, not an exact match.
For more clarity, I'm going to add example addresses along with the JSON responses that I get from our north america build and the planet build:
Address: OF WASHINGTON DC 11901 BRADDOCK RD FAIRFAX,VA 22030
north america build's response:
planet build's response:
But for this address: 4000 MERIDIAN BLVD STE 750, FRANKLIN TN 37067, our planet build has all those parent fields that were missing from the previous response. Here's the response for this address from the planet build: