Closed lonvia closed 3 years ago
@lonvia, I want to work on this issue. I checked the linked files in the description, but it seems their location has changed. Can you please guide me with this issue?
@Vineet-Sharma29 Have you read https://wiki.openstreetmap.org/wiki/User:Lonvia/GSoC_2021_Nominatim_Projects already?
@lonvia I have read the guide and installed Nominatim using docker.
I've updated the links in the issue, although I'm sure you have found the code yourself by now.
I will work on this issue if it is ok for everyone.
As @lonvia has already ported the computation of the postcode table to Python, the only part that should be changed for this is the computation of the list of valid postcodes from placex. Am I right?
For the missing country_code problem, could we, for example, look it up in placex by joining on osm_id? We would need to check that it doesn't degrade performance too much, as already said.
One problem I have is that it is not really clear to me where we "copy over the about 3M postcode helper points into placex", as we could remove this in order to benefit from the change. Could you point me to the code related to this part, please?
Thank you.
For the country_code missing problem, could we for example look for it inside placex by joining on osm_id?
There is a SQL function get_country_code. See it in use, for example, here.
One problem I have is that it is not really clear for me where we "copy over the about 3M postcode helper points into placex" as we could remove this in order to benefit from the change. Could you point me out the code related to this part please?
osm2pgsql originally loads the OSM data into the place table. It is then copied over into the final placex table in the load_data function, and all further processing on the data happens in the placex table. This complicated setup is there so that when the data is later updated, we can compare the place and placex tables and see if there are really relevant changes that require the Nominatim tables to be reprocessed.
Lines in placex with postcode nodes (osm_type = N, class=place, type = postcode) are never queried because we have the special location_postcode table that is used when somebody searches for a postcode. In the trigger function that does all the address computation and fills the search indexes, postcodes are for that reason simply ignored. But they take up space nonetheless, especially in the indexes. So it would be nice if we didn't copy them over in the first place.
NB: copying from the place to placex tables in the case of updates is a bit more complicated. This happens via the insert trigger of the place table. "Not copying the postcode data" needs to be added there, too.
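To make the idea concrete, here is a toy illustration of skipping the pure postcode objects while copying place into placex. This is a heavily simplified sqlite3 stand-in, not the real Nominatim schema or load_data code; only the shape of the WHERE NOT (...) filter is the point.

```python
# Toy sqlite3 stand-in for the Postgres place/placex tables.
# Columns are heavily simplified; only the filter matters here.
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE place (osm_type TEXT, osm_id INTEGER, class TEXT, type TEXT)")
con.execute("CREATE TABLE placex (osm_type TEXT, osm_id INTEGER, class TEXT, type TEXT)")
con.executemany("INSERT INTO place VALUES (?, ?, ?, ?)", [
    ("N", 1, "place", "postcode"),   # pure postcode helper point -> skip
    ("N", 2, "place", "village"),    # regular place -> copy
    ("W", 3, "building", "yes"),     # address object -> copy
])
# The same condition would need to go into load_data and into the
# insert trigger of the place table.
con.execute("""
    INSERT INTO placex
      SELECT osm_type, osm_id, class, type FROM place
       WHERE NOT (class = 'place' AND type = 'postcode')
""")
rows = con.execute("SELECT osm_id FROM placex ORDER BY osm_id").fetchall()
print(rows)  # -> [(2,), (3,)]
```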
So, I will try to summarize; please let me know if it's correct:
- Skip rows matching osm_type = N, class=place, type = postcode when copying from place into placex, i.e. add AND NOT (osm_type = N, class=place, type = postcode) to the copy in load_data and to the insert trigger of the place table.
- Use the get_country_code SQL function to obtain the missing country_code (by computing the centroid from the place's geometry and passing it in).

Another question I have is about the places with addr:postcode. I am pretty sure we have to keep those (if they are not nodes of type=postcode), because the addr:postcode is only additional information about a more general place, which should be kept.
I'm not sure I haven't forgotten anything. Also, it will be hard for me to make good tests concerning the performance impact as I don't have a big dataset on my computer.
Yes, that sounds right.
I'd suggest you start with modifying the update code and do the performance measurements. Then we can see if it is worth doing the rest.
Also, it will be hard for me to make good tests concerning the performance impact as I don't have a big dataset on my computer.
Ah, but luckily you have access to a powerful machine that has a full planet import. ;) You can simply run nominatim refresh --postcodes before and after the modification of the update code. The refresh always makes a full computation of the postcodes over the placex/place table. So this should get us the interesting performance numbers regarding the country code computation.
I conducted some tests. With a full planet on my server, the current refresh postcode takes 8m 53s.
The first solution using get_country_code seems good. The query I used is:
SELECT * FROM (
SELECT get_country_code(centroid) as country_code, pc, ST_X(centroid), ST_Y(centroid)
FROM (
SELECT
token_normalized_postcode(address->'postcode') as pc,
ST_Centroid(ST_Collect(ST_Centroid(geometry))) as centroid
FROM place
WHERE address ? 'postcode' and geometry IS NOT null
GROUP BY pc
) xx
WHERE pc is not null
) xx
WHERE country_code is not null
ORDER BY country_code, pc
The query was executed in 11m 07s, which is very acceptable, so I guess I will go for it.
Let me know if something doesn't look good.
I don't think the query is doing the same thing as before. The inner SELECT already groups by postcode. That means that the same postcode is merged all over the world, but we only want the same postcodes from the same country to be merged. So unfortunately there is no way around computing the country_code for them all.
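The merging problem can be shown with a toy example (sqlite3 stand-in, made-up data): the same postcode string in two countries collapses into one row when grouping by postcode alone, while grouping by country and postcode keeps them apart.

```python
# Demonstrates why GROUP BY pc alone is wrong: identical postcode
# strings from different countries get merged into a single centroid.
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE pts (country TEXT, pc TEXT, x REAL)")
con.executemany("INSERT INTO pts VALUES (?, ?, ?)", [
    ("de", "10115", 13.4),  # a German postcode
    ("fr", "10115", 4.0),   # hypothetical: same string in France
])
merged = con.execute("SELECT pc, AVG(x) FROM pts GROUP BY pc").fetchall()
split = con.execute(
    "SELECT country, pc, AVG(x) FROM pts GROUP BY country, pc").fetchall()
print(len(merged), len(split))  # -> 1 2
```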
Oh yes, I naively forgot that the country really matters for the GROUP BY.
The problem is that the query which computes the country_code for each postcode in the world was executed in 1h 57m 44s, which isn't really acceptable, I think:
SELECT country_code, pc, ST_X(centroid), ST_Y(centroid)
FROM (
SELECT
get_country_code(ST_Centroid(geometry)) as country_code,
token_normalized_postcode(address->'postcode') as pc,
ST_Centroid(ST_Collect(ST_Centroid(geometry))) as centroid
FROM place
WHERE address ? 'postcode' and geometry IS NOT null
GROUP BY country_code, pc
) xx
WHERE pc is not null AND country_code is not null
ORDER BY country_code, pc
Wouldn't it be possible to compute the country_code once, when we import the data into the place table, by adding a country_code column? And if that is complicated because the place table is handled by osm2pgsql, couldn't we create a country code table which would contain a column for the osm_id and a column for the country_code? That way we wouldn't have to recompute the country_code each time.
What do you think?
I'm a bit reluctant to precompute in osm2pgsql. We'd have to resort to insert triggers on the place table, and that's difficult to set up.
Let's go back to your earlier idea to get the country_code from placex. While it's true that the 3 million pure postcodes won't be in placex anymore, the 70 million full addresses with postcodes are still there. So how about a hybrid approach of first checking placex and only computing it if that fails? Something like COALESCE(SELECT country_code FROM placex, get_country_code(centroid)) should do the trick.
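The hybrid fallback can be sketched with a toy sqlite3 example, where a user-defined function fallback() plays the role of get_country_code(centroid): rows with a match in placex take its country_code, the rest fall back to computing it.

```python
# COALESCE over a LEFT OUTER JOIN: use placex.country_code when a
# matching row exists, otherwise fall back to a computed value.
# sqlite3 stand-in; fallback() mimics get_country_code(centroid).
import sqlite3

con = sqlite3.connect(":memory:")
con.create_function("fallback", 1, lambda osm_id: "computed")
con.execute("CREATE TABLE place (osm_id INTEGER, osm_type TEXT)")
con.execute("CREATE TABLE placex (osm_id INTEGER, osm_type TEXT, country_code TEXT)")
con.executemany("INSERT INTO place VALUES (?, ?)", [(1, "N"), (2, "N")])
con.execute("INSERT INTO placex VALUES (1, 'N', 'de')")  # only osm_id 1 is in placex
rows = con.execute("""
    SELECT pl.osm_id, COALESCE(plx.country_code, fallback(pl.osm_id))
      FROM place pl LEFT OUTER JOIN placex plx
        ON pl.osm_id = plx.osm_id AND pl.osm_type = plx.osm_type
     ORDER BY pl.osm_id
""").fetchall()
print(rows)  # -> [(1, 'de'), (2, 'computed')]
```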
Yes I didn't think of that, I am currently running this query.
But doesn't this create the same problem as grouping only by postcode? Like, if we check placex for the same postcode as the one in place to get the country_code, we risk getting the wrong country (if two countries have the same postcode), no?
We could check the full address (or compare more address details), but it would limit the results a lot, and I am not sure that it would be more accurate.
Were you thinking of something more specific than what I understood?
I was thinking of using the placex row with the same osm_type + osm_id as the row in place, like you suggested.
I'm not sure I follow your argument. Can you post the query you did?
Ohh yes I am tired and I forgot half of the data that we can use, that's totally my bad. What you were saying is absolutely true.
So I will run two of those queries for performance purpose, the first one as you proposed:
SELECT country_code, pc, ST_X(centroid), ST_Y(centroid)
FROM (
SELECT
COALESCE((SELECT country_code FROM placex WHERE osm_id = place.osm_id AND osm_type = place.osm_type LIMIT 1),
get_country_code(ST_Centroid(geometry))) as country_code,
token_normalized_postcode(address->'postcode') as pc,
ST_Centroid(ST_Collect(ST_Centroid(geometry))) as centroid
FROM place
WHERE address ? 'postcode' and geometry IS NOT null
GROUP BY country_code, pc
) xx
WHERE pc is not null AND country_code is not null
ORDER BY country_code, pc
(LIMIT 1 is theoretically not necessary.)
And the second one with one big join, which I think might be better than doing a full lookup each time, though I am not totally sure, since I am doing a left outer join, so it will be very big:
SELECT cc as country_code, pc, ST_X(centroid), ST_Y(centroid)
FROM (
SELECT
COALESCE(plx.country_code, get_country_code(ST_Centroid(pl.geometry))) as cc,
token_normalized_postcode(pl.address->'postcode') as pc,
ST_Centroid(ST_Collect(ST_Centroid(pl.geometry))) as centroid
FROM place AS pl LEFT OUTER JOIN placex AS plx ON pl.osm_id = plx.osm_id AND pl.osm_type = plx.osm_type
WHERE pl.address ? 'postcode' and pl.geometry IS NOT null
GROUP BY cc, pc
) xx
WHERE pc IS NOT null AND cc IS NOT null
ORDER BY country_code, pc
So it seems good with the following query which executes in 23m 32s:
SELECT cc as country_code, pc, ST_X(centroid), ST_Y(centroid)
FROM (
SELECT
COALESCE(plx.country_code, get_country_code(ST_Centroid(pl.geometry))) as cc,
token_normalized_postcode(pl.address->'postcode') as pc,
ST_Centroid(ST_Collect(ST_Centroid(pl.geometry))) as centroid
FROM place AS pl LEFT OUTER JOIN placex AS plx ON pl.osm_id = plx.osm_id AND pl.osm_type = plx.osm_type
WHERE pl.address ? 'postcode' AND pl.geometry IS NOT null
GROUP BY cc, pc
) xx
WHERE pc IS NOT null AND cc IS NOT null
ORDER BY country_code, pc
The other query, with the SELECT inside the COALESCE, takes way too long to execute (many hours), so it is not a possibility.
I tried some optimization by applying the filters WHERE pl.address ? 'postcode' AND pl.geometry IS NOT null before the join, but it doesn't seem to improve performance.
Please let me know if that query can do the job or if I missed something.
You are, of course, right. A JOIN is the way to go. It's still a bit slow, but it's within the half-hour limit and the other advantages are bigger.
Thinking some more about your other suggestion to have an extra table: a MATERIALIZED VIEW might be useful here. They might have optimisations for a faster refresh, but I'm not sure, I've never used them before.
In any case, let's go with the JOIN query first and think about more optimisations later.
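For later reference, a rough, untested sketch of what the materialized-view idea could look like, assuming the view definition mirrors the JOIN query discussed here (the view name place_postcode_cc is made up). Note that a plain REFRESH recomputes the view from scratch, so whether it is actually faster would need measuring:

```sql
-- Hypothetical sketch, not part of the current Nominatim code.
CREATE MATERIALIZED VIEW place_postcode_cc AS
  SELECT COALESCE(plx.country_code,
                  get_country_code(ST_Centroid(pl.geometry))) AS country_code,
         token_normalized_postcode(pl.address->'postcode') AS pc,
         ST_Centroid(pl.geometry) AS centroid
    FROM place AS pl
    LEFT OUTER JOIN placex AS plx
      ON pl.osm_id = plx.osm_id AND pl.osm_type = plx.osm_type
   WHERE pl.address ? 'postcode' AND pl.geometry IS NOT NULL;

-- Before each postcode update:
REFRESH MATERIALIZED VIEW place_postcode_cc;
```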
I just remembered that we drop the place table when freezing the database. That's fine, but it's something we should print a helpful message for. Can you add an initial check that the place table exists and fail with a message that postcode updates on a frozen database are not possible?
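A minimal sketch of such a guard, using sqlite3 as a stand-in for illustration (the function name check_place_table is made up; the real code would query the Postgres catalog instead, e.g. with to_regclass('place')):

```python
# Fail early with a helpful message when the place table is missing,
# i.e. when the database has been frozen.
import sqlite3

def check_place_table(con):
    # sqlite3 stand-in for a Postgres catalog lookup.
    row = con.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' AND name = 'place'"
    ).fetchone()
    if row is None:
        raise RuntimeError(
            "place table not found: postcode updates are not possible "
            "on a frozen database")

frozen = sqlite3.connect(":memory:")  # stand-in for a frozen DB: no place table
try:
    check_place_table(frozen)
except RuntimeError as err:
    print(err)
```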
Yes, no problem, I will do that.
One more thing about the data:
To filter pure postcodes before the insertion into placex, we were going to use osm_type = N AND class=place AND type = postcode, but this only matches about 80,000 entries. To get the 3 million pure postcodes as you said, I have to query only for elements with type=postcode (which returns nodes, ways and relations).
Querying only for type=postcode makes sense to me, is it ok? (Of course class=place is not needed, as every type='postcode' is in class place.)
True, any OSM object can end up with only the postcode tag.
You are right about class=place in the current database, but it's better to leave it in anyway to make the code future-proof.
If we could compute the location_postcode table directly from place instead of placex, it wouldn't be necessary anymore to copy over the about 3M postcode helper points into placex. That would be a nice saving of space and would allow getting rid of a couple of special cases in the placex triggers.
The main issue is that we don't have country_codes in the place table, so they would need to be computed on the fly. Whether this idea is feasible largely depends on how much this degrades the performance of postcode updates.
Pointers to the relevant code: