osm-search / Nominatim

Open Source search based on OpenStreetMap data
https://nominatim.org
GNU General Public License v3.0

Calculate postcodes from place table #2049

Closed lonvia closed 3 years ago

lonvia commented 3 years ago

If we could compute the location_postcode table directly from place instead of placex, it would no longer be necessary to copy the roughly 3M postcode helper points into placex. That would be a nice saving of space and would allow us to get rid of a couple of special cases in the placex triggers.

The main issue is that we don't have country_codes in the place table, so they would need to be computed on the fly. Whether this idea is feasible largely depends on how much this degrades the performance of postcode updates.

Pointers to the relevant code:

Vineet-Sharma29 commented 3 years ago

@lonvia, I want to work on this issue. I checked the files linked in the description, but it seems their location has changed. Can you please guide me with the issue?

lonvia commented 3 years ago

@Vineet-Sharma29 Have you read https://wiki.openstreetmap.org/wiki/User:Lonvia/GSoC_2021_Nominatim_Projects already?

Vineet-Sharma29 commented 3 years ago

@lonvia I have read the guide and installed Nominatim using Docker.

lonvia commented 3 years ago

I've updated the links in the issue, although I'm sure you have found the code yourself by now.

AntoJvlt commented 3 years ago

I will work on this issue if it is ok for everyone.

As @lonvia already ported the computation of the postcode table to Python, the only part that needs to change is the computation of the list of valid postcodes from placex. Am I right?

For the missing country_code problem, could we, for example, look it up in placex by joining on osm_id? We would need to check that it doesn't degrade performance too much, as already said.

One problem is that it is not really clear to me where we "copy over the about 3M postcode helper points into placex", as we could remove that step to benefit from the change. Could you point me to the code related to this part, please?

Thank you.

lonvia commented 3 years ago

For the missing country_code problem, could we, for example, look it up in placex by joining on osm_id?

There is a SQL function get_country_code. See it in use for example here.
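As a rough sketch (assuming get_country_code accepts a point geometry, as in the linked trigger code), the on-the-fly lookup could look like this:

```sql
-- Illustrative only: derive the country code on the fly for postcode
-- rows in place, which has no country_code column of its own.
SELECT get_country_code(ST_Centroid(geometry)) AS country_code,
       address->'postcode' AS postcode
FROM place
WHERE address ? 'postcode'
LIMIT 10;
```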

One problem is that it is not really clear to me where we "copy over the about 3M postcode helper points into placex", as we could remove that step to benefit from the change. Could you point me to the code related to this part, please?

osm2pgsql initially loads the OSM data into the place table. It is then copied over into the final placex table in the load_data function, and all further processing happens on the placex table. This complicated setup exists so that, when the data is later updated, we can compare the place and placex tables and see if there are really relevant changes that require the Nominatim tables to be reprocessed.
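In essence, the copy step is an INSERT ... SELECT (a simplified sketch; the real load_data function does more, and the column list here is illustrative, not the complete schema):

```sql
-- Simplified sketch of the place -> placex copy done by load_data;
-- column list is illustrative, not the full placex schema.
INSERT INTO placex (osm_type, osm_id, class, type, name, address, geometry)
SELECT osm_type, osm_id, class, type, name, address, geometry
FROM place;
```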

Rows in placex with postcode nodes (osm_type = N, class = place, type = postcode) are never queried, because we have the special location_postcode table that gets used when somebody searches for a postcode. In the trigger function that does all the address computation and fills the search indexes, postcodes are for that reason simply ignored. But they take up space nonetheless, especially in the indexes. So it would be nice if we didn't copy them over in the first place.

NB: copying from the place to the placex table in the case of updates is a bit more complicated. It happens via the insert trigger of the place table. "Not copying the postcode data" needs to be added there, too.

AntoJvlt commented 3 years ago

So, I will try to summarize; please let me know if it's correct:

Another question I have is about places with addr:postcode. I am pretty sure we have to keep those (if they are not nodes of type=postcode), because addr:postcode is only additional information about a broader place, which should be kept.

I'm not sure I haven't forgotten anything. Also, it will be hard for me to do good tests of the performance impact, as I don't have a big dataset on my computer.

lonvia commented 3 years ago

Yes, that sounds right.

I'd suggest you start with modifying the update code and do the performance measurements. Then we can see if it is worth doing the rest.

Also, it will be hard for me to do good tests of the performance impact, as I don't have a big dataset on my computer.

Ah, but luckily you have access to a powerful machine that has a full planet import. ;) You can simply run nominatim refresh --postcodes before and after the modification of the update code. The refresh always does a full computation of the postcodes over the placex/place table, so this should get us the interesting performance numbers for the country code computation.

AntoJvlt commented 3 years ago

I conducted some tests. With a full planet on my server, the current postcode refresh takes 8m 53s.

The first solution using get_country_code seems good. The query I used is:

SELECT * FROM (
    SELECT get_country_code(centroid) AS country_code, pc, ST_X(centroid), ST_Y(centroid)
    FROM (
        SELECT
            token_normalized_postcode(address->'postcode') AS pc,
            ST_Centroid(ST_Collect(ST_Centroid(geometry))) AS centroid
        FROM place
        WHERE address ? 'postcode' AND geometry IS NOT NULL
        GROUP BY pc
    ) xx
    WHERE pc IS NOT NULL
) yy
WHERE country_code IS NOT NULL
ORDER BY country_code, pc

The query executed in 11m 07s, which is very acceptable, so I guess I will go for it.

Let me know if something doesn't look good.

lonvia commented 3 years ago

I don't think the query is doing the same thing as before. The inner SELECT already groups by postcode. That means that the same postcode is merged all over the world, but we only want identical postcodes from the same country to be merged. So unfortunately there is no way around computing the country_code for them all.

AntoJvlt commented 3 years ago

Oh yes, I naively forgot that the country really matters for the GROUP BY.

The problem is that the query which computes the country_code for each postcode in the world executed in 1h 57m 44s, which isn't really acceptable, I think:

SELECT country_code, pc, ST_X(centroid), ST_Y(centroid)
FROM (
    SELECT
        get_country_code(ST_Centroid(geometry)) AS country_code,
        token_normalized_postcode(address->'postcode') AS pc,
        ST_Centroid(ST_Collect(ST_Centroid(geometry))) AS centroid
    FROM place
    WHERE address ? 'postcode' AND geometry IS NOT NULL
    GROUP BY country_code, pc
) xx
WHERE pc IS NOT NULL AND country_code IS NOT NULL
ORDER BY country_code, pc

Wouldn't it be possible to compute the country_code once when we import the data into the place table, by adding a country_code column? And if that is complicated, as the place table is handled by osm2pgsql, couldn't we create a country code table which would contain a column for the osm_id and a column for the country_code? That way we wouldn't have to recompute the country_code each time.

What do you think?

lonvia commented 3 years ago

I'm a bit reluctant to precompute in osm2pgsql. We'd have to resort to insert triggers on the place table, and that's difficult to set up.

Let's go back to your earlier idea of getting the country_code from placex. While it's true that the 3 million pure postcodes won't be in placex anymore, the 70 million full addresses with postcodes are still there. So how about a hybrid approach of first checking placex and only computing it if that fails? Something like COALESCE((SELECT country_code FROM placex ...), get_country_code(centroid)) should do the trick.

AntoJvlt commented 3 years ago

Yes, I didn't think of that; I am currently running this query.

But doesn't this create the same problem as grouping only by postcode? If we check placex for the same postcode as the one in place to get the country_code, we are risking getting the wrong country (if two countries have the same postcode), no?

We could check the full address (or compare more address details), but it would limit the results a lot and I am not sure it would be more accurate.

Were you thinking of something more specific than what I understood?

lonvia commented 3 years ago

I was thinking of using the placex row with the same osm_type + osm_id as the row in place like you suggested.

I'm not sure I follow your argument. Can you post the query you did?

AntoJvlt commented 3 years ago

Ohh yes, I am tired and I forgot half of the data that we can use, that's totally my bad. What you were saying is absolutely right.

So I will run two of these queries to compare performance, the first one as you proposed:

SELECT country_code, pc, ST_X(centroid), ST_Y(centroid)
FROM (
    SELECT
        COALESCE((SELECT country_code FROM placex
                  WHERE osm_id = place.osm_id AND osm_type = place.osm_type LIMIT 1),
                 get_country_code(ST_Centroid(geometry))) AS country_code,
        token_normalized_postcode(address->'postcode') AS pc,
        ST_Centroid(ST_Collect(ST_Centroid(geometry))) AS centroid
    FROM place
    WHERE address ? 'postcode' AND geometry IS NOT NULL
    GROUP BY country_code, pc
) xx
WHERE pc IS NOT NULL AND country_code IS NOT NULL
ORDER BY country_code, pc

(The LIMIT 1 is theoretically not necessary.)

And the second one with one big join, which I think might be better than doing a subquery lookup for each row, though I am not totally sure, as it is a left outer join and will therefore be very big:

SELECT cc AS country_code, pc, ST_X(centroid), ST_Y(centroid)
FROM (
    SELECT
        COALESCE(plx.country_code, get_country_code(ST_Centroid(pl.geometry))) AS cc,
        token_normalized_postcode(pl.address->'postcode') AS pc,
        ST_Centroid(ST_Collect(ST_Centroid(pl.geometry))) AS centroid
    FROM place AS pl
    LEFT OUTER JOIN placex AS plx
        ON pl.osm_id = plx.osm_id AND pl.osm_type = plx.osm_type
    WHERE pl.address ? 'postcode' AND pl.geometry IS NOT NULL
    GROUP BY cc, pc
) xx
WHERE pc IS NOT NULL AND cc IS NOT NULL
ORDER BY country_code, pc

AntoJvlt commented 3 years ago

So it seems good with the following query which executes in 23m 32s:

SELECT cc AS country_code, pc, ST_X(centroid), ST_Y(centroid)
FROM (
    SELECT
        COALESCE(plx.country_code, get_country_code(ST_Centroid(pl.geometry))) AS cc,
        token_normalized_postcode(pl.address->'postcode') AS pc,
        ST_Centroid(ST_Collect(ST_Centroid(pl.geometry))) AS centroid
    FROM place AS pl
    LEFT OUTER JOIN placex AS plx
        ON pl.osm_id = plx.osm_id AND pl.osm_type = plx.osm_type
    WHERE pl.address ? 'postcode' AND pl.geometry IS NOT NULL
    GROUP BY cc, pc
) xx
WHERE pc IS NOT NULL AND cc IS NOT NULL
ORDER BY country_code, pc

The other query, with the SELECT inside the COALESCE, takes way too long to execute (many hours), so it is not an option. I tried some optimization by applying the filters WHERE pl.address ? 'postcode' AND pl.geometry IS NOT NULL before the join, but it doesn't seem to improve performance.

Please let me know if that query can do the job or if I missed something.

lonvia commented 3 years ago

You are, of course, right. A JOIN is the way to go. It's still a bit slow, but it's within the half-hour limit and the other advantages outweigh that.

Thinking some more about your other suggestion of having an extra table: a MATERIALIZED VIEW might be of use here. They might have optimisations for a faster refresh, but I'm not sure; I've never used them before.
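A sketch of that idea (the view name place_country is hypothetical):

```sql
-- Hypothetical cache of per-object country codes; could be rebuilt with
-- REFRESH MATERIALIZED VIEW place_country after updates.
CREATE MATERIALIZED VIEW place_country AS
SELECT osm_type, osm_id,
       get_country_code(ST_Centroid(geometry)) AS country_code
FROM place
WHERE address ? 'postcode';
```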

In any case, let's go with the JOIN query first and think about more optimisations later.

lonvia commented 3 years ago

I just remembered that we drop the place table when freezing the database. That's fine, but something we should print a helpful message for. Can you add an initial check that the place table exists and fail with a message that postcode updates are not possible on a frozen database?
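One possible form for such a check (to_regclass returns NULL when the relation does not exist):

```sql
-- Guard for frozen databases: the place table is dropped on freeze,
-- so postcode updates must bail out if it is missing.
SELECT to_regclass('place') IS NOT NULL AS place_table_exists;
```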

AntoJvlt commented 3 years ago

Yes, no problem, I will do that.

AntoJvlt commented 3 years ago

One more thing about data:

To filter pure postcodes before the insertion into placex, we were going to use osm_type = N AND class = place AND type = postcode, but this only covers about 80,000 entries. To get the 3 million pure postcodes you mentioned, I have to query only for elements with type=postcode (which returns nodes, ways and relations).

Querying only for type=postcode makes sense to me, is that ok? (Of course class=place is not needed, as every type=postcode element is of class place.)

lonvia commented 3 years ago

True, any OSM object can end up with only the postcode tag.

You are right about class=place in the current database, but it's better to leave it in anyway to make the code future-proof.
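Put together, the agreed filter could be sketched as (illustrative condition, applying to any OSM object type):

```sql
-- Exclude pure postcode objects from the place -> placex copy;
-- class = 'place' is redundant today but kept for future-proofing.
SELECT *
FROM place
WHERE NOT (class = 'place' AND type = 'postcode');
```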