opencaching / opencaching-pl

The source code of Opencaching.PL (and some other domains)
https://opencaching.pl/
GNU General Public License v3.0
22 stars 33 forks source link

Region from coordinates - extended look #2426

Open andrixnet opened 2 months ago

andrixnet commented 2 months ago

Given the recent hickup described in issue #2423 I'd like to open a discussion and maybe make some improvements based on OC NODE specific circumstances.

On OCRO I have NUTS data. Editing a cache automatically (via JS and AJAX) sets the "Region" field. I assume it's done via NUTS. Currently pressing "Region from coordinates" yields an informative box with the following sources:

There are several possible sources for this info, some requiring API key and possibly payment. Although this is a volunteer and community supported project (ie each OC national site), it might be concievable that OC site X might prefer evena subscription based service.

As a general concept, I would suggest the following strategy:

smth like:

$geocoding['services'] = 'nuts, nominatim';
$geocoding['nuts']['source'] = 'local';
$geocoding['nuts']['level'] = 3;
$geocoding['nominatim']['source'] = 'xxxxxxx';
// etc

This concept would make handling per site much more straightforward.

On the other hand, OCRO is still affected by #103 as our "judeţ" zoning is the appropriate one, but we can't configure NUTS, it's hardcoded and changing anything would break other nodes. Also see #1838

right now NUTS answer and Nominatim answer differ: region

In the least, configurable NUTS layer would be of great help.

rapotek commented 2 months ago

In 2019, following #1977, I had been researching a more flexible approach for area codes and shapes. That resulted in tables defined as follow:

CREATE TABLE `area_codes` (
    `code` VARCHAR(127) NOT NULL COMMENT 'Full code, prefixed by code_type and colon',
    `code_type` VARCHAR(30) NOT NULL COMMENT 'Code type, f.ex. iso3166-1, iso3166-2, fips6-4, nuts-2',
    `country` VARCHAR(2) NOT NULL COMMENT 'ISO 3166-1 alpha-2 country code (caution: NUTS uses EL for Greece, here should be GR)',
    `bare_code` VARCHAR(50) DEFAULT NULL COMMENT 'Non-prefixed code',
    `name` VARCHAR(255) DEFAULT NULL COMMENT 'Area name, in English or main native language',
    `parent_code` VARCHAR(127) DEFAULT NULL COMMENT 'Full code of a parent, may be NULL in case of country',
    `adm_level` SMALLINT(5) UNSIGNED DEFAULT NULL COMMENT 'Area administrative level',
    `last_modified` DATETIME NOT NULL COMMENT 'Last modification date and time',
    PRIMARY KEY (`code`),
    KEY (`code_type`,`bare_code`),
    KEY (`country`,`bare_code`),
    KEY (`parent_code`),
    FOREIGN KEY (`parent_code`) REFERENCES `area_codes` (`code`) ON DELETE SET NULL ON UPDATE CASCADE
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_unicode_ci;

CREATE TABLE `area_shapes` (
    `code` VARCHAR(127) NOT NULL COMMENT 'Full code, prefixed by code_type and colon',
    `shape` GEOMETRY NOT NULL COMMENT 'Geometry of country or subdivision boundaries',
    `min_lon` FLOAT(10,7) DEFAULT NULL COMMENT 'Minimal longitide of code area shape',
    `min_lat` FLOAT(9,7) DEFAULT NULL COMMENT 'Minimal latitude of code area shape',
    `max_lon` FLOAT(10,7) DEFAULT NULL COMMENT 'Maximal longitide of code area shape',
    `max_lat` FLOAT(9,7) DEFAULT NULL COMMENT 'Maximal latitude of code area shape',
    `source_datetime` DATETIME DEFAULT NULL 'Date and time of a shape update according to its source, f.ex. openstreetmap',
    `last_modified` DATETIME NOT NULL COMMENT 'Last modification date and time',
    PRIMARY KEY (`code`),
    SPATIAL KEY (`shape`),
    FOREIGN KEY (`code`) REFERENCES `area_codes` (`code`) ON DELETE CASCADE ON UPDATE CASCADE
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_unicode_ci;

In short, above tables should allow using shapes and region codes from different sources (iso3166-2, NUTS, FIPS etc) with uniformed iso3166-1 alpha-2 country codes, where config should set a priority of codes in case of multiple matches. In addition I had written example scripts for fetching area shapes and codes from OSM overpass and tested effective SQL queries for searching for area codes by location. In theory this should allow to store the entire world shapes in database, but the result would be huge. Alternatives are: simplifying area shapes to be less exact - this can dramatically reduce database size at the cost of increased risk of incorrect location-area matching near an area boundaries - or implementing your concept of configurable using external reverse-geocoding services in case of missing shapes in local.

I had abandoned this work because of lack of interest in introducing the solutions into opencaching code.

andrixnet commented 2 months ago

At OCRO, having NUTS data, my primary interest is to associate a different adm_level then the one currently hardcoded.

OCRO has data for adm_level 0, 1, 2, 3 OCRO uses level 2 but given how the data is organized, it needs to use level 3

andrixnet commented 2 months ago

I'm not sure how geometries are stored in the other formats and/or if conversions would be required (looking at your table design).

I wouldn't simplify geometries because it could lead to all sorts of interesting (and potentially frustrating) results near boundaries. Besides, OC focuses on one country, there is no point in mapping the entire workd in the database.

min_lon, max_lon, etc, were designed to be some sort of bounding box optimization? Because MySQL/MariaDB has separate functions to query if a given point is inside the bounding box of a given shape, or inside the shape proper. (within and st_within IIRC)

AFAIK opencaching-pl code at some point (in the past - 2014-2017 at least) did a separate PHP implementation to check "inside shape proper" as if there was no (or properly working) st_within at that time. Code which I'm sure was slower then the MySQL native.

Also, since I got the NUTS data from one of the OCPL developers, I believe it would be interesting and useful to put together some procedures to obtain/update such data. Such as sources, conversions required. Apppliable to region data and npa. (currently OCRO as no npa data ... and whatever support the site has for npa, doesn't work)

andrixnet commented 2 months ago

P.S. I'm also trying to dust off my SQL and basic PHP knowledge ... :-)

rapotek commented 2 months ago

Geometries are geometries, ie. simply shapes regardless of source of origin, there is no conversion. max and min were added because in time of testing with database filled up with thousands of shape a query SELECT code FROM area_shapes WHERE code IN (SELECT code FROM area_shapes WHERE @lon BETWEEN min_lon AND max_lon AND @lat BETWEEN min_lat AND max_lat) AND ST_Contains(shape, POINT(@lon, @lat)) ORDER BY ST_Area(shape); was a few hundred times faster than SELECT code FROM area_shapes WHERE ST_Contains(shape, POINT(@lon, @lat)) ORDER BY ST_Area(shape);.

This could have changed because in newer mysql/mariadb version spatial searching could be much better optimized. I cannot do comparisons by now because I would have to recreate database and fill up area_shapes table with correct data. I didn't tested st_within, maybe this would be fast enough to get rid max and min fields.

As for adm_levels - the table is open for any level you can provide, its usage in code is still open: f.ex. use a maximal adm_level from matching codes/shapes but not greater than configured one.

andrixnet commented 2 months ago

I've looked nito the code again. Some has changed since my last forray into this.

src/Models/Coordinates/NutsLocation.php private function loadFromCoords(Coordinates $coords)

This uses ST_WITHIN in the SQL query.

I've done some tests manually, this query returns (on my database and coordinates from Romania) levels 0,1,2,3. point(24.3 46.2) ->

+-------+-------+-------------------+
| level | code  | name              |
+-------+-------+-------------------+
|     3 | RO126 | Sibiu             |
|     2 | RO12  | Centru            |
|     1 | RO1   | Macroregiunea Unu |
|     0 | RO    | România           |
+-------+-------+-------------------+

Note: returning AsText(shape) AS geometry data in this query I think is useless and only consumes memory between mysql server and php for nothing.

Not sure why private function loadNamesForCodes() exists. All relevant info can be obtained in a single query like this:

select `level`, n2.code, n2.name from nuts_layer n1 left join nuts_codes n2 using (code) where st_within( GeomFromText ( "point(24.3 46.2)" ), shape) order by `level` desc

Is public function getRegionName() responsible for returning level 2 ?

        // try to detect region name in order 2-1-3 (NUTS-levels)
        // (smaller countries has e.g. only 3-level names)

At least in the data I have (for most countries of Europe) appears to have data at all levels for most countries.

OCUS however has data for levels 0 and 2 only.

Does this also favour level 2 ?

    public static function getRegionsListByCountryCode($countryCode)
    {
        $countryCode .= '__'; // add sql wildcard (two letters)