nvkelso / natural-earth-vector

A global, public domain map dataset available at three scales and featuring tightly integrated vector and raster data.
https://www.naturalearthdata.com/

Add Wikidata and Who's On First concordances for populated places #214

Closed nvkelso closed 6 years ago

nvkelso commented 6 years ago

Join first to Who's On First based on the common GeoNames concordance and harvest Wikidata IDs from the Who's On First concordances. Verify the result by joining with OpenStreetMap and make one-off edits to fix any funk and fill in the gaps.

nvkelso commented 6 years ago

The bulk of this was completed in 14cd3060fe5d347262c438d849b79070bfbdc3ce by joining against GeoNamesID in NE to gn:id in WOF concordance file. The wof_id and wikidataid were copied over for ~ 5800 features.
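A minimal sketch of that concordance join, with illustrative data (the real NE attribute is GeoNamesID and the WOF concordance key is gn:id; the place names, IDs, and dict layout here are hypothetical):

```python
# Hypothetical miniature of the join in 14cd306: NE rows keyed by GeoNames ID,
# WOF concordance records keyed by their gn:id entry.
ne_places = [
    {"name": "Exampleville", "geonameid": 1000001, "wof_id": None, "wikidataid": None},
    {"name": "Nowheretown", "geonameid": -1, "wof_id": None, "wikidataid": None},
]

wof_by_gnid = {
    1000001: {"wof:id": 123456789, "wd:id": "Q99999999"},
}

for place in ne_places:
    match = wof_by_gnid.get(place["geonameid"])
    if match:
        # Harvest both concordances from the matching WOF record.
        place["wof_id"] = match["wof:id"]
        place["wikidataid"] = match["wd:id"]
```

Features with no WOF match (including the `-1` placeholder GeoNames IDs) simply keep their empty concordance columns for later manual work.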

Touch-ups for zooms 2, 3, and 4 were completed in 5fa8401e16cf7e98615e74ec5f57570adfb095f0. There were somewhere between 20 and 50 of these. In some cases the GeoNamesID in NE was bad, in some other cases the gn:id in WOF was bad (like Honolulu), in other cases the feature doesn't exist in WOF (yet).

Current status:

Notes on future work:

Because the remaining count is over 1,000, a semi-automated approach is needed for the remainder.

nvkelso commented 6 years ago

Next step is probably running the whole set thru the python scripts in https://github.com/mapzen-data/wikipedia-notebooks to verify the existing work and fill in the gaps.

ImreSamu commented 6 years ago

QA: After re-thinking the problem, I started with Wikidata geosparql-based queries (so no licensing issues).
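A minimal example of the kind of geosparql lookup involved (a sketch, not necessarily the exact queries used here; it assumes the public Wikidata endpoint, property P1566 for the GeoNames ID, and P625 for coordinates):

```python
import urllib.parse

# Public Wikidata SPARQL endpoint.
ENDPOINT = "https://query.wikidata.org/sparql"

def build_query(geonames_id):
    # Find Wikidata items carrying a given GeoNames ID (P1566) and return
    # their coordinates (P625) so the result can be distance-checked
    # against the NE point.
    return (
        "SELECT ?item ?itemLabel ?coord WHERE {\n"
        f'  ?item wdt:P1566 "{geonames_id}" ;\n'
        "        wdt:P625 ?coord .\n"
        '  SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }\n'
        "}"
    )

def request_url(geonames_id):
    # GET URL for the endpoint, asking for JSON results.
    params = urllib.parse.urlencode({"query": build_query(geonames_id), "format": "json"})
    return ENDPOINT + "?" + params
```

Fetching `request_url(...)` with any HTTP client returns JSON bindings that can be compared against the NE lat/lng and name.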

This is a first example report, based on the latest commit :

nvkelso commented 6 years ago

There are 403 new matches from geonameid to wof_id via a file from @stepps00, based on his remainder import of Natural Earth places into WOF a few weeks ago. But a few of those are suspect based on work I did last night, so only 395 features' wof_id were copied over via 6a6c17518f9f0fa7611f2a6770885b94f4728699.


nvkelso commented 6 years ago

@ImreSamu 2.5% error rate isn't bad! :) I've spot checked your list and I agree with the changes. I've made them with 143 changed wikidataid concordances via 51851a43090f871151c40cb6a9565beefdda8a9d.


(The FID isn't necessarily stable, so I did a join based on ~ name & admin0 & the_old_wikidataid.)

nvkelso commented 6 years ago

838 places added wof_id via 3529fe5a2583bdde1dc3c63514105f4e37a5f890, in addition to the 403 in https://github.com/nvkelso/natural-earth-vector/issues/214#issuecomment-324217775. This is now 100% complete for wof_id concordances. (But only 5430 features have a wikidataid concordance, 74% of the total.)


There look to be 13 places that need revision in Who's On First (they look like new IDs for existing place IDs). I think the wof_id on the left is good, and the one on the right should be superseded into the one on the left. /cc @stepps00


ImreSamu commented 6 years ago

@nvkelso: Thanks! :)

I have created new wikidataid proposals (504 new wikidataids):

The other 1409 missing are not so easy to match. As I see it, the main problem categories are:

EDIT:

nvkelso commented 6 years ago

Added those 504 via https://github.com/nvkelso/natural-earth-vector/commit/dc220c7d659a0dd2fa037002e0e309ffa014474c.


nvkelso commented 6 years ago

@ImreSamu can you share the Wikidata SPARQL queries here, please, so we have documentation if it's needed again in the future?

nvkelso commented 6 years ago

Current status: 5430 + 504 = 5934 with wikidata id concordance out of 7343 total = 81%.

Of the 1409 that are still missing concordance, perhaps @ImreSamu can harvest another 600 per https://github.com/nvkelso/natural-earth-vector/issues/214#issuecomment-324396862.

In the meantime I've been looking at Olga's work from last summer (per https://github.com/nvkelso/natural-earth-vector/issues/214#issuecomment-323953779). I've gotten it to run, but I seem to be missing the part where name + adm0name is searched for, or name + adm1name. Notwithstanding that...

I'm on vacation starting Friday and the following week. I'll post an update tomorrow and then be away from the computer but on email.

nvkelso commented 6 years ago

Posting my intermediate results here as a GIST: https://gist.github.com/nvkelso/06393fcfda298c98571bb3d3a3845e8c.

I finished manually reviewing the top 550, which were sometimes false negatives. I spot checked the positives after that a lot less. They mostly seem right, except some are disambiguation pages and, rarely, a different place altogether.

@ImreSamu maybe you can run these candidates thru the same process to determine which you think are more or less valid, too?

ImreSamu commented 6 years ago

@nvkelso

You can find my NaturalEarth vs. Wikidata QA codes here: github.com/ImreSamu/natural-earth-vector-qa

I have created a simple scoring system based on

Current status:

Summary report v1

| _status | _wikidata_status | _geonames_status | n |
| --- | --- | --- | --- |
| S1-Very good match (_score > 120) |  |  | 66 |
| S1-Very good match (_score > 120) |  | OK | 160 |
| S1-Very good match (_score > 120) | DIFF |  | 18 |
| S1-Very good match (_score > 120) | DIFF | OK | 127 |
| S1-Very good match (_score > 120) | EQ |  | 631 |
| S1-Very good match (_score > 120) | EQ | OK | 4359 |
| S2-Good match (90 - 120) |  |  | 467 |
| S2-Good match (90 - 120) |  | OK | 58 |
| S2-Good match (90 - 120) | DIFF |  | 58 |
| S2-Good match (90 - 120) | DIFF | OK | 21 |
| S2-Good match (90 - 120) | EQ |  | 10 |
| S2-Good match (90 - 120) | EQ | OK | 223 |
| S3-Maybe (40 - 90) |  |  | 174 |
| S3-Maybe (40 - 90) |  | OK | 75 |
| S3-Maybe (40 - 90) | DIFF |  | 25 |
| S3-Maybe (40 - 90) | DIFF | OK | 18 |
| S3-Maybe (40 - 90) | EQ |  | 16 |
| S3-Maybe (40 - 90) | EQ | OK | 139 |
| S4-Not found in wikidata (score < 40) |  |  | 416 |
| S4-Not found in wikidata (score < 40) | DIFF |  | 301 |
| S4-Not found in wikidata (score < 40) | DIFF | OK | 17 |
| S4-Not found in wikidata (score < 40) | EQ |  | 1 |

Summary report v2

| _status | n |
| --- | --- |
| S1-Very good match (_score > 120) | 5361 |
| S2-Good match (90 - 120) | 837 |
| S3-Maybe (40 - 90) | 447 |
| S4-Not found in wikidata (score < 40) | 735 |

comment for the 735 cases in S4-Not found in wikidata (score < 40)
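The score buckets in the summary reports can be expressed as a tiny classifier (a sketch; the behavior at exactly 90 and 40 is an assumption, since the reports only give the ranges):

```python
def match_status(score):
    # Thresholds taken from the summary reports; boundary handling assumed.
    if score > 120:
        return "S1-Very good match"
    elif score >= 90:
        return "S2-Good match"
    elif score >= 40:
        return "S3-Maybe"
    else:
        return "S4-Not found in wikidata"
```

Anything landing in S3 or S4 is what the manual-review passes in this thread concentrate on.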

klokan commented 6 years ago

Hi @nvkelso and @ImreSamu

It is great to see the efforts to add WikiDataIDs to Natural Earth Data.

I would like to contribute here with the manually verified conflicting or missing links between Natural Earth Data and related Wikipedia pages, if it helps you in the process.

These were done by students for Klokan Technologies about 5 years back (on a very old Natural Earth dataset, at a time when Wikidata was in its infancy). The students spent about a week finding the proper Wikipedia articles for the records that were not linkable automatically (using a defined Levenshtein distance and geographic location verification). The data is probably not usable directly, but it may help you save a bit of time on the cases which can't be automated:

Manually linked NE features to Wikipedia articles (those not linkable automatically)

https://gist.github.com/klokan/3d6d97c3d95856b18b8dcde81fe69e1b

Physical and cultural NE features with the same title, which should not be linked together:

https://gist.github.com/klokan/4e0800bcc04781e2c56cf57fc1e41b07

We waive all the copyrights on this - it is freely reusable. If you find it helpful it would be kind to mention KlokanTech on the announcements of results. We are just trying to help.

ImreSamu commented 6 years ago

Current status:

Columns:

Important comment:

I will try to continue the testing ...

@nvkelso: Thank you for your GIST file; I have used it in my manual testing. It has already helped a lot in detecting programming errors, building better SPARQL filters, etc.

nvkelso commented 6 years ago

@ImreSamu wow, those are epic Wikidata threads! Thanks for sharing them. I can sympathize and feel their pain, having worked with just a small selection in Natural Earth and a more comparable set in the Who's On First gazetteer.

Here's a good example, the 2nd in your diff for Targoviste, Romania:

Your newly proposed wikidata ID is for a page with lots of translations and a good English Wikipedia page. The ceb related one seems to be a "mirror world" GeoNames related import that's really a duplicate of the existing Wikidata feature.

Here's another for Beringovskiy, Russia:

Here the existing NE wikidata ID is for a disambiguation page, where your proposal is for the real place. Yeah!

Final example Douglass, Isle of Man:

Your proposal is actually in Isle of Man, the existing wikidata ID is for an obviously incorrect place in Scotland.

I'm noticing a trend: the lower Wikidata ID value is probably the correct one. Might be useful to you as a tie-breaker between otherwise similar candidates.

I checked a few GeoNames.org concordances and I'm mixed about those. For instance, I've seen links to the admin feature instead of the capital of the same feature (often they are "unitary" features). I'm going to import them only in cases where NE doesn't already have a GN_ID.

A parallel Wikidata ID example of an admin feature versus its capital (unitary or otherwise) is Tel Aviv (Q33935) and Tel Aviv-Yafo Municipality (Q12410321). In this case the Natural Earth name is Tel Aviv-Yafo, but your suggestion of Q33935 results in many more Wikipedia links with all the translations (but all missing -Yafo). This is really splitting hairs 😉 Generally Natural Earth favors the agglomeration name if it's a formal governing body (another example is the Gold Coast in Australia). If the use-case is linking up with OpenStreetMap's http://www.openstreetmap.org/relation/1382494 they list Q33935, but the names and the polygon shape in OSM suggest that's in error and it should be Q12410321 instead, so I'm going to leave it. Anyhow... I'm only manually reviewing the top 10 of these that have an import scale rank of 2 or 3, and Tel Aviv-Yafo is the only one I take issue with. Sampling the remainder I don't see any suspicious candidates :)

(The WD population is useful, but since we don't have a complete set I'm going to ignore importing it at all now but look forward to retaining it in your future analysis.)

I think it's worthwhile to import your diffs and new ones now. If one of them turns out to be funky / superseded in the whole Cebuano fix-up then we can update NE later (it's probably going to take them a while to work thru their funk, and your work is "better than yesterday" so let's carry it forward to tomorrow).

The import is in two commits:

@klokan Thanks for the GISTs in https://github.com/nvkelso/natural-earth-vector/issues/214#issuecomment-326575186. I'm not going to do anything with them immediately, but they have good content that may be useful later :)

nvkelso commented 6 years ago

@ImreSamu I'm happy with where this is for a v4 Natural Earth release now. It does have implications: these 525 features wouldn't get included / translated depending on the join technique, so that will need to be managed.

There are only 18 features at zoom 5 missing the Wikidata IDs, 378 at zoom 6, 116 at zoom 7, and 13 at zoom 8+. Most maps only translate names at zooms 0 to 6 so this is pretty good! 😄

It's probably more than a couple weeks' work to sort thru this set to determine whether OSM even has Wikidata links for them, whether the Wikidata ID pages exist, etc. (My earlier estimate in https://github.com/nvkelso/natural-earth-vector/issues/214#issuecomment-324552142 was 259 missing minimum versus 525 actual now, so a chunk of work either way.) I've got a few other tasks for the v4 release to knock out so I'm going to move on to those now.

I could be persuaded otherwise if you see there is more to gain based on your earlier OpenMapTiles joins. What do you think?

ImreSamu commented 6 years ago

@nvkelso Thank you for accepting my suggestions.

Probably I need more time to finish the current checking ( about 1-2 weeks )

My planned to-do list for ~ NE v4.0

If we have a strict deadline for v4.0 then I need to re-think the priority (probably T1 and T4 are the most important for the next milestone). imho: the biggest business value right now is the unicode-name fixes, not the wikidata-ids.


This is a much bigger project than I expected, and there are still a lot of problems on the wikidata side. And linking NE with OSM via wikidata-id (and fixing the problems) is another 2-4 week project (maybe NE 4.1?)

other comments:

nvkelso commented 6 years ago

In terms of timing I'd like to release on or just before the Montréal NACIS meeting which is 33 days away. I'll gladly take more contributions from you until Friday, Sept 29th, but after that I'll only be focused on packaging the release.

Yes, I'm happy for even more followup in a v4.1 issue with more changes! (See below for followup discussion and issues.) Thanks for all your help with this 😄

If we have a strict deadline for v4.0 then I need to re-think the priority (probably T1 and T4 are the most important for the next milestone).

I'd do it T4, T1, T2, T3 personally but whatever works for you 😉

imho: the biggest business value right now is the unicode-name fixes, not the wikidata-ids.

If you provide a 100% coverage table (all 7343 features) with the unicode changes those are easy for me to just accept and change the default NE name (since the nameascii is already there).

Some of the other fuzzy name matches I'd like to track for researching and fixing in the v4.1 milestone. Sometimes it's a transliteration stylistic difference (like with Russian), sometimes there is a ", Countryname" or similar suffix appended, sometimes it's a Spanish name with a crazy long formal name where the conventional name is shorter, sometimes it's a former USSR place where they've redone all the names. All those take time to evaluate, accept, and ensure the old name is stored in nameparenthetical or namealt. Tracking this with: https://github.com/nvkelso/natural-earth-vector/issues/219.

Per T2 comment above for https://gist.github.com/ImreSamu/84d3603ef4cbf14a0550cdd8491531b2:

| ne_adm0name | wd_countrylabel | N |
| --- | --- | --- |
| Aland | Finland | 1 |

Does this mean that there is 1 place that NE says is in Aland but Wikidata says is in Finland? That's somewhat explainable since Aland is a special region of Finland. Could be a subtle error, though.

Some of these are easily explainable (like Congo, China, and Curaçao). Others, like Argentina/Bolivia and Gabon/Equatorial Guinea, may be errors. The ne_adm0name value was constructed by a spatial join a long time ago between the populated places points and the NE 1:10,000,000 country themes, with manual fixes over the years. Some options:

  1. NE adm0name is bad probably means the populated place lat/lng is wrong OR the adm0 polygon is wrong. The adm0name should be fixed, but the lat/lng &/or adm0 polygon should also be fixed so the name doesn't regress in the future.
  2. NE adm0name is good probably means the Wikidata page is wrong / has a difference of opinion about adm0 hierarchies.

For the first case, there is a town on the Germany-Austria border that I fixed a couple days ago in the v4 series by moving its lat/lng a tiny bit north-west. I can take care of edits like that if you point them out to me.

(bigger snippet of table below)

| ne_adm0name | wd_countrylabel | N |
| --- | --- | --- |
| Aland | Finland | 1 |
| American Samoa | United States of America | 1 |
| Antarctica |  | 29 |
| Argentina | Bolivia | 1 |
| China | People's Republic of China | 307 |
| Congo (Brazzaville) | Republic of the Congo | 14 |
| Congo (Kinshasa) | Democratic Republic of the Congo | 74 |
| Curacao | Curaçao | 1 |
| Falkland Islands |  | 1 |
| French Polynesia | France | 1 |
| Gabon | Equatorial Guinea | 1 |
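A report like the table above can be produced with a simple group-and-count over the joined rows (a sketch with toy data; the field names mirror the report headers but the rows are illustrative):

```python
from collections import Counter

# Toy reconstruction: count (ne_adm0name, wd_countrylabel) pairs where the
# NE country name and the Wikidata country label disagree.
rows = [
    {"ne_adm0name": "Curacao", "wd_countrylabel": "Curaçao"},
    {"ne_adm0name": "China", "wd_countrylabel": "People's Republic of China"},
    {"ne_adm0name": "China", "wd_countrylabel": "People's Republic of China"},
    {"ne_adm0name": "France", "wd_countrylabel": "France"},  # agrees: not reported
]

mismatches = Counter(
    (r["ne_adm0name"], r["wd_countrylabel"])
    for r in rows
    if r["ne_adm0name"] != r["wd_countrylabel"]
)
```

High counts flag systematic naming differences (China); singletons flag individual suspect features.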

This is a much bigger project than I expected, and there are still a lot of problems on the wikidata side. And linking NE with OSM via wikidata-id (and fixing the problems) is another 2-4 week project (maybe NE 4.1?)

Great research here! Yes, I'd like to fix all the GeoNames.org concordances up too. Agree this is a v4.1 milestone task. Tracking with https://github.com/nvkelso/natural-earth-vector/issues/220.

nvkelso commented 6 years ago

(I will probably add 2 more adm1 region capitals in Belgium for the v4 series but I can do those manually.)

nvkelso commented 6 years ago

@ImreSamu Any more progress to report on Wikipedia concordances here as Friday, Sept 29th approaches? Cheers!

ImreSamu commented 6 years ago

@nvkelso : Ohh Sorry, I am working on this topic!

now: I am fine-tuning my algorithm and fixing low-hanging wikidata problems

You can expect the first big output (and a longer answer) this weekend.

nvkelso commented 6 years ago

That's great news, thanks for the update! :)


ImreSamu commented 6 years ago

ne_10m_populated_places (7343) - wikidata status [gentime2017-09-17]

https://docs.google.com/spreadsheets/d/1SmAcOZ1O6y-RF30C7ni-KEzcr3MEcDZvYwJqHYX_d3E/edit?usp=sharing

This is a validation sheet of the ne_10m_populated_places content (7343 records), as of 'Date: Tue Sep 5 00:56:29 2017 -0700, commit 38685cd527a858b03740829ee9f75ebe78dc2829'.

| _quick_status | FREQ | my comment |
| --- | --- | --- |
| DEL-Disambiguation | 32 | wiki Disambiguation pages, can be removed |
| DEL-No en/de/es/fr/pt/ru/zh wiki page | 37 | no wiki page, can be removed |
| DEL-No location (lat,lon) on wikidata | 6 | no location information on wikidata, can be removed |
| DEL/WARN Distance>50km and country diff | 6 | probably wrong matches |
| DEL/WARN Extreme distance >500km | 5 | extreme wd_distance value |
| DEL/WARN Extreme distance 100-499km | 8 | .... |
| WARN Extreme distance 50-99km | 11 | ..... |

wd_location = wikidata location, can be more than one
wd_distance = SPARQL calculated average distance (km)
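For reference, one plausible way to compute such a distance check locally is the haversine great-circle formula (a sketch; the wd_distance above comes from SPARQL, so this is an equivalent, not the actual code):

```python
import math

def haversine_km(lat1, lon1, lat2, lon2):
    # Great-circle distance between two (lat, lon) points, in kilometers.
    r = 6371.0  # mean Earth radius in km
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))
```

A NE point more than ~50 km from its Wikidata coordinate would land in the WARN buckets above.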

in the next version I will add other checks:

  • country name diffs - checking
  • geonameid diffs - checking

other status

I am validating the algorithm; probably tomorrow I can give you the first part of the new matches.

now according to current validation status :

| _mstatus | FREQ | comment |
| --- | --- | --- |
| F1_OK | 6577 |  |
| F2_GOOD | 368 |  |
| F3_MEDIUM | 68 |  |
| F4_MAYBE | 39 | ~ need extreme manual check |
| F9_BAD | 266 | ~ I have no matches ... |

So probably about ~300 records will be without a wikidataid. Problematic areas: China, Russia, Congo (Kinshasa), Kazakhstan, ...

nvkelso commented 6 years ago

Thanks for the update! :)


ImreSamu commented 6 years ago

@nvkelso :

imho: this can be imported: 243 new matches
https://docs.google.com/spreadsheets/d/12ljwgq03n4z_uFoWReeMWp2TK15xlXqW73JkIkAKC_0/edit?usp=sharing

I have manually checked the extreme name differences (see the ImreSamu_comment column).

I have found:

I am working on the next data packages ...

nvkelso commented 6 years ago

amalgamated cities

This doesn't surprise me – Canada is still undergoing an amalgamation process. See even Toronto and the craziness with their mayor a few years back ;)

Moloundou, Cameroon

Yes, this place lat/lng needs to move north of the river and then the topology will also be corrected. That's on me. Thanks!

some distance ( 'wd_distance' ) is extreme > 80 Km

Also doesn't surprise me but I'll review the most crazy ones and update the NE lat/lngs. NE was built by hand before the geospatial revolution.

and sometimes no english wikipedia page - only espanol or russian, ...

That's okay, also doesn't surprise me because of Natural Earth's coverage.

I am working on the next data packages ...

I'm looking forward to the data packages! 😄

nvkelso commented 6 years ago

Please check Nacozari de García, Mexico

I agree with this match. The NE name should change.

ImreSamu commented 6 years ago

Added a new sheet "another_new_p2" (20 problematic matches, debugged). I have added the alternative wikidata id in the "ImreSamu_comment" column.

you can expect at least 2 more sheets ...

Comments:

Municipality vs City/Town problem : probably the version used in the OSM is the better choice.

North Korea

France:

India:

Pec | Serbia

according to the name:

According to the distance: https://www.wikidata.org/wiki/Q208038 Čačak

Turnovo|Bulgaria

ImreSamu commented 6 years ago

"another_update_p3" sheet done

manually verified , some feedback

"Tel Aviv-Yafo Municipality " : https://www.wikidata.org/wiki/Q12410321 - no location information

Obando | Colombia

ImreSamu commented 6 years ago

I have finished this batch - the result: 3 sheets

I had planned another (4th) sheet, but it was mostly the "Municipality vs City/Town problem" again, and I realized that I need more time to analyze deeper.

The next package is expected ~Friday or ~Monday.

nvkelso commented 6 years ago

Great! I'll integrate this batch tonight :)


nvkelso commented 6 years ago

p2 sheet

This one is a mixed bag. I've accepted only a third of the comment IDs; the original new IDs were preferred. Below are the ones where I didn't accept the comments, and sometimes why.

p3 sheet

OMG: stuff like Honda the car maker vs. Honda the city will be the death of me. See also Door.

For Tel Aviv, my earlier comment:

A parallel Wikidata ID example of an admin feature versus its capital (unitary or otherwise) is Tel Aviv (Q33935) and Tel Aviv-Yafo Municipality (Q12410321). In this case the Natural Earth name is Tel Aviv-Yafo, but your suggestion of Q33935 results in many more Wikipedia links with all the translations (but all missing -Yafo). This is really splitting hairs 😉 Generally Natural Earth favors the agglomeration name if it's a formal governing body (another example is the Gold Coast in Australia). If the use-case is linking up with OpenStreetMap's http://www.openstreetmap.org/relation/1382494 they list Q33935, but the names and the polygon shape in OSM suggest that's in error and it should be Q12410321 instead, so I'm going to leave it. Anyhow... I'm only manually reviewing the top 10 of these that have an import scale rank of 2 or 3, and Tel Aviv-Yafo is the only one I take issue with. Sampling the remainder I don't see any suspicious candidates :)

But if the goal is translations and OSM concordance I suppose you're right in the practical sense. I'll change it to Q33935.

nvkelso commented 6 years ago

The following commits catch us up to the 3 tabs in the spreadsheet:

This leaves 264 features without Wikidata IDs (or 3.5% missing of total)

Only 4 of those are min_zoom < 6:

| NAME | ADM0NAME | ADM1NAME | LATITUDE | LONGITUDE | POP_MAX | GEONAMEID | min_zoom | wikidataid | wof_id |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Dulan | China | Gansu | 36.1665895783 | 98.2666011139 | 100 | -1 | 5.6 |  | 1141909221 |
| Houma | China | Shanxi | 35.6199821157 | 111.20999711 | 102400 | -1 | 5.1 |  | 1141909247 |
| Dire Dawa | Ethiopia | Dire Dawa | 9.5899947296 | 41.8600182686 | 252279 | 338832 | 5.6 |  | 421192777 |
| Santa Cruz | Ecuador | Gal | -0.5333150036 | -90.3499996356 | 11262 | -1 | 5.6 |  | 1141909231 |
nvkelso commented 6 years ago

Besides the 4 listed above with min_zoom < 6 (almost all of the others are at zoom 6), what more remains for us here in the v4.0 milestone?

There was reference above to:

T4: and if everything is ok: importing other validated unicode-names. ( at least more ~ 530 )

Can you tell me more about what that would entail?

In any event, by next Monday is okay as I'm not going to work on this over the weekend. But then I will need to focus just on the release management pieces for v4.0.

ImreSamu commented 6 years ago

next Monday is okay

ok ,

what more remains for us here in the v4.0 milestone?

my plan:

0. regenerating the full7343_records_sheet - like this old

1. checking extreme values

As I see there are 18 bad wikidata ids. My plan: if min_zoom < 6, manually fix; if min_zoom >= 6, remove.

extreme errors:
DEL-Disambiguation | 15
DEL-No location (lat,lon) on wikidata | 3

2. some prepare for importing other validated unicode-names.

probably I will add a new column ( _unicode_name_update ) with these rules:

    # needs the third-party Unidecode package: pip install Unidecode
    import unidecode

    if (ne_nameascii == ne_name
            and ne_name == unidecode.unidecode(wd_label)
            and ne_name != wd_label
            and wd_distance < 20):
        _unicode_name_update = wd_label
    else:
        _unicode_name_update = ''

to the full7343_records_sheet, and you can import what you want. And if you reject anything for any reason - no problem for me.

3. re-checking min_zoom < 6

Manually find the missing wikidataid-s ..

nvkelso commented 6 years ago

Sounds good :)

Please include the ascii (Unicode-decoded) name in a column so I can use that for any new features.

_n


ImreSamu commented 6 years ago

The last package: google-spreadsheets:ne-wikidata-2017-09-25

Status sheet:

Proposed changes (3 sheets):

nvkelso commented 6 years ago

@ImreSamu What do you think about applying the same WikidataID logic to the admin0 "countries" (around 400 features) and admin1 "states" layers (around 4,000 features) in Natural Earth?

This would make it easier to link up those min_label properties for projects like OpenMapTiles. I could open a new issue for that.

I'll likely finish updating the populated places suggestions in https://github.com/nvkelso/natural-earth-vector/issues/214#issuecomment-332065797 over the weekend and do the 4.0 release early next week.

ImreSamu commented 6 years ago

What do you think about applying the same WikidataID

I need a little research, but in theory, we can add Wikidataid everywhere:

  • admin0 "countries"
  • admin1 "states"
  • parks
  • lakes
  • ports
  • rivers
  • islands
  • ....

in practice: it is not so easy.

my gut feeling: the admin0 "countries" is the easiest part. But admin1 can be very hard (for example for Africa, Asia, and Central and South America);

not impossible, just hard ...

I'll likely finish the updating the populated places suggestions

thanks :) imho: the wikidataid (populated places) quality is now much better, but not perfect. I hope that I can fix the remaining problems for the 4.1 release. So please be very careful about the quality statements in the release notes! :)

nvkelso commented 6 years ago

Yeah, the adm0 countries would be most useful, and seems a manageable quantity to start with :)


nvkelso commented 6 years ago

After making changes for https://github.com/nvkelso/natural-earth-vector/issues/214#issuecomment-332065797 ending with commit ccfe9ba731e17b90e35528fb25162d995c7a4fc7, the updated stats are:

This leaves 246 features without Wikidata IDs (or 3.3% missing of total)!

Only 3 of those are min_zoom < 6:

| NAME | ADM0NAME | ADM1NAME | LATITUDE | LONGITUDE | POP_MAX | GEONAMEID | min_zoom | wikidataid | wof_id |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Jinxi | China | Liaoning | 40.750340799 | 120.829978393 | 2426000 | 2036434 | 5.6 |  | 890512899 |
| Dulan | China | Gansu | 36.1665895783 | 98.2666011139 | 100 | -1 | 5.6 |  | 1141909221 |
| Santa Cruz | Ecuador | Gal | -0.5333150036 | -90.3499996356 | 11262 | -1 | 5.6 |  | 1141909231 |

Note that I had a little trouble with a few features whose names started with a ' or contained other accent marks, but overall there were fewer than 10 problematic features, which I manually fixed in the join.

I've also added the wikidata labels as NAME_* columns (a base of 7343 features for all zooms, and 1186 for low zooms 0, 1, 2, 3, 4, and 5):

@ImreSamu I think this closes out this Github issue. We can discuss further Wikidata concordance work in the new #224.

nvkelso commented 6 years ago

Looks like many of the Chinese names include "city" at the end as the character 市, which is technically correct for the Wikipedia pages since they often are the city of a regional district by the same name, but we don't want that part labeled on the map, so I'm going to remove it in QGIS with:

replace("NAME_ZH",'市','')

There were 585 places with a terminal 市, which was stripped. For example, Beijing, the capital of China, is now correctly just 北京 instead of 北京市.
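Note that the QGIS `replace()` expression above removes every 市 in the string, not just a trailing one; a suffix-only variant (a sketch in Python, not what was run here) avoids touching any interior occurrences:

```python
def strip_city_suffix(name_zh):
    # Drop 市 ("city") only when it is the final character.
    return name_zh[:-1] if name_zh.endswith("市") else name_zh

print(strip_city_suffix("北京市"))  # 北京
```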

nvkelso commented 6 years ago

Another possible revision is 区, which means "area" or "district", for example 黄岩区 south of Shanghai.

Baidu and AutoNavi do show the names inclusive of 区, but on zoom-in those places receive a different label treatment (blue box) – they may not be cities on their own, or are special localadmin districts. At any rate: no change for them. AutoNavi uses the same styling for San Francisco, fwiw.

nvkelso commented 6 years ago

Followup in 07d52359f62475a4a66883225d4862001915409e to remove more bunk "(disambiguation)" and ", disambiguation" values from NAME_* columns.