Closed by nvkelso 6 years ago
The bulk of this was completed in 14cd3060fe5d347262c438d849b79070bfbdc3ce by joining the GeoNamesID in NE against gn:id in the WOF concordance file. The wof_id and wikidataid were copied over for ~5800 features.
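The join described above can be sketched like this (field names and ID values are hypothetical placeholders, not the real NE/WOF data):

```python
# Copy wof_id / wikidataid from the WOF concordance file onto NE features
# by matching NE's GeoNamesID against WOF's gn:id.  Values are dummies.
ne_features = [
    {"name": "Honolulu", "geonameid": 5856195},
    {"name": "Oslo",     "geonameid": 3143244},
]
# WOF concordances keyed by gn:id (dummy wof_id / wd:id values)
wof_by_gnid = {
    5856195: {"wof_id": 111111, "wd:id": "Q1111"},
    3143244: {"wof_id": 222222, "wd:id": "Q2222"},
}

for feat in ne_features:
    conc = wof_by_gnid.get(feat["geonameid"])
    if conc is not None:  # copy the concordance fields over, as in the commit
        feat["wof_id"] = conc["wof_id"]
        feat["wikidataid"] = conc["wd:id"]
```

Features whose GeoNamesID has no match in the concordance file are simply left untouched, which is why the remainder needed manual touch-ups.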
Touch-ups for zooms 2, 3, and 4 were completed in 5fa8401e16cf7e98615e74ec5f57570adfb095f0. There were somewhere between 20 and 50 of these. In some cases the GeoNamesID in NE was bad, in some other cases the gn:id in WOF was bad (like Honolulu), in other cases the feature doesn't exist in WOF (yet).
Current status:
- 5815 features have wof_id concordance now (78%)
- 5430 features have wikidataid concordance now (74%)
- 394 features have a wof_id but not a wikidata concordance
- 9 features do not have a wof_id but do have a wikidata concordance

Notes on future work:
Because the count is over 1000, a semi-automated approach is needed for the remainder.
- 117 features improved
- 1042
- 310
- 9
- 41
Next step is probably running the whole set through the python scripts in https://github.com/mapzen-data/wikipedia-notebooks to verify the existing work and fill in the gaps.
QA: After re-thinking the problem, I started with Wikidata GeoSPARQL-based queries (so no licensing issues).
This is a first example report, based on the latest commit :
There are 403 new matches from geonameid to wof_id via a file from @stepps00, based on his remainder import of Natural Earth places into WOF a few weeks ago. But a few of those are suspect based on work I did last night, so only 395 wof_id values were copied over in 6a6c17518f9f0fa7611f2a6770885b94f4728699.
@ImreSamu 2.5% error rate isn't bad! :) I've spot-checked your list and I agree with the changes. I've made them, with 143 changed wikidataid concordances, in 51851a43090f871151c40cb6a9565beefdda8a9d. (The FID isn't necessarily stable, so I did a join based on ~ name & admin0 & the old wikidataid.)
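Since the FID isn't stable, that join key (name & admin0 & the old wikidataid) might be sketched like this (record shape and values are hypothetical):

```python
# Match incoming corrections to NE rows on a composite key instead of
# the unstable FID.  All field names and QID values here are dummies.
def join_key(rec):
    return (rec["name"], rec["adm0name"], rec["old_wikidataid"])

ne_rows = [
    {"name": "Honolulu", "adm0name": "United States of America",
     "old_wikidataid": "Q999999"},
]
by_key = {join_key(r): r for r in ne_rows}

# An incoming correction keyed the same way:
fix = {"name": "Honolulu", "adm0name": "United States of America",
       "old_wikidataid": "Q999999", "new_wikidataid": "Q111111"}

row = by_key.get(join_key(fix))
if row is not None:
    row["wikidataid"] = fix["new_wikidataid"]
```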
838 places added wof_id via 3529fe5a2583bdde1dc3c63514105f4e37a5f890, in addition to the 403 in https://github.com/nvkelso/natural-earth-vector/issues/214#issuecomment-324217775. This is now 100% complete for wof_id concordances. (But we remain at 5430 features with wikidataid concordance, or 74%.)
There look to be 13 places that need revision in Who's On First (they look like new IDs for existing places). I think the wof_id on the left is good, and the one on the right should be superseded into the one on the left. /cc @stepps00
@nvkelso: Thanks! :)
I have created new wikidataid proposals (504 new wikidataid):
The other missing 1409 are not so easy to match. As I see it, the main problem categories:
EDIT:
@ImreSamu can you share the Wikidata SPARQL queries here, please, so we have documentation if it's needed again in the future?
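In the meantime, here is a minimal sketch of the kind of Wikidata GeoSPARQL query presumably involved (the real queries live in @ImreSamu's QA repo; P625 is the Wikidata coordinate property and Q486972 is "human settlement", but the overall query shape is an assumption):

```python
# Build a Wikidata SPARQL query that finds populated places within a
# radius of a point, using the wikibase:around geospatial service.
def nearby_places_query(lat, lon, radius_km=25):
    return f"""
    SELECT ?place ?placeLabel ?location WHERE {{
      SERVICE wikibase:around {{
        ?place wdt:P625 ?location .
        bd:serviceParam wikibase:center "Point({lon} {lat})"^^geo:wktLiteral .
        bd:serviceParam wikibase:radius "{radius_km}" .
      }}
      # restrict to (subclasses of) human settlement
      ?place wdt:P31/wdt:P279* wd:Q486972 .
      SERVICE wikibase:label {{ bd:serviceParam wikibase:language "en" . }}
    }}
    """

query = nearby_places_query(21.3069, -157.8583)  # e.g. around Honolulu
```

The query string can then be POSTed to the public endpoint at https://query.wikidata.org/sparql.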
Current status: 5430 + 504 = 5934 with wikidata id concordance out of 7343 total = 81%.
Of the 1409 that are still missing concordance, perhaps @ImreSamu can harvest another 600 per https://github.com/nvkelso/natural-earth-vector/issues/214#issuecomment-324396862.
In the meantime I've been looking at Olga's work from last summer (per https://github.com/nvkelso/natural-earth-vector/issues/214#issuecomment-323953779). I've gotten it to run, but I seem to be missing the part where name + adm0name is searched for, or name + adm1name. Notwithstanding that...
I'm on vacation starting Friday and the following week. I'll post an update tomorrow and then be away from the computer but on email.
Posting my intermediate results here as a GIST: https://gist.github.com/nvkelso/06393fcfda298c98571bb3d3a3845e8c.
I finished manually reviewing the top 550, which are sometimes false negatives. The positives after that I spot-checked a lot less. They mostly seem right, except some are for disambiguation pages and, rarely, for a different place altogether.
@ImreSamu maybe you can run these candidates through the same process to determine which you think are more or less valid, too?
@nvkelso
You can find my NaturalEarth vs. Wikidata QA code here: github.com/ImreSamu/natural-earth-vector-qa
I have created a simple scoring system, based in part on _geonames_status=OK.
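A toy sketch of what such a scoring system might look like. The S1-S4 thresholds mirror the buckets in the status table below, but the individual signals and weights here are assumptions, not the actual QA code:

```python
# Toy match scorer: name match, country match, and proximity each add
# points; buckets follow the S1-S4 thresholds from the report.
def match_score(ne, wd, distance_km):
    score = 0
    if ne["name"].lower() == wd["label"].lower():
        score += 80
    if ne["adm0name"] == wd["country"]:
        score += 30
    if distance_km < 20:
        score += 30
    elif distance_km < 50:
        score += 10
    return score

def bucket(score):
    if score > 120:
        return "S1-Very good match"
    if score >= 90:
        return "S2-Good match"
    if score >= 40:
        return "S3-Maybe"
    return "S4-Not found in wikidata"
```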
Current status:
_status | _wikidata_status | _geonames_status | n |
---|---|---|---|
S1-Very good match ( _score > 120) | | | 66 |
S1-Very good match ( _score > 120) | | OK | 160 |
S1-Very good match ( _score > 120) | DIFF | | 18 |
S1-Very good match ( _score > 120) | DIFF | OK | 127 |
S1-Very good match ( _score > 120) | EQ | | 631 |
S1-Very good match ( _score > 120) | EQ | OK | 4359 |
S2-Good match ( 90 - 120) | | | 467 |
S2-Good match ( 90 - 120) | | OK | 58 |
S2-Good match ( 90 - 120) | DIFF | | 58 |
S2-Good match ( 90 - 120) | DIFF | OK | 21 |
S2-Good match ( 90 - 120) | EQ | | 10 |
S2-Good match ( 90 - 120) | EQ | OK | 223 |
S3-Maybe ( 40 - 90) | | | 174 |
S3-Maybe ( 40 - 90) | | OK | 75 |
S3-Maybe ( 40 - 90) | DIFF | | 25 |
S3-Maybe ( 40 - 90) | DIFF | OK | 18 |
S3-Maybe ( 40 - 90) | EQ | | 16 |
S3-Maybe ( 40 - 90) | EQ | OK | 139 |
S4-Not found in wikidata ( score < 40) | | | 416 |
S4-Not found in wikidata ( score < 40) | DIFF | | 301 |
S4-Not found in wikidata ( score < 40) | DIFF | OK | 17 |
S4-Not found in wikidata ( score < 40) | EQ | | 1 |
_status | n |
---|---|
S1-Very good match ( _score > 120) | 5361 |
S2-Good match ( 90 - 120) | 837 |
S3-Maybe ( 40 - 90) | 447 |
S4-Not found in wikidata ( score < 40) | 735 |
Comment for the 735 cases in S4-Not found in wikidata ( score < 40): these may have a wikidataid candidate, or a wikidataid in the natural-earth database, but my current SPARQL query does not find them. ( TODO )

Hi @nvkelso and @ImreSamu
It is great to see the efforts to add WikiDataIDs to Natural Earth Data.
I would like to contribute here with the manually verified conflicting or missing links between Natural Earth Data and related Wikipedia pages, if it helps you in the process.
These were done by students for Klokan Technologies about 5 years back (on very old Natural Earth data, at a time when WikiData was in its infancy). The students spent about a week finding the proper Wikipedia articles for the records which were not linkable automatically (with a defined Levenshtein distance and geographic location verification). The data is probably not usable directly, but it may help you save a bit of time on the cases which can't be automated:
https://gist.github.com/klokan/3d6d97c3d95856b18b8dcde81fe69e1b
https://gist.github.com/klokan/4e0800bcc04781e2c56cf57fc1e41b07
We waive all copyrights on this - it is freely reusable. If you find it helpful, it would be kind to mention KlokanTech in the announcements of results. We are just trying to help.
Current status:
- _wd_match_wikidataid_diffs - updates (229)
- _wd_match_wikidataid_new - new wikidataid (885)

Columns:
Important comment:
I try to continue the testing ...
@nvkelso: Thank you for your GIST file; I have used it in my manual testing. It has already helped a lot in detecting programming errors, building better SPARQL filters, etc.
@ImreSamu wow, those are epic Wikidata threads! Thanks for sharing them, I can sympathize and feel their pain working with just a small selection in Natural Earth and a more comparable set in the Who's On First gazetteer.
Here's a good example, the 2nd in your diff, for Targoviste, Romania: your newly proposed Wikidata ID is for a page with lots of translations and a good English Wikipedia page. The ceb-related one (from the ceb bot) seems to be a "mirror world" GeoNames-related import that's really a duplicate of the existing Wikidata feature.
Here's another for Beringovskiy, Russia:
Here the existing NE wikidata ID is for a disambiguation page, where your proposal is for the real place. Yeah!
Final example, Douglas, Isle of Man:
Your proposal is actually in Isle of Man, the existing wikidata ID is for an obviously incorrect place in Scotland.
I'm noticing a trend: the lower Wikidata ID value is probably the correct one. Might be useful to you as a tie-breaker between otherwise similar candidates.
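That tie-breaker heuristic can be written as a one-liner (a sketch, not part of the actual QA code):

```python
# Among otherwise-similar candidate QIDs, prefer the numerically lowest:
# older items are less likely to be bot-imported duplicates.
def prefer_lowest_qid(candidates):
    return min(candidates, key=lambda qid: int(qid.lstrip("Q")))

best = prefer_lowest_qid(["Q12410321", "Q33935"])  # → "Q33935"
```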
I checked a few GeoNames.org concordances and I'm mixed about those. For instance, I've seen links to the admin feature instead of the capital of the same feature (often they are "unitary" features). I'm going to import them only in cases where NE doesn't already have a GN_ID.
A parallel Wikidata ID example of an admin feature versus its capital (unitary or otherwise) is Tel Aviv (Q33935) and Tel Aviv-Yafo Municipality (Q12410321). In this case the Natural Earth name is Tel Aviv-Yafo, but your suggestion of Q33935 results in many more Wikipedia links with all the translations (but all missing -Yafo). This is really splitting hairs 😉 Generally Natural Earth favors the agglomeration name if it's a formal governing body (another example is the Gold Coast in Australia). If the use-case is linking up with OpenStreetMap, http://www.openstreetmap.org/relation/1382494 lists Q33935, but the names and the polygon shape in OSM suggest that's in error and it should be Q12410321 instead, so I'm going to leave it. Anyhow... I'm only manually reviewing the top 10 of these that have an import scale rank of 2 or 3, and Tel Aviv-Yafo is the only one I take issue with. Sampling the remainder I don't see any suspicious candidates :)
(The WD population is useful, but since we don't have a complete set I'm going to skip importing it for now, and look forward to retaining it in your future analysis.)
I think it's worthwhile to import your diffs and new now. If one of them turns out to be funky / superseded in the whole Cebuano fix-up then we can update NE later (it's probably going to take them a while to work through their funk, and your work is "better than yesterday", so let's carry it forward to tomorrow).
The import is in two commits:
wikidataid
@klokan Thanks for the GISTs in https://github.com/nvkelso/natural-earth-vector/issues/214#issuecomment-326575186. I'm not going to do anything with them immediately, but they have good content that may be useful later :)
@ImreSamu I'm happy with where this is for a v4 Natural Earth release now. This does have the implication that these 525 features wouldn't get included / translated, depending on the join technique, so that will need to be managed.
There are only 18 features at zoom 5 missing the Wikidata IDs, 378 at zoom 6, 116 at zoom 7, and 13 at zoom 8+. Most maps only translate names at zooms 0 to 6 so this is pretty good! 😄
It's probably more than a couple weeks' work to sort through this set to determine whether OSM even has Wikidata links for them, whether the Wikidata ID pages exist, etc. (My earlier estimate in https://github.com/nvkelso/natural-earth-vector/issues/214#issuecomment-324552142 was 259 missing minimum versus 525 actual now, so a chunk of work either way.) I've got a few other tasks for the v4 release to knock out, so I'm going to move on to those now.
I could be persuaded otherwise if you see there is more to gain based on your earlier OpenMapTiles joins. What do you think?
@nvkelso Thank you for accepting my suggestions.
Probably I need more time to finish the current checking (about 1-2 weeks).
My planned to-do list for ~ NE v4.0
T1: the latest _new and _diff lists, based on high-score matches. But some type of validation is still missing for the other "low score" or "not matched wikidataid" values imported in the early days (~228). At least I would like to detect and clean up what remains:
T2: (minimal) checking (only the low-hanging fruit)
T3: Checking the missing wikidataid for zoom 0-6
T4: and if everything is OK: importing other validated unicode names (at least ~530 more)
If we have a strict deadline for v4.0 then I need to re-think the priorities (probably T1 and T4 are the most important for the next milestone). imho: the biggest business value right now will be the unicode-name fixes, not the wikidata-ids.
This is a much bigger project than I expected, and there are still a lot of problems on the wikidata side. And linking NE with OSM via wikidata-id (and fixing the problems) is another 2-4 week project (maybe for NE 4.1?).
other comments:
I am using MAX(population) in my SPARQL query, but this is not always the latest value, so I agree with not importing these values (for now).
"GeoNames": I have detected a lot of differences.
My favorite test case for bad matching: "2074#Niagara Falls#United States of America#New York#Niagara Falls" has the same geonameid=6087892 as /Niagara_Falls,_Ontario/Canada
In terms of timing I'd like to release on or just before the Montréal NACIS meeting which is 33 days away. I'll gladly take more contributions from you until Friday, Sept 29th, but after that I'll only be focused on packaging the release.
Yes, I'm happy for even more followup in a v4.1 issue with more changes! (See below for followup discussion and issues.) Thanks for all your help with this 😄
If we have a strict deadline for v4.0 then I need to re-think the priority. ( probably the T1, T4 is the most important for the next milestone )
I'd do it T4, T1, T2, T3 personally but whatever works for you 😉
imho: the current biggest business value will be the unicode-name fixes, not the wikidata-ids.
If you provide a 100% coverage table (all 7343 features) with the unicode changes those are easy for me to just accept and change the default NE name (since the nameascii is already there).
Some of the other fuzzy name matches I'd like to track for researching and fixing in the v4.1 milestone. Sometimes it's a transliteration stylistic difference (like with Russian), sometimes it's because there is a ", Countryname" or similar suffix appended, sometimes it's a Spanish name with a crazy long formal name where the conventional name is shorter, sometimes it's a former USSR place where they've redone all the names. All of those take time to evaluate, accept, and ensure the old name is stored into nameparenthetical or namealt. Tracking this with: https://github.com/nvkelso/natural-earth-vector/issues/219.
Per T2 comment above for https://gist.github.com/ImreSamu/84d3603ef4cbf14a0550cdd8491531b2:
ne_adm0name | wd_countrylabel | count(*) as N |
---|---|---|
Aland | Finland | 1 |
Does this mean that there is 1 place that NE says is in Aland but Wikidata says is in Finland? That's somewhat explainable, since Aland is a special region of Finland. Could be a subtle error, though.
Some of these are easily explainable (like Congo, China, and Curaçao). Others, like Argentina/Bolivia and Gabon/Equatorial Guinea, may be errors. The ne_adm0name value was constructed long ago by a spatial join between the populated places points and the NE 1:10,000,000 country themes, with manual fixes over the years. Some options:
For the first case, there is a town on the Germany-Austria border that I fixed a couple days ago in the v4 series by moving its lat/lng a tiny bit north-west. I can take care of edits like that if you point them out to me.
(bigger snippet of table below)
ne_adm0name | wd_countrylabel | count(*) as N |
---|---|---|
Aland | Finland | 1 |
American Samoa | United States of America | 1 |
Antarctica | 29 | |
Argentina | Bolivia | 1 |
China | People's Republic of China | 307 |
Congo (Brazzaville) | Republic of the Congo | 14 |
Congo (Kinshasa) | Democratic Republic of the Congo | 74 |
Curacao | Curaçao | 1 |
Falkland Islands | 1 | |
French Polynesia | France | 1 |
Gabon | Equatorial Guinea | 1 |
This is much bigger project than I expected, and still lot of problems on the wikidata side. And linking NE with OSM via wikidata-id is ( and fixing the problems) is an another 2-4 weeks project ( maybe the NE4.1? )
Great research here! Yes, I'd like to fix all the GeoNames.org concordances up too. Agree this is a v4.1 milestone task. Tracking with https://github.com/nvkelso/natural-earth-vector/issues/220.
(I will probably add 2 more adm1 region capitals in Belgium for the v4 series but I can do those manually.)
@ImreSamu Any more progress to report on Wikipedia concordances here as Friday, Sept 29th approaches? Cheers!
@nvkelso : Ohh, sorry - I am working on this topic!
Right now I am fine-tuning my algorithm and fixing low-hanging wikidata problems.
You can expect the first big output (and a longer answer) this weekend.
That's great news, thanks for the update! :)
https://docs.google.com/spreadsheets/d/1SmAcOZ1O6y-RF30C7ni-KEzcr3MEcDZvYwJqHYX_d3E/edit?usp=sharing This is a validation sheet of the (ne_10m_populated_places 'Date: Tue Sep 5 00:56:29 2017 -0700 commit 38685cd527a858b03740829ee9f75ebe78dc2829' ) content (7343rec)
_quick_status | FREQ | my comment |
---|---|---|
DEL-Disambiguation | 32 | wiki Disambiguation pages , can be removed |
DEL-No en/de/es/fr/pt/ru/zh wiki page | 37 | no wiki page, can be removed |
DEL-No location (lat,lon) on wikidata | 6 | no location information on the wikidata, can be removed |
DEL/WARN Distance>50km and country diff | 6 | probably wrong matches |
DEL/WARN Extreme distance >500km | 5 | extreme wd_distance value |
DEL/WARN Extreme distance 100-499km | 8 | .... |
WARN Extreme distance 50- 99km | 11 | ..... |
wd_location = wikidata location (can be more than one)
wd_distance = SPARQL-calculated average distance (km)
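A sketch of how a wd_distance-style check could work: great-circle (haversine) distance between the NE point and the Wikidata coordinate, bucketed per the warning categories above (the flag function is an illustration, not the actual QA code):

```python
from math import radians, sin, cos, asin, sqrt

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two lat/lon points."""
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    a = (sin((lat2 - lat1) / 2) ** 2
         + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371 * asin(sqrt(a))  # mean Earth radius ~6371 km

def distance_flag(km):
    """Bucket a distance into the warning categories from the table."""
    if km > 500:
        return "DEL/WARN Extreme distance >500km"
    if km >= 100:
        return "DEL/WARN Extreme distance 100-499km"
    if km >= 50:
        return "WARN Extreme distance 50- 99km"
    return ""
```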
in the next version I will add other checks:
I am validating the algorithm; probably tomorrow I can give you the first part of the new matches.
Now, according to current validation status:

_mstatus | FREQ |
---|---|
F1_OK | 6577 |
F2_GOOD | 368 |
F3_MEDIUM | 68 |
F4_MAYBE | 39 ~ need extreme manual check |
F9_BAD | 266 ~ I have no matches ... |
So probably about ~300 records will remain without a wikidataid.
Problematic areas: China, Russia, Congo (Kinshasa), Kazakhstan, ...
Thanks for the update! :)
@nvkelso :
imho: this can be imported: 243 new matches
https://docs.google.com/spreadsheets/d/12ljwgq03n4z_uFoWReeMWp2TK15xlXqW73JkIkAKC_0/edit?usp=sharing
I have manually checked the extreme name differences (see the ImreSamu_comment column).
I have found:
- ne_adm0name - "Moloundou" probably in Cameroon

I am working on the next data packages ...
amalgamated cities
This doesn't surprise me - Canada is still undergoing an amalgamation process. See Toronto and the craziness with their mayor a few years back ;)
Moloundou, Cameroon
Yes, this place's lat/lng needs to move north of the river, and then the topology will also be corrected. That's on me. Thanks!
some distance ( 'wd_distance' ) is extreme > 80 Km
Also doesn't surprise me but I'll review the most crazy ones and update the NE lat/lngs. NE was built by hand before the geospatial revolution.
and sometimes no english wikipedia page - only espanol or russian, ...
That's okay, also doesn't surprise me because of Natural Earth's coverage.
I am working the next data packages ...
I'm looking forward to the data packages! 😄
Please check Nacozari de García, Mexico
I agree with this match. The NE name should change.
Added a new sheet "another_new_p2" (20 problematic matches - debugged). I have added the alternative wikidata id to the "ImreSamu_comment" column.
you can expect at least 2 more sheets ...
According to the name:
According to the distance: https://www.wikidata.org/wiki/Q208038 Čačak

Problems found in ne_wikidataid:
- Disambiguation
- No en/de/es/fr/pt/ru/zh wiki page
- No location (lat,lon) on wikidata

"Tel Aviv-Yafo Municipality": https://www.wikidata.org/wiki/Q12410321 - no location information
Obando | Colombia
I have finished this batch - the result: 3 sheets.
I had planned another (4th) sheet, but it was mostly the "Municipality vs City/Town" problem again, and I realized that I need more time to analyze it more deeply.
The next package is expected ~Friday or ~Monday.
Great! I'll integrate this batch tonight :)
This one is a mixed bag. I've accepted only 1/3rd of the comment IDs; the original new IDs were preferred. Below are the ones where I didn't accept the comments, and sometimes why.
- Q899515 has many translations and the other one only has en; plus its English page just redirects to Q899515.
- Q1353228 is the capital of the commented ID, has more translations, etc.
- OMG: stuff like Honda the car maker vs. Honda the city will be the death of me. See also Door.
- Obando: changed to Inírida in Colombia, linking it as Q130281.

For Tel Aviv, my earlier comment:
A parallel Wikidata ID example of an admin feature versus its capital (unitary or otherwise) is Tel Aviv (Q33935) and Tel Aviv-Yafo Municipality (Q12410321). In this case the Natural Earth name is Tel Aviv-Yafo, but your suggestion of Q33935 results in many more Wikipedia links with all the translations (but all missing -Yafo). This is really splitting hairs 😉 Generally Natural Earth favors the agglomeration name if it's a formal governing body (another example is the Gold Coast in Australia). If the use-case is linking up with OpenStreetMap, http://www.openstreetmap.org/relation/1382494 lists Q33935, but the names and the polygon shape in OSM suggest that's in error and it should be Q12410321 instead, so I'm going to leave it. Anyhow... I'm only manually reviewing the top 10 of these that have an import scale rank of 2 or 3, and Tel Aviv-Yafo is the only one I take issue with. Sampling the remainder I don't see any suspicious candidates :)
But if the goal is translations and OSM concordance I suppose you're right in the practical sense. I'll change it to Q33935.
The following commits catch us up to the 3 tabs in the spreadsheet:
This leaves 264 features without Wikidata IDs (or 3.6% of the total missing).
Only 4 of those are min_zoom < 6:
NAME | ADM0NAME | ADM1NAME | LATITUDE | LONGITUDE | POP_MAX | GEONAMEID | min_zoom | wikidataid | wof_id |
---|---|---|---|---|---|---|---|---|---|
Dulan | China | Gansu | 36.1665895783 | 98.2666011139 | 100 | -1 | 5.6 | 1141909221 | |
Houma | China | Shanxi | 35.6199821157 | 111.20999711 | 102400 | -1 | 5.1 | 1141909247 | |
Dire Dawa | Ethiopia | Dire Dawa | 9.5899947296 | 41.8600182686 | 252279 | 338832 | 5.6 | 421192777 | |
Santa Cruz | Ecuador | Gal | -0.5333150036 | -90.3499996356 | 11262 | -1 | 5.6 | 1141909231 |
Besides the 4 listed above with min_zoom < 6 (almost all the others are at zoom 6), what more remains for us here in the v4.0 milestone?
There was reference above to:
T4: and if everything is ok: importing other validated unicode-names. ( at least more ~ 530 )
Can you tell me more about what that would entail?
In any event, by next Monday is okay as I'm not going to work on this over the weekend. But then I will need to focus just on the release management pieces for v4.0.
next Monday is okay
ok ,
what more remains for us here in the v4.0 milestone?
my plan:
As I see, there are 18 bad wikidata ids. My plan: if min_zoom < 6, manually fix; if min_zoom >= 6, remove.

extreme errors | n |
---|---|
DEL-Disambiguation | 15 |
DEL-No location (lat,lon) on wikidata | 3 |
Probably I will add a new column ( _unicode_name_update ) with these rules:

    if (ne_nameascii == ne_name
            and ne_name == unidecode.unidecode(wd_label)
            and ne_name != wd_label
            and wd_distance < 20):
        _unicode_name_update = wd_label
    else:
        _unicode_name_update = ''
to the full7343_records_sheet, and you can import what you want. And if you reject any of it, for any reason - no problem for me.
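A runnable version of that rule, for illustration. The original uses the third-party unidecode package; the NFKD fold below is only a rough stand-in that handles accented Latin names:

```python
import unicodedata

def ascii_fold(s):
    """Rough stand-in for unidecode: strip combining marks after NFKD."""
    return unicodedata.normalize("NFKD", s).encode("ascii", "ignore").decode("ascii")

def unicode_name_update(ne_name, ne_nameascii, wd_label, wd_distance):
    """Propose the Wikidata label as the new unicode name only when it is
    a safe accent-only upgrade of the current ASCII-equal NE name."""
    if (ne_nameascii == ne_name
            and ne_name == ascii_fold(wd_label)
            and ne_name != wd_label
            and wd_distance < 20):
        return wd_label
    return ""

# e.g. NE stores "Cacak" but Wikidata's label is "Čačak":
unicode_name_update("Cacak", "Cacak", "Čačak", 3.2)  # → "Čačak"
```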
min_zoom < 6
Sounds good :)
Please include the ascii (Unicode-decoded) name in a column so I can use that for any new features.
The last package: google-spreadsheets:ne-wikidata-2017-09-25
Status sheet:
Proposed changes ( 3 sheets ):

"02-update-unicode-name": 550 names can be updated. Columns:
- ne_name : the current 'name'
- ne_nameascii : "nameascii"
- _unicode_name_update : proposed new unicode names (from wd_label )

"03_new": 35 proposed new wikidataid
@ImreSamu What do you think about applying the same WikidataID logic to the admin0 "countries" (around 400 features) and admin1 "states" layers (around 4,000 features) in Natural Earth?
This would make it easier to link up those min_label properties for projects like OpenMapTiles. I could open a new issue for that.
I'll likely finish updating the populated places suggestions in https://github.com/nvkelso/natural-earth-vector/issues/214#issuecomment-332065797 over the weekend and do the 4.0 release early next week.
What do you think about applying the same WikidataID
I need to do a little research, but in theory we can add a Wikidataid everywhere:

- admin0 "countries"
- admin1 "states"
- parks
- lakes
- ports
- rivers
- islands
- ....

In practice: it is not so easy.
My gut feeling: the admin0 "countries" are the easiest part. But admin1 can be very hard (for example for Africa, Asia, and Central and South America);
not impossible, just hard ...
I'll likely finish the updating the populated places suggestions
thanks :) imho: the wikidataid (populated places) quality is now much better, but not perfect. I hope I can fix the remaining problems for the 4.1 release. So please be careful about the quality statements in the release notes! :)
Yeah, the adm0 countries would be most useful, and seem a manageable quantity to start with :)
After making changes for https://github.com/nvkelso/natural-earth-vector/issues/214#issuecomment-332065797 ending with commit ccfe9ba731e17b90e35528fb25162d995c7a4fc7, the updated stats are:
This leaves 246 features without Wikidata IDs (or 3.3% missing of total)!
Only 3 of those are min_zoom < 6:
NAME | ADM0NAME | ADM1NAME | LATITUDE | LONGITUDE | POP_MAX | GEONAMEID | min_zoom | wikidataid | wof_id |
---|---|---|---|---|---|---|---|---|---|
Jinxi | China | Liaoning | 40.750340799 | 120.829978393 | 2426000 | 2036434 | 5.6 | 890512899 | |
Dulan | China | Gansu | 36.1665895783 | 98.2666011139 | 100 | -1 | 5.6 | 1141909221 | |
Santa Cruz | Ecuador | Gal | -0.5333150036 | -90.3499996356 | 11262 | -1 | 5.6 | 1141909231 |
Note that I had a little trouble with a few features whose names started with a ' or contained other accent marks, but overall it was fewer than 10 problematic features that I manually fixed in the join.
I've also added the wikidata labels as NAME_* columns (base of 7343 for all zooms, and 1186 for low zooms 0, 1, 2, 3, 4, and 5):

- 7076 or 96.36% (zooms 0-5: 1182 or 99.66%)
- 6454 or 87.89% (zooms 0-5: 1173 or 98.90%)
- 6042 or 82.28% (zooms 0-5: 1160 or 97.81%)
- 6806 or 92.69% (zooms 0-5: 1175 or 99.07%)
- 6209 or 84.56% (zooms 0-5: 1146 or 96.63%)
- 6211 or 84.58% (zooms 0-5: 1169 or 98.57%)
- 6086 or 82.88% (zooms 0-5: 1152 or 97.13%)

@ImreSamu I think this closes out this Github issue. We can discuss further Wikidata concordance work in the new #224.
Looks like many of the Chinese names include "city" at the end, with the 市 character. That's technically correct for the Wikipedia pages, since they often are the city of a regional district by the same name, but we don't want to see that part labeled on the map, so I'm going to remove them in QGIS with: `replace("NAME_ZH",'市','')`
There were 585 places with a terminal 市 that were stripped. For example, Beijing, the capital of China, is now correctly just 北京 instead of 北京市.
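The same cleanup sketched in Python; note this version strips only a terminal 市, while the QGIS replace() above removes every occurrence:

```python
# Strip a trailing 市 ("city") from a NAME_ZH value, leaving other
# names untouched.
def strip_shi(name_zh):
    return name_zh[:-1] if name_zh.endswith("市") else name_zh

strip_shi("北京市")  # → "北京"
```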
Another possible revision is 区, which means "area" or "district", for example 黄岩区 south of Shanghai.
Baidu and AutoNavi do show the names including 区, but on zoom-in those places receive a different label treatment (blue box) - they may not be cities on their own, or are special localadmin districts. At any rate: no change for them. AutoNavi uses the same styling for San Francisco, fwiw.
Followed up in 07d52359f62475a4a66883225d4862001915409e to remove more bunk "(disambiguation)" and ", disambiguation" values from the NAME_* columns.
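A sketch of that NAME_* cleanup (the exact patterns in the commit may differ):

```python
import re

# Drop trailing "(disambiguation)" / ", disambiguation" markers from labels.
_DISAMBIG = re.compile(r"\s*(\(disambiguation\)|,\s*disambiguation)\s*$",
                       re.IGNORECASE)

def clean_label(label):
    return _DISAMBIG.sub("", label)

clean_label("Springfield (disambiguation)")  # → "Springfield"
```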
Join first to Who's On First based on the common GeoNames concordance and harvest Wikidata IDs from the Who's On First concordances. Verify the result by joining with OpenStreetMap and make one-off edits to fix any funk and fill in the gaps.