stamen / terrain-classic

World-wide CartoCSS port of Stamen's classic terrain style
ISC License

natural earth data import to pg misses characters #32

Closed clhenrick closed 8 years ago

clhenrick commented 8 years ago

Something didn't go right for me when I re-made the natural earth data / imported to postgres. Maybe try using shp2pgsql instead of ogr2ogr?

screen shot 2015-09-19 at 10 07 28 pm

almccon commented 8 years ago

Good catch. I probably just need to pass the correct encoding to ogr2ogr. I figured I'd try using ogr since we're already using it elsewhere in the Makefile.

clhenrick commented 8 years ago

Ah that makes sense. I seem to always run into issues importing shp data into pg with ogr2ogr so now prefer to use shp2pgsql as it seems more reliable imo.

almccon commented 8 years ago

Can you confirm that this was working previously? I tried to switch to shp2pgsql and I'm getting the same errors.

I wonder if it's a corruption in the Natural Earth files? We had a weird error in the past where there was a strange encoding bug that we couldn't fix. We ended up having to manually edit the .dbf file to get it to have the correct encoding. See here, in our Stamen fork of Natural Earth: https://github.com/stamen/natural-earth-vector/commit/737ce368668f207ce23f30667cead69384d89b5d
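The thread does not say what the manual .dbf edit changed. As a hedged aside, two places where a shapefile commonly declares its text encoding are the language driver ID byte in the dBASE header (offset 29) and an optional `.cpg` sidecar file. The sketch below only reads both; the function name `declared_encoding` is hypothetical, and interpreting the numeric ID still requires a code-page table.

```python
from pathlib import Path
from typing import Optional, Tuple

LDID_OFFSET = 29  # language driver ID byte in the dBASE (.dbf) header


def declared_encoding(shp_path: str) -> Tuple[Optional[int], Optional[str]]:
    """Return (language driver ID, .cpg contents) for a shapefile.

    Either value is None when the corresponding file is missing
    (or, for the ID, when the header is too short to contain it).
    """
    dbf = Path(shp_path).with_suffix(".dbf")
    cpg = Path(shp_path).with_suffix(".cpg")

    ldid = None
    if dbf.exists():
        with dbf.open("rb") as handle:
            header = handle.read(32)  # the fixed part of the header is 32 bytes
        if len(header) > LDID_OFFSET:
            ldid = header[LDID_OFFSET]

    cpg_text = None
    if cpg.exists():
        cpg_text = cpg.read_text(encoding="ascii", errors="replace").strip()

    return ldid, cpg_text
```

Comparing these two values between the original Natural Earth file and a re-saved copy is one way to see whether the declared encoding changed, independent of what the attribute bytes actually contain.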

almccon commented 8 years ago

There's also some related info here: https://github.com/CartoDB/cartodb/issues/1143

clhenrick commented 8 years ago

Huh, I definitely don't remember seeing it earlier when importing via shp2pgsql. Did you use the encoding flag when you imported it that way?

almccon commented 8 years ago

I saved a modified version of ne_10m_admin_1_states_provinces_scale_rank in our Stamen fork of Natural Earth: https://github.com/stamen/natural-earth-vector/pull/2

I didn't actually modify it at all, I merely opened the file in QGIS and saved it again.

Now with e8752ffc09543b04aeeaa62aa1ba5d4f8c3ef0f3 I can import the file just fine, and the encoding just works.

screen shot 2015-09-22 at 11 47 10 am

clhenrick commented 8 years ago

:+1:

almccon commented 8 years ago

Oops, spoke too soon: LATIN1 characters are working, but other characters are not:

screen shot 2015-09-22 at 1 20 36 pm

http://localhost:8080/flat/#7/14.504/-254.498

almccon commented 8 years ago

And they're broken in the database, not just at the rendering step, so it's still an importing problem:

select name from ne_10m_admin_1_states_provinces_labels where name ILIKE '%ninh%';
    name
------------
 B?c Ninh
 Ninh Bình
 Ninh Thu?n
 Qu?ng Ninh
 Tây Ninh
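
The pattern in that output (ì and â survive, other Vietnamese letters become `?`) is what you would expect if the names passed through a single-byte charset at some point. This is only a sketch of one plausible mechanism, not a diagnosis of this import. The helper name `squeeze_into` is hypothetical, and the names are written with escapes so the code points are unambiguous.

```python
# Names as they would look in correctly decoded source data.
NAMES = [
    "B\u1eafc Ninh",     # U+1EAF is not in Windows-1252
    "Ninh B\u00ecnh",    # U+00EC is in Windows-1252
    "Ninh Thu\u1eadn",   # U+1EAD is not in Windows-1252
    "Qu\u1ea3ng Ninh",   # U+1EA3 is not in Windows-1252
    "T\u00e2y Ninh",     # U+00E2 is in Windows-1252
]


def squeeze_into(name: str, charset: str = "cp1252") -> str:
    """Round-trip a name through a single-byte charset.

    Characters the charset cannot represent are replaced with '?'.
    """
    return name.encode(charset, errors="replace").decode(charset)


for name in NAMES:
    print(squeeze_into(name))
```

Running this prints `B?c Ninh`, `Ninh Bình`, `Ninh Thu?n`, `Qu?ng Ninh` and `Tây Ninh`, the same strings as the query result. Once a character has been replaced by `?` the original is gone, so the fix has to happen before or during import rather than in the rendering step.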

clhenrick commented 8 years ago

huh, when you imported the natural earth data with shp2pgsql did you try passing that weird windows encoding with the -W flag? I think that's what I did when I imported the data and don't remember having this problem, but could be wrong.

almccon commented 8 years ago

Yup, I did use that flag. Actually, these characters are messed up even when I open them in QGIS... which the LATIN1 characters never were.

screen shot 2015-09-22 at 2 48 42 pm

Perhaps that happened when I re-saved this file... I also see that QGIS has an "encoding" drop-down that you can modify when opening shapefiles... and changing that from "WINDOWS 1252" to "System" changes whether it's scrambled or not.

I also just tried downloading the original NE file (not my modified one), and loading it into QGIS with Windows 1252 encoding causes messed up characters, but loading it as utf-8 is fine. And I just tested `shp2pgsql` using that file and `-W "utf-8"` also seems to create non-corrupted tables. So is it possible that this file (unlike the other NE files) actually is utf-8 natively and not Win 1252?

Perhaps if I wrote a special-case rule in the Makefile, we could load the original NE file (not the Stamen one) without any problems, if that file used `-W "utf-8"` and the other NE files used `-W "WINDOWS-1252"`.

Not sure. Giving up for now. Will revisit later. Or we just cheat and avoid showing admin_1 for most places. I'm sure we could comfortably suppress them for many countries (including Vietnam) without much loss.
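
The special-case idea above (UTF-8 for this one file, WINDOWS-1252 for the rest) could also be driven by a quick check instead of a hard-coded rule. The following is a minimal sketch, not the project's Makefile logic: both function names are hypothetical, the sample values are assumed to be raw attribute bytes already pulled out of the .dbf by some reader, and the only `shp2pgsql` option used is the `-W` flag mentioned in this thread.

```python
from typing import Iterable, List


def guess_text_encoding(samples: Iterable[bytes]) -> str:
    """Guess between UTF-8 and WINDOWS-1252 for raw attribute values.

    Heuristic: text that decodes strictly as UTF-8 is treated as UTF-8;
    anything else falls back to WINDOWS-1252. Pure ASCII is reported as
    UTF-8, which is harmless because ASCII is identical in both.
    """
    for raw in samples:
        try:
            raw.decode("utf-8")
        except UnicodeDecodeError:
            return "WINDOWS-1252"
    return "UTF-8"


def shp2pgsql_command(shapefile: str, table: str, encoding: str) -> List[str]:
    """Build the argument list for shp2pgsql with an explicit -W encoding."""
    return ["shp2pgsql", "-W", encoding, shapefile, table]


# Hypothetical sample values standing in for bytes read from a .dbf.
utf8_values = ["B\u1eafc Ninh".encode("utf-8"), "Ninh B\u00ecnh".encode("utf-8")]
cp1252_values = ["Ninh B\u00ecnh".encode("cp1252")]

print(shp2pgsql_command("admin_1.shp", "admin_1", guess_text_encoding(utf8_values)))
print(shp2pgsql_command("admin_0.shp", "admin_0", guess_text_encoding(cp1252_values)))
```

The check should only be run on attribute values, not on the whole .dbf file, because the binary header can contain bytes that are not valid UTF-8 regardless of the text encoding.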

clhenrick commented 8 years ago

Very strange. Looking at the Natural Earth Github repo I see that the ne_10m_admin_1_states_provinces are at version 3.0 while the admin 0 data looks like it's at v2.0 -- perhaps the encoding changed to utf-8 with the latest data updates?

almccon commented 8 years ago

Happens with Polish, too: screen shot 2015-11-04 at 11 42 13 am

clhenrick commented 8 years ago

@almccon I forget, did we ever check with Nathaniel about this? I know Stamen has its own port of Natural Earth, but I can't remember why that is.

almccon commented 8 years ago

Stamen has its own fork so we can make changes and have them available for our map styles. When I make a fix that I'm sure is reliable, I issue it as a pull request: https://github.com/nvkelso/natural-earth-vector/pulls

I know that the fundamental Natural Earth sources are in a Geodatabase, so everything in github is really just derived from that. So even if we make changes to the shapefiles and issue pull requests, someone else (Nathaniel I think) has to make the real changes to the sources.

I also don't fully understand the versioning process with Natural Earth, why some things are on v2 while others are v3.

...but mostly it's just because I haven't found the time to fully understand it.

clhenrick commented 8 years ago

@alan seems like pulling the 10m admin1 scale ranks polygon file from natural earth's website & loading it into QGIS with encoding 'utf-8' fixes the issue:

screen shot 2016-07-01 at 4 35 31 pm

almccon commented 8 years ago

@almccon

Do you have to do anything special when you save it from QGIS?
