sembruk / osm2xmap

Converter from OpenStreetMap data format to OpenOrienteering Mapper format.
GNU General Public License v3.0

fatal error on encountering UTF-8 letters outside ASCII #5

Open matkoniecz opened 8 years ago

matkoniecz commented 8 years ago

Example of synthetic input, based on real data, that causes the error:

<?xml version="1.0" encoding="UTF-8"?>
<osm version="0.6" generator="CGImap 0.4.0 (30408 thorn-03.openstreetmap.org)" copyright="OpenStreetMap and contributors" attribution="http://www.openstreetmap.org/copyright" license="http://opendatacommons.org/licenses/odbl/1-0/">
 <bounds minlat="50.0530600" minlon="19.8482400" maxlat="50.0545200" maxlon="19.8524600"/>
 <node id="447039358" visible="true" version="1" changeset="1926268" timestamp="2009-07-24T16:08:01Z" user="sledzik1984" uid="58785" lat="50.0541494" lon="19.8488857"/>
 <way id="38042707" visible="true" version="4" changeset="31621045" timestamp="2015-05-31T22:38:07Z" user="dziabaducha" uid="775276">
  <nd ref="447039358"/>
  <nd ref="447039360"/>
  <tag k="highway" v="ą"/>
 </way>
</osm>

For comparison, the following input, which differs only in replacing "ą" with "footway", does not cause a crash:

<?xml version="1.0" encoding="UTF-8"?>
<osm version="0.6" generator="CGImap 0.4.0 (30408 thorn-03.openstreetmap.org)" copyright="OpenStreetMap and contributors" attribution="http://www.openstreetmap.org/copyright" license="http://opendatacommons.org/licenses/odbl/1-0/">
 <bounds minlat="50.0530600" minlon="19.8482400" maxlat="50.0545200" maxlon="19.8524600"/>
 <node id="447039358" visible="true" version="1" changeset="1926268" timestamp="2009-07-24T16:08:01Z" user="sledzik1984" uid="58785" lat="50.0541494" lon="19.8488857"/>
 <way id="38042707" visible="true" version="4" changeset="31621045" timestamp="2015-05-31T22:38:07Z" user="dziabaducha" uid="775276">
  <nd ref="447039358"/>
  <nd ref="447039360"/>
  <tag k="highway" v="footway"/>
 </way>
</osm>

The first input (with "ą") results in:

./osm2xmap -i zoo.osm -s ISSOM_5000.omap 
Using files:
    * input OSM file       - zoo.osm
    * output XMAP file     - ./out.xmap
    * symbol set XMAP file - ISSOM_5000.omap
    * rules file           - ./rules.xml
Segmentation fault (core dumped)

Given that letters like żółćęśąźńŻÓŁĆĘŚĄŹŃ typically appear only in the name tag, which is not rendered on orienteering maps, a potential band-aid is to preprocess the input file and remove the non-ASCII letters (obviously, a proper solution would also allow processing data with letters beyond ASCII).

Note that such letters may also appear in the user field.
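The band-aid described above could look roughly like the following Python sketch. It is not part of osm2xmap; the function names `strip_non_ascii` and `sanitize_file` are hypothetical, and it assumes the input file really is UTF-8 encoded. It simply drops every character outside ASCII, so a value such as v="ą" becomes an empty string.

```python
import re

# Matches any single character outside the 7-bit ASCII range.
NON_ASCII = re.compile(r"[^\x00-\x7F]")


def strip_non_ascii(text: str, replacement: str = "") -> str:
    """Return text with every non-ASCII character replaced (removed by default)."""
    return NON_ASCII.sub(replacement, text)


def sanitize_file(src_path: str, dst_path: str) -> None:
    """Copy a UTF-8 text file to dst_path, dropping non-ASCII characters.

    Hypothetical helper: assumes src_path is valid UTF-8 and raises
    UnicodeDecodeError otherwise.
    """
    with open(src_path, encoding="utf-8") as src, \
            open(dst_path, "w", encoding="ascii") as dst:
        for line in src:
            dst.write(strip_non_ascii(line))
```

Calling `sanitize_file("zoo.osm", "zoo_ascii.osm")` and passing the second file to osm2xmap would be one way to try it. Letters in the user attribute are removed as well, since the whole file is filtered rather than individual tags.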

sembruk commented 8 years ago

<tag k="highway" v="ą"/>

I don't see any problems. It works.

$ ./osm2xmap -i utf8.osm -s /usr/share/openorienteering-mapper/symbol\ sets/5000/ISSOM_5000.omap 
Using files:
    * input OSM file       - utf8.osm
    * output XMAP file     - ./out.xmap
    * symbol set XMAP file - /usr/share/openorienteering-mapper/symbol sets/5000/ISSOM_5000.omap
    * rules file           - ./rules.xml
Using georeferencing:
    mapScale           0.200000
    declination        0.000000
    grivation          0.000000
    mapRefPoint        (0.000000, 0.000000)
    projectedRefPoint  (417700.873691, 5545244.388228)
    geographicRefPoint (19.850350, 50.053790)
    projectedCrsDesc   '+proj=utm +datum=WGS84 +zone=34'
    geographicCrsDesc  '+proj=latlong +datum=WGS84'
Loading rules 'ISOM2000 adapeted for cyclogaine'... 
WARNING: Symbol with code 401 didn't find
<...>
WARNING: Symbol with code 998 didn't find
WARNING: Symbol with code 998 didn't find
Ok
Converting nodes...
Ok
Converting ways...
WARNING: Node 447039360 didn't find
Ok
Converting relations...
Ok

Execution time: 0.000000 sec.

Maybe the problem is in your libroxml build?

matkoniecz commented 8 years ago

Maybe the problem is in your libroxml build?

Maybe. What is your libroxml version? I used the latest version from their git repository; now I will test the latest release (2.3.0).

matkoniecz commented 8 years ago

Or maybe there is some way to download a libroxml package (there is no obvious source, but...)?

matkoniecz commented 8 years ago

I tested with 2.3.0; the result is unchanged.

Final strace segment:

read(3, "", 4096)                       = 0
brk(0x9f68000)                          = 0x9f68000
_llseek(3, 0, [0], SEEK_SET)            = 0
read(3, "<?xml version=\"1.0\" encoding=\"UT"..., 38) = 38
read(3, "<map xmlns=\"http://openorienteer"..., 4096) = 4096
_llseek(3, 4134, [4134], SEEK_SET)      = 0
open("./in.osm", O_RDONLY)              = 4
fstat64(4, {st_mode=S_IFREG|0600, st_size=2065419, ...}) = 0
mmap2(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0xb76df000
read(4, "<?xml version=\"1.0\" encoding=\"UT"..., 4096) = 4096
--- SIGSEGV {si_signo=SIGSEGV, si_code=SEGV_MAPERR, si_addr=0x39475} ---
+++ killed by SIGSEGV (core dumped) +++
Segmentation fault (core dumped)

For now I have no idea what else could be tested (except making sure we use the same libroxml).

matkoniecz commented 8 years ago

There is also a possibility that differences between our environments lead to different behavior. I have 32-bit Ubuntu 14.04.4 LTS (Lubuntu distribution).

matkoniecz commented 8 years ago

Also, can you check whether the libroxml tests are failing for you - https://github.com/blunderer/libroxml/issues/68 ?

kevinhendricks commented 8 years ago

FWIW - that XML file may not be properly UTF-8 encoded, as that char exists as 1 byte in other encodings. Use a hex editor - not emacs or vim, as they guess the encoding - to look at that specific char's byte values.
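As an alternative to a hex editor, the byte check suggested above can be sketched in a few lines of Python. The helper names `find_non_ascii_bytes` and `is_valid_utf8` are hypothetical. For reference, "ą" (U+0105) is the two bytes c4 85 in UTF-8, but the single byte b1 in ISO-8859-2, so the hex output shows which encoding the file actually uses.

```python
def is_valid_utf8(data: bytes) -> bool:
    """Return True if data decodes cleanly as UTF-8."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


def find_non_ascii_bytes(data: bytes):
    """Return a list of (offset, hex string) for each run of bytes >= 0x80."""
    runs = []
    start = None
    for i, b in enumerate(data):
        if b >= 0x80:
            if start is None:
                start = i  # a new run of non-ASCII bytes begins here
        elif start is not None:
            runs.append((start, " ".join(f"{x:02x}" for x in data[start:i])))
            start = None
    if start is not None:  # the data ended inside a run
        runs.append((start, " ".join(f"{x:02x}" for x in data[start:])))
    return runs
```

Reading the file in binary mode, for example `data = open("zoo.osm", "rb").read()`, and passing it to both functions avoids any encoding guess by an editor.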

matkoniecz commented 8 years ago

UTF-8 is not supported by libroxml - see https://github.com/blunderer/libroxml/issues/63#issuecomment-218504903

A potential solution is to replace libroxml with something that works on more than ASCII, or to use a horrible workaround like:

a potential band-aid is to preprocess the input file and remove the non-ASCII letters (obviously, a proper solution would also allow processing data with letters beyond ASCII).

sembruk commented 7 years ago

A potential solution is to replace libroxml with something that works on more than ASCII

In TODO list.