joto / osmium

C++/Javascript framework for working with OSM files.
http://wiki.openstreetmap.org/wiki/Osmium
GNU General Public License v3.0

osmium_convert produces illegal XML for relation id 240302 version 2 #91

Closed mmd-osm closed 10 years ago

mmd-osm commented 10 years ago

I'm using osmium_convert (osmorg-taginfo-live-5-gab35431) to convert a Germany full-history extract (osh.pbf) to plain osh XML, as proposed by Peter in MaZderMind/osm-history-splitter#8.

Relation 240302, version 2 caused serious issues when post-processing the XML result. Apparently, someone managed to put two special characters into the role name "backward_stop" (see changeset 2919229). This has been corrected in a later version of the relation. Nevertheless, if you're dealing with the full history, you somehow have to handle this.

Processing such an extract with osmium_convert puts those special characters as-is into the output stream. Follow-on processes that depend on valid XML will thus fail: [Note: I've replaced the two special characters with *, as GitHub won't show them otherwise]

  <relation id="240302" version="2" timestamp="2009-10-22T09:18:37Z" uid="188705" user="Yanisin" changeset="2919229" visible="true">
    <member type="node" ref="494163269" role="forward_stop"/>
    <member type="node" ref="494163268" role="backw**ard_stop"/>
    <member type="way" ref="22742670" role=""/>
    <tag k="ref" v="O10"/>
    <tag k="name" v="Bus O10"/>
    <tag k="type" v="route"/>
    <tag k="route" v="bus"/>
    <tag k="network" v="VRR"/>
    <tag k="description" v="Mettmann S&#xFC;d Stadtwald S - Metzkausen Kantstra&#xDF;e"/>
  </relation>

Comparing this to osmconvert, you'll notice that both special characters are escaped:

        <relation id="240302" version="2" timestamp="2009-10-22T09:18:37Z" changeset="2919229" uid="188705" user="Yanisin" visible="true">
                <member type="node" ref="494163269" role="forward_stop"/>
                <member type="node" ref="494163268" role="backw&#27;&#27;ard_stop"/>
                <member type="way" ref="22742670" role=""/>
                <tag k="ref" v="O10"/>
                <tag k="name" v="Bus O10"/>
                <tag k="type" v="route"/>
                <tag k="route" v="bus"/>
                <tag k="network" v="VRR"/>
                <tag k="description" v="Mettmann Süd Stadtwald S - Metzkausen Kantstraße"/>
        </relation>

As I later found out, this escaped form also doesn't seem to work with expat as the XML parser: it throws the error message "reference to invalid character number". Only & a p o s ; was recognized [Note: extra space chars added on purpose]. This may be an issue with expat, though.
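The expat behaviour can be reproduced in isolation. The following is a minimal sketch using Python's standard `xml.parsers.expat` binding; the helper name `parse_error` and the shortened one-element documents are illustrative assumptions, not taken from the actual extract. It feeds expat the same `role` value in three forms: clean, with the numeric character reference as written by osmconvert, and with the raw bytes as written by osmium_convert.

```python
import xml.parsers.expat


def parse_error(document: str):
    """Return expat's error text for `document`, or None if it parses.

    Hypothetical helper for illustration only.
    """
    parser = xml.parsers.expat.ParserCreate()
    try:
        parser.Parse(document, True)  # True: this is the final chunk
    except xml.parsers.expat.ExpatError as exc:
        return xml.parsers.expat.ErrorString(exc.code)
    return None


# Same member element in three variants (shortened from the report above).
clean = '<member type="node" ref="494163268" role="backward_stop"/>'
escaped = '<member type="node" ref="494163268" role="backw&#27;&#27;ard_stop"/>'
raw = '<member type="node" ref="494163268" role="backw\x1b\x1bard_stop"/>'

for label, doc in (("clean", clean), ("escaped", escaped), ("raw", raw)):
    print(label, "->", parse_error(doc))
```

Both the escaped and the raw variant should be rejected, which suggests expat is following the XML rules here rather than misbehaving: a character reference to an illegal code point is no more acceptable than the literal character.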

=> What would be the best way to deal with this?

joto commented 10 years ago

The escape character (27, 0x1b) cannot appear in XML files (spec). Because we use XML in the API, this character should never have been added to the database. There was an oversight in the code that allowed this to happen. It has been fixed now (https://github.com/openstreetmap/openstreetmap-website/issues/758). We have also fixed this in the database, i.e. the character was removed from the history. Future history dumps will not contain the character. Because the character cannot appear in the data, the software doesn't have to take this case into account.
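For anyone still holding an older history dump that predates the database fix, the rule from the spec can be checked directly. The sketch below encodes the `Char` production of XML 1.0 (tab, line feed, carriage return, and the ranges 0x20-0xD7FF, 0xE000-0xFFFD, 0x10000-0x10FFFF); the function names are hypothetical and this is not part of osmium.

```python
def is_xml10_char(ch: str) -> bool:
    """True if `ch` is allowed by the XML 1.0 `Char` production."""
    cp = ord(ch)
    return (
        cp in (0x9, 0xA, 0xD)
        or 0x20 <= cp <= 0xD7FF
        or 0xE000 <= cp <= 0xFFFD
        or 0x10000 <= cp <= 0x10FFFF
    )


def illegal_positions(text: str):
    """List (index, hex code point) for every character XML 1.0 forbids."""
    return [(i, hex(ord(c))) for i, c in enumerate(text) if not is_xml10_char(c)]


def strip_illegal(text: str) -> str:
    """Drop forbidden characters; one possible clean-up policy among several."""
    return "".join(c for c in text if is_xml10_char(c))


role = "backw\x1b\x1bard_stop"    # role value from relation 240302, version 2
print(illegal_positions(role))    # [(5, '0x1b'), (6, '0x1b')]
print(strip_illegal(role))        # backward_stop
```

Dropping the characters mirrors what was done in the database; a consumer could just as well reject the object or log it instead.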

mmd-osm commented 10 years ago

Awesome. Thanks a lot for the update, Jochen!