qgis / QGIS

QGIS is a free, open source, cross platform (lin/win/mac) geographical information system (GIS)
https://qgis.org
GNU General Public License v2.0

WFS / GML parse issue, but QGIS loads GML as file fine? #45017

Closed rduivenvoorde closed 3 years ago

rduivenvoorde commented 3 years ago

What is the bug or the crash?

QGIS fails to show the features of a GeoServer WFS. BUT: when I replay the same request via curl and download the GML, QGIS loads that file fine.

QGIS tries to request 1 feature several times, BUT reports that the response is not 'well-formed':

Retrying request https://myserver/wfs?SERVICE=WFS&REQUEST=GetFeature&VERSION=2.0.0&TYPENAMES=regelink:polygons&COUNT=1&SRSNAME=urn:ogc:def:crs:EPSG::28992: 3/3
2021-09-10T13:28:36 WARNING Error when parsing GetFeature response : Error: not well-formed (invalid token) on line 1, column 3809

If I use curl to retrieve it, QGIS loads it fine. Position 3809 falls exactly on the colon (":") in the following string: regelink:220_1_hsi

See b.zip (one feature) and c.zip (more features).

Note: the attributes in this data start with a number (from a PostGIS db) <= I'm aware this gives trouble.
Note 2: not sure if k:220 can depict some UTF code or so?

Steps to reproduce the issue

Retrying request http://localhost/geoserver/wfs?SERVICE=WFS&REQUEST=GetFeature&VERSION=2.0.0&TYPENAMES=test:polygons&STARTINDEX=0&COUNT=1000000&SRSNAME=urn:ogc:def:crs:EPSG::28992&BBOX=199565.33582072978606448,504735.82987997209420428,200661.82303728291299194,505453.42622588382801041,urn:ogc:def:crs:EPSG::28992: 3/3
2021-09-10T14:01:47     WARNING    Error when parsing GetFeature response : Error: not well-formed (invalid token) on line 1, column 3551

Versions

3.16 -> master

QGIS version 3.21.0-Master QGIS code revision 4e0d0f6692d
Qt version 5.15.2
Python version 3.9.7
GDAL/OGR version 3.2.2
PROJ version 7.2.1
EPSG Registry database version v10.008 (2020-12-16)
GEOS version 3.9.1-CAPI-1.14.2
SQLite version 3.36.0
PostgreSQL client version 13.4 (Debian 13.4-3)
SpatiaLite version 5.0.1
QWT version 6.1.4
QScintilla2 version 2.11.6
OS version Debian GNU/Linux bookworm/sid
       

This copy of QGIS writes debugging output.

Active Python plugins: QuickWKT, nominatim_locator_filter, NITK_RS-GIS_17, pdokservicesplugin, plugin_reloader, QuickOSM, HelloWorldPlugin, HCMGIS, GeoCoding, simplesvg, orientation, sagaprovider, processing, grassprovider

Supported QGIS version

New profile

Additional context

No response

rduivenvoorde commented 3 years ago

Trying to debug this myself, setting a breakpoint in the parser part where the error is returned: (qgsgml.cpp line 454)...

Grabbing the output (output.txt): the error points to position 3555, but could it be something with the underscore '_' (0x5f) in the attribute column names?

        [3553]  't'     116     0x74    char
        [3554]  ':'     58      0x3a    char
        [3555]  '2'     50      0x32    char
        [3556]  '2'     50      0x32    char
        [3557]  '0'     48      0x30    char
        [3558]  '_'     95      0x5f    char
        [3559]  '1'     49      0x31    char
        [3560]  '_'     95      0x5f    char
        [3561]  'h'     104     0x68    char
        [3562]  's'     115     0x73    char
        [3563]  'i'     105     0x69    char
        [3564]  '>'     62      0x3e    char

rduivenvoorde commented 3 years ago

I was told that the xmltodict module is also Expat-based (if I am correct, the parsing in QGIS is done via the Expat XML library), so I tried:

import xmltodict
with open('./c.gml') as fd:
    doc = xmltodict.parse(fd.read())

but that runs fine, no parse issue?

rouault commented 3 years ago

This is a GeoServer bug, not a QGIS one. GeoServer should refuse to expose such a layer directly, or it should modify the attributes whose name starts with a digit. The identifier of an XML element must be a valid QName (https://en.wikipedia.org/wiki/QName), which implies that the unqualified part doesn't start with a digit.
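
To illustrate the QName rule, here is a rough sketch of a check on the unqualified part of an element name. The function name is made up for illustration, and `str.isalpha()` is only an approximation of the NameStartChar ranges listed in the XML specification, so treat it as a quick sanity check rather than a full validator:

```python
def local_name_starts_validly(qname: str) -> bool:
    """Rough check: the part after the prefix must start with a letter or '_'.

    Simplification: the XML spec lists explicit Unicode ranges for the first
    character of a name; str.isalpha() only approximates them.
    """
    # Keep only the unqualified (local) part, e.g. '220_1_hsi'.
    local = qname.split(":", 1)[-1]
    return bool(local) and (local[0].isalpha() or local[0] == "_")


for name in ("regelink:220_1_hsi", "regelink:identifica", "regelink:_220_1_hsi"):
    print(name, local_name_starts_validly(name))
```

Under this check, a column such as 220_1_hsi would need to be renamed (or prefixed with a letter or underscore) before it can serve as an element name.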

libxml2 rejects b.gml:

$ xmllint --noout b.gml
b.gml:1: namespace error : Failed to parse QName 'regelink:'
link:id><regelink:identifica>1.93100000001959E14</regelink:identifica><regelink:

The OGR GML driver when forced to use Xerces-C too:

$ GML_PARSER=XERCES ogrinfo b.gml -al -q

Layer name: polygons
ERROR 1: XML Parsing Error: invalid element name 'regelink:' at line 1, column 3810

Similarly if using the OGR GMLAS driver:

$ ogrinfo GMLAS:b.gml
ERROR 1: /vsicurl_streaming/https://geoserver-regelink.webgispublisher.nl/wfs?service=WFS&version=2.0.0&request=DescribeFeatureType&typeName=regelink%3Apolygons:12:104 invalid element name '220_1_hsi'

Here Xerces-C rejects the DescribeFeatureType response directly (the GMLAS driver is fully schema aware)

So your question of why the OGR GML driver, in its default Expat mode, accepts that file, while QGIS's QgsGmlStreamingParser, which also uses Expat, emits a "not well-formed (invalid token)" error, is a good one. I found that the difference is that the OGR GML driver uses Expat in namespace-unaware mode (namespaces of XML elements are discarded by the parser), whereas QGIS uses it in namespace-aware mode.

And that can be easily seen when using the Python Expat bindings:

$ python
>>> import xml.parsers.expat
>>> parser = xml.parsers.expat.ParserCreate()
>>> parser.ParseFile(open('b.gml', 'rb'))
1
>>> parser = xml.parsers.expat.ParserCreate(namespace_separator='?')
>>> parser.ParseFile(open('b.gml', 'rb'))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
xml.parsers.expat.ExpatError: not well-formed (invalid token): line 1, column 3809
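
The same difference should be reproducible without b.gml. A minimal sketch, assuming a made-up one-element document (the `t` prefix and the example.com namespace URI are invented for illustration and are not from the real WFS response):

```python
import xml.parsers.expat

# Made-up document: the local name of the inner element starts with a digit.
DOC = (b'<t:root xmlns:t="http://example.com/t">'
       b'<t:220_1_hsi>1</t:220_1_hsi></t:root>')


def parses(namespace_aware: bool) -> bool:
    """Return True if Expat accepts DOC in the given mode."""
    # A non-None separator switches Expat to namespace-aware processing.
    sep = " " if namespace_aware else None
    parser = xml.parsers.expat.ParserCreate(namespace_separator=sep)
    try:
        parser.Parse(DOC, True)
        return True
    except xml.parsers.expat.ExpatError as exc:
        print(f"namespace_aware={namespace_aware}: {exc}")
        return False


print("namespace-unaware mode accepts:", parses(False))
print("namespace-aware mode accepts:", parses(True))
```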

I'd say Expat not rejecting the file in namespace-unaware mode could be considered a bug (not sure if it is intended; perhaps running in that mode means that people expect laxer checks...)

I don't think we should try to do something on the QGIS side regarding that. If we wanted to, that would mean changing the parsing to namespace-unaware mode, but this could add potential fragility.

rduivenvoorde commented 3 years ago

Thanks @rouault for your research and explanation.

I created an issue at geoserver: https://osgeo-org.atlassian.net/jira/software/c/projects/GEOS/issues/GEOS-10231

The fact that QGIS parses the same output in different ways: I'm not really happy with that, it's not very consistent. BUT the current behaviour at least makes QGIS a little forgiving (in the case of the file at least)...

But I wonder if it would be nice if QGIS gave some more useful info to the average user. A lot of people are not aware of the Log messages panel, or are just not able to check.

The parser's warning actually points to ':' or 'regelink:', which are actually fine... it is the next characters that are the problem; that tricked me too. Would it help if we showed the text around column 3809 (in the above example) in the error message? And maybe propose some 'common' XML errors: ... uh... like: mwa, never mind.
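
Just to sketch the idea of showing the surrounding text, using the Python Expat bindings rather than the QGIS C++ code: the helper name, the `width` parameter and the sample document are all made up for illustration, and this is not how QgsGmlStreamingParser reports errors today.

```python
import xml.parsers.expat
from typing import Optional


def parse_error_with_context(data: bytes, width: int = 20) -> Optional[str]:
    """Return None if data parses, else the Expat message plus nearby text."""
    # Namespace-aware mode, to mimic the mode described above for QGIS.
    parser = xml.parsers.expat.ParserCreate(namespace_separator=" ")
    try:
        parser.Parse(data, True)
        return None
    except xml.parsers.expat.ExpatError as exc:
        pos = parser.ErrorByteIndex  # byte offset of the error in the input
        snippet = data[max(0, pos - width):pos + width]
        return f"{exc} near: {snippet.decode('utf-8', errors='replace')!r}"


# Made-up sample document, not the real WFS response.
SAMPLE = (b'<t:root xmlns:t="http://example.com/t">'
          b'<t:220_1_hsi>1</t:220_1_hsi></t:root>')
print(parse_error_with_context(SAMPLE))
```

With the offending name visible next to the message, a user would at least see the digits right after the colon instead of only a column number.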

Should I close this one?