rduivenvoorde closed this issue 3 years ago
Trying to debug this myself, setting a breakpoint in the parser part where the error is returned: (qgsgml.cpp line 454)...
Grabbing the output (output.txt): the error mentions line 3555, but could it have something to do with the underscore '_' (0x5f) in the attribute column names?
[3553] 't' 116 0x74 char
[3554] ':' 58 0x3a char
[3555] '2' 50 0x32 char
[3556] '2' 50 0x32 char
[3557] '0' 48 0x30 char
[3558] '_' 95 0x5f char
[3559] '1' 49 0x31 char
[3560] '_' 95 0x5f char
[3561] 'h' 104 0x68 char
[3562] 's' 115 0x73 char
[3563] 'i' 105 0x69 char
[3564] '>' 62 0x3e char
I was told that the xmltodict module is also 'expat' based (if I am correct, the parsing in QGIS is done via the expat XML lib), so I tried:
import xmltodict
with open('./c.gml') as fd:
    doc = xmltodict.parse(fd.read())
but that runs fine, no parse issue?
This is a GeoServer bug, not a QGIS one. GeoServer should either refuse to expose such a layer directly, or rename the attributes whose names start with a digit. The identifier of an XML element must be a valid QName (https://en.wikipedia.org/wiki/QName), which implies that the unqualified (local) part cannot start with a digit.
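To make the QName rule concrete, here is a minimal sketch of such a check in Python. It is a simplification: the pattern only covers the ASCII subset of the NCName production (the real rule also allows many non-ASCII letters), and the function name is_valid_qname is made up for this illustration, not part of any library.

```python
import re

# Simplified, ASCII-only approximation of the XML NCName production:
# the first character must be a letter or underscore, never a digit.
# (Assumption: the full rule also allows many non-ASCII characters.)
NCNAME_ASCII = re.compile(r"[A-Za-z_][A-Za-z0-9_.\-]*")


def is_valid_qname(name: str) -> bool:
    """Return True if `name` looks like `prefix:local` or `local`,
    with every part matching the simplified NCName pattern."""
    parts = name.split(":")
    if len(parts) > 2:
        # A QName has at most one colon.
        return False
    return all(NCNAME_ASCII.fullmatch(part) is not None for part in parts)


if __name__ == "__main__":
    for candidate in ("regelink:identifica", "regelink:220_1_hsi", "regelink:_220_1_hsi"):
        print(candidate, is_valid_qname(candidate))
```

With this check, regelink:220_1_hsi is rejected because its local part starts with '2', while a renamed attribute such as regelink:_220_1_hsi passes.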
libxml2 rejects b.gml:
$ xmllint --noout b.gml
b.gml:1: namespace error : Failed to parse QName 'regelink:'
link:id><regelink:identifica>1.93100000001959E14</regelink:identifica><regelink:
So does the OGR GML driver when forced to use Xerces-C:
$ GML_PARSER=XERCES ogrinfo b.gml -al -q
Layer name: polygons
ERROR 1: XML Parsing Error: invalid element name 'regelink:' at line 1, column 3810
Similarly if using the OGR GMLAS driver:
$ ogrinfo GMLAS:b.gml
ERROR 1: /vsicurl_streaming/https://geoserver-regelink.webgispublisher.nl/wfs?service=WFS&version=2.0.0&request=DescribeFeatureType&typeName=regelink%3Apolygons:12:104 invalid element name '220_1_hsi'
Here Xerces-C rejects the DescribeFeatureType response directly (the GMLAS driver is fully schema aware).
So your question is a good one: why does the OGR GML driver, in its default Expat mode, accept that file, while the QGIS QgsGmlStreamingParser, which also uses Expat, emits a "not well-formed (invalid token)" error? I found that the difference is that the OGR GML driver uses Expat in a namespace-unaware mode (namespaces of XML elements are discarded by the parser), whereas QGIS uses it in a namespace-aware mode.
And that can be easily seen when using the Python Expat bindings:
$ python
>>> import xml.parsers.expat
>>> parser = xml.parsers.expat.ParserCreate()
>>> parser.ParseFile(open('b.gml', 'rb'))
1
>>> parser = xml.parsers.expat.ParserCreate(namespace_separator='?')
>>> parser.ParseFile(open('b.gml', 'rb'))
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
xml.parsers.expat.ExpatError: not well-formed (invalid token): line 1, column 3809
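The same comparison can be reproduced without b.gml. The sketch below uses a small made-up document (the ns prefix, the urn:example namespace and the 220_a element name are assumptions for illustration) and a hypothetical helper parses_ok that reports whether Expat accepts the input in each mode.

```python
import xml.parsers.expat

# Made-up sample: the element's local name starts with a digit,
# like regelink:220_1_hsi in the GeoServer response.
SAMPLE = b'<root xmlns:ns="urn:example"><ns:220_a>1</ns:220_a></root>'


def parses_ok(data: bytes, namespace_separator=None) -> bool:
    """Return True if Expat parses `data` without raising ExpatError.

    namespace_separator=None gives the namespace-unaware mode;
    any one-character string enables namespace processing.
    """
    parser = xml.parsers.expat.ParserCreate(namespace_separator=namespace_separator)
    try:
        parser.Parse(data, True)  # True marks this as the final chunk
        return True
    except xml.parsers.expat.ExpatError:
        return False


if __name__ == "__main__":
    print("namespace-unaware:", parses_ok(SAMPLE))
    print("namespace-aware:  ", parses_ok(SAMPLE, namespace_separator=" "))
```

In namespace-unaware mode the colon is treated as an ordinary name character, so the digit after it is not at the start of a name; with namespace processing enabled, the part after the colon has to start like a name again.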
I'd say Expat not rejecting the file in namespace-unaware mode could be considered a bug (not sure if it is intended; perhaps running in that mode means that people expect laxer checks...)
I don't think we should try to do something on the QGIS side regarding that. If we wanted to, that would mean switching the parsing to namespace-unaware mode, but this could add potential fragility.
Thanks @rouault for your research and explanation.
I created an issue at geoserver: https://osgeo-org.atlassian.net/jira/software/c/projects/GEOS/issues/GEOS-10231
The fact that QGIS parses the same output in different ways: I'm not really happy with that, it's not very consistent. BUT the current behaviour at least makes QGIS a little forgiving (in the case of the file, at least)...
But I wonder if it would be nice if QGIS gave some more useful info to the average user. A lot of people are not aware of the Log Messages panel, or are just not able to check it.
The parser's warnings actually point to ':' or 'regelink:', which are fine... it is the next characters that are the problem, and that tricked me too. Would it help if we showed the text around column 3809 (in the above example) in the error message? And maybe propose some 'common' XML errors: ... uh... like: mwa, never mind.
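As a rough illustration of the "show the text around the column" idea, here is a Python sketch built on the lineno and offset attributes of ExpatError. It is not how QGIS reports errors; the helper name describe_parse_error, the sample document and the 20-byte context window are all assumptions made for this example.

```python
import xml.parsers.expat


def describe_parse_error(data: bytes, namespace_separator=" ", context=20):
    """Parse `data` with Expat; on failure return the error text plus
    the bytes surrounding the reported position, otherwise None."""
    parser = xml.parsers.expat.ParserCreate(namespace_separator=namespace_separator)
    try:
        parser.Parse(data, True)
    except xml.parsers.expat.ExpatError as err:
        lines = data.splitlines()
        # err.lineno is 1-based, err.offset is the 0-based column in that line.
        line = lines[err.lineno - 1] if 0 < err.lineno <= len(lines) else b""
        start = max(err.offset - context, 0)
        snippet = line[start:err.offset + context].decode("utf-8", errors="replace")
        return f"{err} near: ...{snippet}..."
    return None


if __name__ == "__main__":
    # Made-up document with an element name starting with a digit.
    sample = b'<root xmlns:ns="urn:example"><ns:220_a>1</ns:220_a></root>'
    print(describe_parse_error(sample))
```

Showing a window on both sides of the reported column matters here, because the reported position sits on the colon while the actual offender is the digit right after it.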
Should I close this one?
What is the bug or the crash?
With a GeoServer WFS, QGIS fails to show the features of a layer. BUT: when replaying the same request via curl and downloading the GML, QGIS is fine with the resulting file.
QGIS tries to request 1 feature several times, but reports that the response is not 'well-formed':
Retrying request https://myserver/wfs?SERVICE=WFS&REQUEST=GetFeature&VERSION=2.0.0&TYPENAMES=regelink:polygons&COUNT=1&SRSNAME=urn:ogc:def:crs:EPSG::28992: 3/3 2021-09-10T13:28:36 WARNING Error when parsing GetFeature response : Error: not well-formed (invalid token) on line 1, column 3809
If I use curl to retrieve it, QGIS loads it fine. Position 3809 falls exactly on the colon (":") in the following string: regelink:220_1_hsi
See b.zip (one feature) and c.zip (more features).
Note: the attributes in this data start with a number (they come from a PostGIS db) <= I'm aware this gives trouble.
Note 2: not sure if k:220 can depict some UTF code or so?
Steps to reproduce the issue
Versions
3.16 -> master
This copy of QGIS writes debugging output.
Active Python plugins: QuickWKT, nominatim_locator_filter, NITK_RS-GIS_17, pdokservicesplugin, plugin_reloader, QuickOSM, HelloWorldPlugin, HCMGIS, GeoCoding, simplesvg, orientation, sagaprovider, processing, grassprovider
Supported QGIS version
New profile
Additional context
No response