opensensorhub / osh-core

OpenSensorHub Core Modules
http://docs.opensensorhub.org
Mozilla Public License 2.0
32 stars 16 forks source link

[CSAPI] SensorML XML body rejected when creating procedure #251

Closed matthijskooijman closed 1 day ago

matthijskooijman commented 8 months ago

When I create a new procedure through the CSAPI and use SensorML XML as the body, I get:

HTTP ERROR 400 Invalid payload: Invalid XML:  /0/ 0 / Received event END_DOCUMENT, instead of START_ELEMENT or END_ELEMENT at [row,col {unknown-source}]: [1,283]

To reproduce:

curl -X POST --data @- -H "Content-Type: application/sml+xml" "http://admin:admin@localhost:8080/sensorhub/consys/procedures" <<EOF
<?xml version="1.0" encoding="UTF-8"?>
  <sml:PhysicalComponent xmlns:sml="http://www.opengis.net/sensorml/2.0" xmlns:gml="http://www.opengis.net/gml/3.2">
      <gml:identifier codeSpace="uniqueID">urn:someid:111</gml:identifier>
      <gml:name>Name</gml:name>
</sml:PhysicalComponent>
EOF

The procedure is actually created, but it seems that the code tries to read another procedure XML from the body and then gets confused about whether more data is available. I have not found any way to circumvent this issue from the client.

With #247 applied, I obtained the following backtrace:

org.sensorhub.impl.service.consys.InvalidRequestException: Invalid payload: Invalid XML:  /0/ 0 / Received event END_DOCUMENT, instead of START_ELEMENT or END_ELEMENT.
 at [row,col {unknown-source}]: [1,283]
        at NativeClassLoader//org.sensorhub.impl.service.consys.ServiceErrors.invalidPayload(ServiceErrors.java:51)
        at NativeClassLoader//org.sensorhub.impl.service.consys.resource.BaseResourceHandler.create(BaseResourceHandler.java:358)
        at NativeClassLoader//org.sensorhub.impl.service.consys.resource.BaseResourceHandler.doPost(BaseResourceHandler.java:141)
        at NativeClassLoader//org.sensorhub.impl.service.consys.RootHandler.doPost(RootHandler.java:53)
        at NativeClassLoader//org.sensorhub.impl.service.consys.RestApiServlet.lambda$doPost$1(RestApiServlet.java:208)
        at java.base/java.util.concurrent.CompletableFuture$AsyncRun.run(CompletableFuture.java:1736)
        at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515)
        at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
        at java.base/java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:304)
        at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
        at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
        at java.base/java.lang.Thread.run(Thread.java:829)
Caused by: org.sensorhub.impl.service.consys.ResourceParseException: Invalid XML:  /0/ 0 / Received event END_DOCUMENT, instead of START_ELEMENT or END_ELEMENT.
 at [row,col {unknown-source}]: [1,283]
        at NativeClassLoader//org.sensorhub.impl.service.consys.sensorml.SmlProcessBindingSmlXml.deserialize(SmlProcessBindingSmlXml.java:106)
        at NativeClassLoader//org.sensorhub.impl.service.consys.sensorml.SmlProcessBindingSmlXml.deserialize(SmlProcessBindingSmlXml.java:44)
        at NativeClassLoader//org.sensorhub.impl.service.consys.resource.BaseResourceHandler.create(BaseResourceHandler.java:321)
        ... 10 common frames omitted
Caused by: com.ctc.wstx.exc.WstxParsingException: Received event END_DOCUMENT, instead of START_ELEMENT or END_ELEMENT.
 at [row,col {unknown-source}]: [1,283]
        at NativeClassLoader//com.ctc.wstx.sr.StreamScanner.constructWfcException(StreamScanner.java:634)
        at NativeClassLoader//com.ctc.wstx.sr.StreamScanner.throwParseError(StreamScanner.java:504)
        at NativeClassLoader//com.ctc.wstx.sr.StreamScanner.throwParseError(StreamScanner.java:488)
        at NativeClassLoader//com.ctc.wstx.sr.BasicStreamReader.nextTag(BasicStreamReader.java:1229)
        at NativeClassLoader//org.sensorhub.impl.service.consys.sensorml.SmlProcessBindingSmlXml.deserialize(SmlProcessBindingSmlXml.java:88)
        ... 12 common frames omitted

This points me to this code:

https://github.com/opensensorhub/osh-core/blob/d111a6972126dc238af7ac2c0e26adf88db8b1be/sensorhub-service-consys/src/main/java/org/sensorhub/impl/service/consys/resource/BaseResourceHandler.java#L321

Which tries to deserialize objects until there are no more.

This calls the following code:

https://github.com/opensensorhub/osh-core/blob/d111a6972126dc238af7ac2c0e26adf88db8b1be/sensorhub-service-consys/src/main/java/org/sensorhub/impl/service/consys/sensorml/SmlProcessBindingSmlXml.java#L70-L77

Which seems correct - it checks if there is more to deserialize, and if so reads the next tag.

Debug results

However, in practice hasNext() returns true, but then nextTag() concludes that there is nothing left to parse, and raises.

hasNext() looks like this:

    public boolean hasNext() {
        // 08-Oct-2005, TSa: In multi-doc mode, we have different criteria...
        return (mCurrToken != END_DOCUMENT)
            || (mParseState == STATE_MULTIDOC_HACK);
    }

Debug printing (using getEventType()) shows that mCurrToken == 2 (END_ELEMENT according to this source where I think this value comes from), so hasNext() returns true.

However, then when next() is called (as done by nextTag()), mCurrToken changes to 8 (END_OF_DOCUMENT), which causes nextTag() to raise.

Multi-document mode

The woodstox code has some reference to a "multidoc" mode, but from what I've seen I do not think this is actually enabled (but I can't recall details - I looked at this last week).

Since the OSH code is actually trying to deserialize multiple documents, I wondered about this. I also tried to actually POSTing multiple documents (to see if this would work, and to ensure that any fix I would apply would not break this), but I was unsure how this would be formatted (a few naive attempts failed). I looked around the unit tests, but did not immediately saw a test for this either.

Upgrading woodstox

I already tried upgrading woodstox to the latest 6.6.0 version (by modifying osh-node-dev-template/include/osh-core/lib-ogc/swe-common-core/build.gradle), but then the problem persists.

How to fix

I am not quite familiar with how woodstox is supposed to work in this regard, so I'm not quite sure if this is a bug in woodstox or that OSH is using the woodstox API incorrectly. I tried figuring out, but the XML parsing code is so complex that I did not really know.

I've also checked the woodstox API docs, but for hasNext() and nextTag() that just refers to the java builtin XmlStreamReader interface.

IIUC, the reader has a concept of a current event (type returned by getEventType()), and next() reads the next event and makes it current. After all events there is one final END_OF_DOCUMENT event. Once that event is the current one, hasNext() returns false and next() should not be called again.

In this case, once the first deserialize is completed, the current event is still the END_ELEMENT of the previous top-level document. This means that hasNext() will always return true.

Ideally, we would have a peek() method, but that does not exist (there is a peekNext() method, but it is from the StreamScanner superclass and seems to return individual characters, not parsed events.

To ensure that hasNext() returns the proper result , you could just call next() before calling hasNext(), which would work if there is no more data. However, if there is more data, this means that the current event is now (probably) a START_ELEMENT, ready to be processed. But since nextTag starts with calling next(), this START_ELEMENT is then skipped. So we could

  1. not call nextTag() at all, but then this would break if there is whitespace or comments between the documents (which is what nextTag() skips and smlBindings.readDescribedObject() will choke on that).
  2. call nextTag() only when the current type is something nextTag() would have skipped as well, but that duplicates a bunch of logic from nextTag(). I noticed there is an isWhitespace() method, which prevents having to duplicate the fairly complex whitespace detection, but you would still end up duplicating the list of things to skip (whitespace, comments).
  3. call nextTag() only if the current type is not START_ELEMENT or END_ELEMENT (which is what nextTag skips towards), but then you might end up unintentionally skipping something that is not whitespace or comments.
  4. not add a call to next(), but just call nextTag() as now, but catch the exception. Unfortunately nextTag() can also cause other exceptions that are undistinguishable except for the message, but we can check the current type in the except handler, and if it is END_OF_DOCUMENT (or hasNext() returns false), ignore the exception and return.

After writing this, things become a little more clear. I have the idea that option 4. would be the most robust working solution. I'll see if I can get this to work and maybe create a PR for this.

Other affected code

From looking through the code, it seems the only other XML parsing code that works like this (found using git grep -A 5 hasNext | grep -B 5 nextTag) is the SmlFeatureBindingSmlXml class which contains the same construct, which I expect is broken in the same way.

There is also SmlFeatureBindingSmlJson which has very similar code, but that actually uses peek() (which the JSON parser does provide), so I expect this problem does not exist there (except maybe when there is trailing whitespace, which might not trigger hasNext(), but might cause nextTag() to raise - I have not investigated).

matthijskooijman commented 8 months ago

not add a call to next(), but just call nextTag() as now, but catch the exception. Unfortunately nextTag() can also cause other exceptions that are undistinguishable except for the message, but we can check the current type in the except handler, and if it is END_OF_DOCUMENT (or hasNext() returns false), ignore the exception and return.

This seems to work, see #252 for a PR implementing it.