mtmse / pipeline-mod-mtm

MTM specific modules for the DAISY Pipeline 2
0 stars 0 forks source link

Behaviour is not the same as in standalone dotify #14

Open joeha480 opened 8 years ago

joeha480 commented 8 years ago

Running with this file: http://www.daisy.org/samples/dtbook-xml/great-painters-dtbook-2005-3.zip

in DP2 resulted in an error. Processing terminated by xsl:message at line 331 in dtbook2flow_sv_SE_braille.xsl

However, when running standalone Dotify, the file is processed just fine.

Could this be an environment issue related to the xml processor?

bertfrees commented 8 years ago

See https://github.com/joeha480/dotify/blob/ed810ee7444ee14c4dd108bd357921815f88e871/dotify.task.impl/src/org/daisy/dotify/impl/input/xml/resource-files/sv-SE/xslt-files/dtbook2flow_sv_SE_braille.xsl#L331

bertfrees commented 8 years ago

There is some more debug info on STDERR (tested with Mischa's file):

Unexpected text contents in "note" element.
Error at xsl:message on line 331 of dtbook2flow_sv_SE_braille.xsl:
  XTMM9000: Processing terminated by xsl:message at line 331 in dtbook2flow_sv_SE_braille.xsl
  at xsl:apply-templates (bundle://33.0:1/org/daisy/dotify/impl/input/xml/resource-files/sv-SE/xslt-files/dtbook2flow_sv_SE_braille.xsl#299)
     processing /dtbook/book[1]/rearmatter[1]/level1[1]/note[1]
  at xsl:call-template name="insertNoteCollection" (bundle://33.0:1/org/daisy/dotify/impl/input/xml/resource-files/sv-SE/xslt-files/dtbook2flow_sv_SE_braille.xsl#33)
bertfrees commented 8 years ago

@joeha480 By just looking at the XSLT it seems reasonable to throw an error for this input:

<note id="fn1">
  <p>Dies ist der Kommentar zu Fussnote 11</p>
</note>

In the template at line 328, the variable $note has 3 child nodes: a p element and two text nodes at both sides of the p. The first text node triggers the terminating message.

joeha480 commented 8 years ago

Sure, but it works standalone, so if there is an error here now, there could be errors in other places with similar XPaths (which might not report it as errors). We need to know what is causing this.

bertfrees commented 8 years ago

It looks like you have configured Saxon (or Saxon configured itself) to automatically strip space from the source document (https://www.w3.org/TR/xslt20/#strip). Could it have to do with the DOCTYPE declaration? Maybe your Saxon analyses the DTD and decides that the white space is insignificant?

joeha480 commented 8 years ago

Yes I think that could be it. I tried to google this, but I didn't find it. But you did!

In practice this means that elements that are defined in a DTD or a Schema to contain element-only content will have whitespace text nodes stripped, regardless of the xsl:strip-space and xsl:preserve-space declarations in the stylesheet.

Do you know what is causing the difference in the saxon configuration? Is dtd resolution turned off somewhere?

bertfrees commented 8 years ago

You can initialize Saxon with or without schema awareness. And indeed Pipeline does it without. Dotify probably does it with (explicitly in your code or not, I don't know how you do it). I don't know if we did it like we did for a specific reason. I can imagine a possible reason is to make the system behave the same with or without internet access. @capitancambio @rdeltour: do you know?

By the way Pipeline initializes Saxon through Calabash (XProcConfiguration.java) but that is not relevant to the issue.

capitancambio commented 8 years ago

I'm not sure about the motivation of starting saxon without schema awareness, probably not needed?

rdeltour commented 8 years ago

I don't remember either. Isn't it Calabash's default?

bertfrees commented 8 years ago

Maybe yes. The question now is whether there is any drawback to setting schemaAware to true? I don't think any of our XSLTs depend on schemaAware to be false? And we could possibly add the DTBook DTD to some catalog.xml to get a deterministic system even when there is no internet access.

Or should we make it a configuration option?

capitancambio commented 8 years ago

The dtbook dtd should be there, offered by at least one of our scripts. In any case, any static asset like dtds should be offered by common modules or the script itself as a static resource.

cheers,

Javi

bertfrees commented 8 years ago

@rdeltour @capitancambio What should we do with this? Making it a configuration option doesn't seem right if the behavior of the modules depends on it. (TBH I think it's a bit weird that Saxon has this configuration option in the first place.) Shall we just try and enable it and see what happens?

rdeltour commented 8 years ago

@bertfrees yes you can try to enable it; but then we should be careful about testing all our scripts.

rdeltour commented 8 years ago

@bertfrees @joeha480 I had a second look at this issue: it appears the root of the issue comes from the XSLT which doesn't cover the case were there are whitespace-only nodes in the note element. That's a bug IMO.

That schema-aware processing solves it is one thing, almost a "side effect", but the XSLT should also work in non-schema-aware processing.
In that case, it would be reasonably straightforward to fix the XSLT with some xslt:strip-space declaration (preferably) or by explicitly testing the content of text nodes in the xsl:when/@test expression.

And also, the reason why we don't configure Saxon to be schema-aware is probably that Saxon HE doesn't provide the feature :smile:. So I'm afraid that changing the configuration won't solve anything in the default Pipeline distribution.
BTW, if you're testing your XSLT standalone in oXygen or somewhere else, it's usually good to run it with Saxon HE too, at least from time to time.

Sorry if I didn't pay more attention previously!

joeha480 commented 8 years ago

I don't think it's a bug in the xslt (but maybe I should explicitly set schema aware in the stylesheet using xsl:import-schema if that is even what it is for). It's very possible however that I haven't instantiated Saxon correctly in OSGi context. That whole area is a bit unclear to me.

If Saxon isn't providing the feature, then where could I be getting the feature from? This is the list of third party dependencies: jing-20120724.0.0.jar Saxon-HE-9.5.1-5.jar saxon-he-9.5.1.5.jar wstx-lgpl-3.2.7.jar (I have tried without this) xercesImpl-2.11.0.jar (and without this) xml-apis-1.4.01.jar

@bertfrees: I guess we could disable schema awareness in Dotify and run regression tests on it (because there might be other places where I use this feature). Anyway, we need to solve this before finishing the project. I can give you access to my regression tests, so that you can run them, if you want.

bertfrees commented 8 years ago

@rdeltour: Right, so that's the reason Saxon has the option :)

By the way, yesterday I also realized that changing the configuration in calabash-adapter wouldn't necessarily have had an impact on Joel's step because the step invokes the XSLT through Java (Dotify) so it depends on how Joel invokes the XSLT in Dotify.

Regarding fixing the XSLT: yes, that was my first reaction too. But then again, it also seems reasonable to make certain assumptions about the input, or to rely on a processor feature. I wouldn't say the XSLT is "broken".

@joeha480: Yes, xsl:import-schema appears to be the right command, however it's only Saxon-EE: http://www.saxonica.com/documentation9.5/xsl-elements/import-schema.html.

I wonder how you even made it work in standalone Dotify? :confused: Are you sure you are using Saxon-HE? EDIT: OK you asked the same question. Where are you getting your XSLT transformer object from in Java? Can you give me a pointer?

Another idea I got was to add an explicit pre-processing step that strips the unwanted spaces based on the schema. Maybe it is straightforward to do (i.e. without the Saxon feature).

rdeltour commented 8 years ago

I don't think it's a bug in the xslt (@joeha480) I wouldn't say the XSLT is "broken". (@bertfrees)

Well, I used "bug" to make my point (rather than to point the finger :smile: ), but yes I actually think that the XSLT should work with non-schema-aware processors. Unless it requires schema-awareness by contract, but I think there's nothing justifying it in this case.

What's wrong with using xsl:strip-space or testing ws-only text nodes?

As for why it works in standalone Dotify, my wild guess is that the XML source tree is built with another parser, not directly from Saxon.

bertfrees commented 8 years ago

I think the problem with testing ws-only text nodes is that the same issue could exist in other XSLTs, so we'll have to go through all of them.

Using strip-space could be a solution. It's also not completely automatic, but it shouldn't be hard to do it based on the schema. (Wouldn't it be possible to automatically generate a xsl:strip-space element from a schema document? Or am I making it too complicated?)

As for why it works in standalone Dotify, my wild guess is that the XML source tree is built with another parser, not directly from Saxon.

Ah, that would explain it, yes. And in that case it would indeed be a "side effect".

rdeltour commented 8 years ago

the same issue could exist in other XSLTs, so we'll have to go through all of them.

Yes, eventually I think that all other XSLT s/b tests without schema awareness.

Using strip-space could be a solution. It's also not completely automatic, but it shouldn't be hard to do it based on the schema. (Wouldn't it be possible to automatically generate a xsl:strip-space element from a schema document? Or am I making it too complicated?)

It depends: how many XSLTs or different schemas are we talking about?

Depending on the number of XSLT, a quick manual review of the pain points (like failing on text nodes, regardless of possibly empty content) might be preferrable?

bertfrees commented 7 years ago

To do: look at "STRIP_WHITESPACE" setting in http://www.saxonica.com/html/documentation/configuration/config-features.html.