w3c / epubcheck

The conformance checker for EPUB publications
https://www.w3.org/publishing/epubcheck/
BSD 3-Clause "New" or "Revised" License

i18n for Schematron and Jing/RNG messages #474

Open tofi86 opened 10 years ago

tofi86 commented 10 years ago

I researched how to localize Schematron messages and found two ways:

I think the diagnostics way looks promising, but I don't understand how Schematron validation is performed in epubcheck, or with which tools. It seems that they don't support the diagnostics method...

And then, for the Jing RNG messages: I haven't worked with RNG so far, so I don't know how to do localization there anyways... Any ideas?


I tried this with opf.sch for EPUB 2 and created a demo file with two identical IDs in the OPF file.

This is my demo opf.sch:

<sch:schema xmlns:sch="http://www.ascc.net/xml/schematron" xml:lang="en">

  <sch:ns prefix="dc" uri="http://purl.org/dc/elements/1.1/" />
  <sch:ns prefix="opf" uri="http://www.idpf.org/2007/opf" />

  <sch:pattern name="opf_idAttrUnique" id="opf_idAttrUnique">
      <!-- id attribute value must be unique for any id attribute in opf file-->
      <sch:rule context="//*[@id]">
        <sch:assert test="count(//@id[. = current()/@id]) = 1" diagnostics="d1 d2">
            [ORIG] The "id" attribute does not have a unique value
        </sch:assert>
      </sch:rule>
  </sch:pattern>

  <sch:diagnostics>
    <sch:diagnostic id="d1" xml:lang="en">
        [EN] The "id" attribute does not have a unique value
    </sch:diagnostic>
    <sch:diagnostic id="d2" xml:lang="de">
        [DE] German message
    </sch:diagnostic>
  </sch:diagnostics>

</sch:schema>

This isn't working on my German system. Validation always shows the [ORIG] message...

tofi86 commented 10 years ago

And this would be the ISO schematron file:

<sch:schema xmlns:sch="http://purl.oclc.org/dsdl/schematron" queryBinding="xslt2" xml:lang="en">

  <sch:ns prefix="dc" uri="http://purl.org/dc/elements/1.1/" />
  <sch:ns prefix="opf" uri="http://www.idpf.org/2007/opf" />

  <sch:pattern id="opf_idAttrUnique">
      <!-- id attribute value must be unique for any id attribute in opf file-->
      <sch:rule context="//*[@id]">
        <sch:assert test="count(//@id[. = current()/@id]) = 1" diagnostics="d1 d2">
            [ORIG] The "id" attribute does not have a unique value
        </sch:assert>
      </sch:rule>
  </sch:pattern>

  <sch:diagnostics>
    <sch:diagnostic id="d1" xml:lang="en">
        [EN] The "id" attribute does not have a unique value
    </sch:diagnostic>
    <sch:diagnostic id="d2" xml:lang="de">
        [DE] German message
    </sch:diagnostic>
  </sch:diagnostics>

</sch:schema>

tofi86 commented 10 years ago

And then, for the Jing RNG messages: I haven't worked with RNG so far, so I don't know how to do localization there anyways... Any ideas?

A messages.properties file under src/main/resources/com/thaiopensource/relaxng/pattern/resources should do the trick for RNG localization, right @takahashim?

murata2makoto commented 10 years ago

Tobias,

Namba-san (who first started the Japanese error messages for epubcheck), Takahashi-san, and I have discussed this. I also spoke with schema language experts of SC34, which is responsible for the ISO version of Schematron.

A short-term solution is to prepare a Schematron file for each locale. At run time, we choose one of them depending on the locale. This does not require a lot of changes to epubcheck, but duplication of Schematron logic would become a problem. Namba-san proposed to have one master Schematron file and one message file for each locale, and to generate different Schematron files for different locales.
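The run-time selection in the short-term solution could be sketched roughly as follows. This is a minimal Python sketch, not epubcheck code: the function name `pick_schema`, the `<base>_<locale>.sch` naming pattern, and the fallback order are all assumptions made for illustration.

```python
def pick_schema(base, locale, available):
    """Return the most specific localized schema name found in `available`.

    `base` is the master schema name without extension (hypothetical naming),
    `locale` is something like "ja_JP" or "de", and `available` is the set of
    schema file names that were actually shipped.
    """
    parts = locale.replace("-", "_").split("_")
    # Try "ja_JP" first, then "ja", then fall back to the unlocalized master.
    while parts:
        candidate = f"{base}_{'_'.join(parts)}.sch"
        if candidate in available:
            return candidate
        parts.pop()
    return f"{base}.sch"
```

For example, with only `package-30.sch` and `package-30_ja.sch` available, a `ja_JP` locale would resolve to `package-30_ja.sch`, and a `de` locale would fall back to the master file.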

murata2makoto commented 10 years ago

Let us ask some questions before making a decision.

First, should we continue to use Schematron or migrate to pure Java for checking integrity constraints? If in-memory XML, which is required by Schematron, causes performance problems, we should think about such migration. In the case of Validator.Nu, they migrated from Schematron to Java for this reason. But if epubcheck is typically used as a client program rather than a server program, performance is not very important.

Second, which version of Schematron should we use? Schematron files in epubcheck are already ISO Schematron, since they use the namespace of ISO Schematron. Although a revision of ISO schematron is in progress at SC34, I think that it is mature enough. Thus, I believe that we should continue to use ISO Schematron V1 (if we continue to use Schematron).

Third, which implementation of Schematron should we use?
We have used Jing. Although it is a great implementation of RELAX NG, I do not know if its maintainers are committed to maintaining the Schematron implementation. See https://code.google.com/p/jing-trang/issues/list?can=2&q=schematron and https://code.google.com/p/jing-trang/issues/detail?id=169

We might want to migrate to the reference implementation of Schematron (http://www.schematron.com/implementation.html). Thoughts?

murata2makoto commented 10 years ago

Oops. I wrote ", I think that it is mature enough." But the ongoing revision of ISO Schematron is NOT mature enough.

mgylling commented 10 years ago

Hi Makoto, I don't think we have a performance issue worth worrying about with the current schematron; we have avoided the classic schematron performance pitfall by using xsl:keys that are only set once per instantiation. As we migrate to the W3C HTML5 validator, we will of course be pruning many of the html schematron tests anyway (assuming we can use their SAX tests as well); but tests for other file types can be kept at this point IMO.

As for handling message L10N in Schematron: one approach I used in another project a few years ago was to have just one Schematron file that, instead of literal error messages, contained a key/id. The Java code later picked up that key and looked up the message string using the common Java localization approach (which is already in use in epubcheck).

In other words, the Schematron would look something like this:

<assert ...>[[SOME_KEY]]</assert>

with a corresponding messages properties entry:

[[SOME_KEY]]=Literal error message in some locale

The only thing that needs to be settled and used consistently is how to distinguish a KEY from a literal message programmatically (especially as the java code that does the lookup does not necessarily know who emitted the message). The "[[" or similar string is for this purpose, and can of course be anything as long as it would never be the lead-in of a string in an ordinary message.
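A minimal sketch of that lookup step, written in Python for brevity rather than in Java. The names `KEY_PATTERN`, `resolve_message`, and `catalog` are hypothetical, and the allowed key characters are an assumption; only the `[[...]]` lead-in comes from the description above.

```python
import re

# Assumption: a key is the whole message, wrapped in "[[" and "]]",
# and consists of upper-case letters, digits and underscores.
KEY_PATTERN = re.compile(r"^\s*(\[\[[A-Z0-9_]+\]\])\s*$")


def resolve_message(emitted, catalog):
    """Return the localized text if `emitted` is a key, else the text itself."""
    match = KEY_PATTERN.match(emitted)
    if match is None:
        return emitted  # an ordinary literal message passes through unchanged
    key = match.group(1)
    # Fall back to the raw key so that a missing translation stays visible.
    return catalog.get(key, key)
```

Because literal messages pass through untouched, the lookup code does not need to know which validator emitted a given message.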

tofi86 commented 10 years ago

hey,

But duplication of schematron logic would become a problem. Namba-san proposed to have one master schematron file and one message file for each locale and generate different schematron files for different locales.

No, duplicating whole files doesn't sound very good... I'd prefer the master-Schematron way, or @mgylling's Java string solution...

Sad, that there's no solution in "pure" schematron which is supported by Jing... But we obviously haven't been the first to ask for that feature, as you pointed out, @murata0204. (https://code.google.com/p/jing-trang/issues/detail?id=169)

Second, which version of Schematron should we use? Schematron files in epubcheck are already ISO Schematron, since they use the namespace of ISO Schematron.

Not all! There are at least some with namespace http://www.ascc.net/xml/schematron, as mentioned above...

We might want to migrate to the reference implementation of Schematron (http://www.schematron.com/implementation.html), Thoughts?

You mean the "pure" skeleton / XSLT way? Does it support diagnostics for multi-lingual messages?

murata2makoto commented 10 years ago

Markus, yes, I remember what you said in Tokyo. I invited Namba-san to join this discussion.

Tobias, yes, of course. Some Schematron files are Schematron 1.5. I think that we should migrate to ISO Schematron. Differences between ISO Schematron and Schematron 1.5 are minor (see http://www.schematron.com/spec.html). I believe that we only have to change the namespace name.

I am not sure if the pure skeleton XSLT implementation supports multi-lingual error messages. I will try and report back soon.

murata2makoto commented 10 years ago

Markus, I think that it is reasonable for us to stick to Schematron.
I just wanted to record this discussion as part of this issue.

murata2makoto commented 10 years ago

I tried the pure skeleton XSLT implementation of ISO Schematron. It emits an XML document that contains the messages in all locales. From this XML document, we can extract the error message for a particular locale.

But how can we report line numbers? epubcheck now reports line numbers even when errors are detected by Schematron. Are the line numbers reported by Jing?

mgylling commented 10 years ago

Yes, they are reported by the SAX parser that drives the Jing schematron process.

What is the reason for considering a move to the pure XSLT implementation?

/markus

murata2makoto commented 10 years ago

The pure skeleton XSLT implementation is intended to be a fully conformant implementation by the original designer of Schematron. Migration would be nice if it did not require too much work, but it appears to be non-trivial.

tofi86 commented 10 years ago

What is the reason for considering a move to the pure XSLT implementation?

Skeleton would provide support for multi-lingual schematron messages with diagnostics.

But obviously it might be harder to implement than we thought, because it can't provide line numbers for messages.

murata2makoto commented 10 years ago

Markus,

Namba-san and I exchanged some e-mails, and he enlightened me.

I have a question about your approach to L10N of Schematron error messages.

First, consider

<assert ...>shall specify 'application/smil+xml' rather than '<value-of select="$item-media-type"/>'</assert>

The Schematron validator is expected to expand this as

shall specify 'application/smil+xml' rather than 'text/html'

when $item-media-type is 'text/html'. This expansion happens when Schematron is executed against an instance document.

Now, suppose that we want to have a Japanese (in romaji) error message:

'text/html' ga shiteisarete imasuga 'application/smil+xml' wo shiteishite kudasai

This would require

<assert ...>'<value-of select="$item-media-type"/>' ga shiteisarete imasuga 'application/smil+xml' wo shiteishite kudasai</assert>

How can we do this? A simple expansion of a key by a value does not work well.

mgylling commented 10 years ago

Well, admittedly I didn't have to deal with composed messages in that project a few years ago... ;)

I haven't looked at all our Schematron tests, and have too little time right now in general to think about this carefully (so thanks to you and Namba-san), but in theory this could be supported by a) allowing multiple keys per message and b) using the rule that the localized message corresponding to a key replaces only the string scope of that key, not the entire message string.

In other words,

shall specify 'application/smil+xml' rather than 'text/html'

would be expressed in the schematron as:

[[KEY_1]] 'application/smil+xml' [[KEY_2]] '<value-of select="$item-media-type"/>'

... and in messages.properties we have

[[KEY_1]]=shall specify
[[KEY_2]]=rather than

Looking at it, I do realize that this can be difficult to get right linguistically/grammatically for different languages, since it wouldn't support reordering of words/phrases, and so on.

murata2makoto commented 10 years ago

Markus,

Note that the Japanese version is

'text/html' ga shiteisarete imasuga 'application/smil+xml' wo shiteishite kudasai

where text/html precedes application/smil+xml, while the latter precedes the former in the English version. This can be overcome if we introduce parameterized keys, but things get even more difficult.

Namba-san and I will try to provide details of an alternative proposal.
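To make the reordering problem concrete, here is a minimal Python sketch of what parameterized keys could look like. The `TEMPLATES` table, the `render` function, and the `{media_type}` placeholder syntax are assumptions for illustration, not something proposed in this thread; only the two message texts are taken from the examples above.

```python
# Hypothetical per-locale templates. A named placeholder lets each language
# put the parameter wherever its grammar needs it.
TEMPLATES = {
    "en": "shall specify 'application/smil+xml' rather than '{media_type}'",
    "ja": "'{media_type}' ga shiteisarete imasuga "
          "'application/smil+xml' wo shiteishite kudasai",
}


def render(locale, **params):
    """Fill the template for `locale`, falling back to English."""
    template = TEMPLATES.get(locale, TEMPLATES["en"])
    return template.format(**params)
```

The difficulty is that the parameter value is only known when the Schematron rule runs against an instance document, so the validator would have to hand both the key and the evaluated value to the code doing the lookup.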

mgylling commented 10 years ago

Namba-san and I will try to provide details of an alternative proposal.

Right. As you have probably seen already, but just to be sure: in the Maven build script there is already a process that runs the schematron files through an XSLT process (that resolves includes and a few other things IIRC, all things that aren't supported by Jing's schematron implementation). I suppose it would be good if the solution we come up with could be amended to this process (e.g. ideally be implemented in XSLT), as it would minimize the addon code and logic we will have to support. Imagine how there is only one schematron file in the source, and the build script creates _ja.sch, _de.sch etc automatically...

murata2makoto commented 10 years ago

Here is an example to illustrate Namba-san's schema rewriting approach.

First, we have a master Schematron schema package-30.sch, which is in English. But note that @msg:id is added.

<schema ... xmlns:msg="http://idpf.org/2014/epubcheck/i18n-messages/1.0">
<pattern ...>
  <rule ...>
    <assert test="..." msg:id="invalid_media_type_in_media_overlay_item"
      >media overlay items must be of the 'application/smil+xml' type (given type was '<value-of select="$item-media-type"/>')</assert>
  </rule>
</pattern>
</schema>

Second, we have a message document, say package-30_msg_ja.xml, for the Japanese language. The root element is <messages>.
Its children are <message> elements having @id attributes.

package-30_msg_ja.xml

<messages xmlns="http://idpf.org/2014/epubcheck/i18n-messages/1.0">
  <message id="invalid_media_type_in_media_overlay_item">media-overlay
  zokusei wo shiteishita item youso no media-type zokuseichiniha
  'application/smil+xml' ga shitei sarenakuteha narimasen('<value-of
  select="$item-media-type"/>' ga shitei sarete imasu)</message>
 ...
</messages>

From package-30.sch and package-30_msg_ja.xml, we generate package-30_ja.sch. This can certainly be done with XSLT.

<schema ... xmlns:msg="http://idpf.org/2014/epubcheck/i18n-messages/1.0">
<pattern ...>
  <rule ...>
    <assert test="..." msg:id="invalid_media_type_in_media_overlay_item"
      >media-overlay
  zokusei wo shiteishita item youso no media-type zokuseichiniha
  'application/smil+xml' ga shitei sarenakuteha narimasen('<value-of
  select="$item-media-type"/>' ga shitei sarete imasu)</assert>
  </rule>
</pattern>
</schema>

At run time, we use package-30_ja.sch for the Japanese locale.
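The generation step is proposed as XSLT. Purely to illustrate the logic, here is a minimal Python sketch using the standard library's ElementTree. The function name `localize` is hypothetical, and two things are assumed: that the master schema uses the ISO Schematron namespace, and that child elements written in the message file (such as value-of) are meant to end up in the Schematron namespace.

```python
import copy
import xml.etree.ElementTree as ET

SCH = "http://purl.oclc.org/dsdl/schematron"  # assumption: ISO Schematron
MSG = "http://idpf.org/2014/epubcheck/i18n-messages/1.0"


def localize(master_xml, messages_xml):
    """Return the master schema with assert texts replaced per @msg:id."""
    schema = ET.fromstring(master_xml)
    catalog = ET.fromstring(messages_xml)
    messages = {m.get("id"): m for m in catalog.iter(f"{{{MSG}}}message")}
    for assertion in list(schema.iter(f"{{{SCH}}}assert")):
        msg_id = assertion.get(f"{{{MSG}}}id")
        message = messages.get(msg_id) if msg_id else None
        if message is None:
            continue  # no translation: keep the master (English) text
        # Replace the mixed content: leading text plus child elements.
        for child in list(assertion):
            assertion.remove(child)
        assertion.text = message.text
        for child in message:
            clone = copy.deepcopy(child)  # keeps attributes and tail text
            # Assumption: move direct children into the Schematron namespace.
            clone.tag = f"{{{SCH}}}" + clone.tag.split("}")[-1]
            assertion.append(clone)
    return schema
```

Calling `ET.tostring(localize(master, messages), encoding="unicode")` would give the text of the localized schema. The sketch only handles assert elements and direct children of a message; report elements and deeper nesting would need the same treatment.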

tofi86 commented 10 years ago

This sounds like a great idea!

If you want I can contribute the XSLT part...

georgebina commented 9 years ago

Hi all,

Tobias pointed me to this discussion, so here is my feedback on this:

The ISO Schematron support in Jing is really at its beginning - mainly I copied the 1.6 support and updated that to recognize the ISO Schematron namespace. This was done initially in the oNVDL project, that forked Jing to provide NVDL validation. The NVDL validation support was merged back into Jing and at some point someone asked to support ISO Schematron also in Jing so I added that initial ISO Schematron support that I already had in oNVDL to Jing https://code.google.com/p/jing-trang/source/detail?r=2357

FWIW, Jing's Schematron implementation is also based on XSLT.

I think the simplest approach will be to use diagnostics, as in the original example posted by Tobias, and have the XSLT-based pre-processing step that already exists take the current locale as a parameter and expand the correct diagnostic into the message field. More concretely:

<sch:schema xmlns:sch="http://www.ascc.net/xml/schematron" xml:lang="en">

  <sch:ns prefix="dc" uri="http://purl.org/dc/elements/1.1/" />
  <sch:ns prefix="opf" uri="http://www.idpf.org/2007/opf" />

  <sch:pattern name="opf_idAttrUnique" id="opf_idAttrUnique">
      <!-- id attribute value must be unique for any id attribute in opf file-->
      <sch:rule context="//*[@id]">
        <sch:assert test="count(//@id[. = current()/@id]) = 1" diagnostics="d1 d2">
            [ORIG] The "id" attribute does not have a unique value
        </sch:assert>
      </sch:rule>
  </sch:pattern>

  <sch:diagnostics>
    <sch:diagnostic id="d1" xml:lang="en">
        [EN] The "id" attribute does not have a unique value
    </sch:diagnostic>
    <sch:diagnostic id="d2" xml:lang="de">
        [DE] German message
    </sch:diagnostic>
  </sch:diagnostics>

</sch:schema>

may become for locale "de":

<sch:schema xmlns:sch="http://www.ascc.net/xml/schematron" xml:lang="en">

  <sch:ns prefix="dc" uri="http://purl.org/dc/elements/1.1/" />
  <sch:ns prefix="opf" uri="http://www.idpf.org/2007/opf" />

  <sch:pattern name="opf_idAttrUnique" id="opf_idAttrUnique">
      <!-- id attribute value must be unique for any id attribute in opf file-->
      <sch:rule context="//*[@id]">
        <sch:assert test="count(//@id[. = current()/@id]) = 1">
            [DE] German message
        </sch:assert>
      </sch:rule>
  </sch:pattern>
</sch:schema>

or some variation of this.

When Jing is updated to support diagnostics and locale messages, or if we move to a different implementation that supports this, we can just remove this additional pre-processing.

I hope this helps!

Regards, George

murata2makoto commented 9 years ago

George,

Nice to see you here!

Use of XSLT for preprocessing Schematron schemas sounds nice to me.

Namba-san and I think that we need a separate message file for each natural language. If we have a single file for all natural languages, anybody can easily break messages written in languages they do not understand.

Namba-san's work has been delayed by a budget problem, but it is expected to be solved by the end of this month.

georgebina commented 9 years ago

Dear Makoto san,

It should be possible to put all the messages for a specific language in a file like

<sch:diagnostics>
    <sch:diagnostic id="de.d1" xml:lang="de">
        [DE] German message 1
    </sch:diagnostic>
    <sch:diagnostic id="de.d2" xml:lang="de">
        [DE] German message 2
    </sch:diagnostic>
  ...
</sch:diagnostics>

and then sch:include each such file in the main schema. One advantage of this approach is that we stay within the Schematron semantics and do not need to add an annotation layer on top of Schematron. An annotation as you mentioned should also be fine, though: if a tool does not understand msg:id attributes, it will just ignore them. One thing to pay attention to is the namespace of elements that can be used within a message. For example, value-of is placed in the ...i18n-messages/1.0 namespace in the example, and it should probably stay within the Schematron namespace.

Best Regards, George

murata2makoto commented 9 years ago

Namba-san and I think that George's approach is good. It does not deviate from the Schematron syntax and semantics but can overcome limitations of existing implementations. We can also avoid a bug in handling sch:diagnostic//sch:value-of (see https://code.google.com/p/jing-trang/issues/detail?id=183).

Additional XSLT has to expand sch:include and replace the content of sch:assert by the content of sch:diagnostic. It probably needs a parameter for specifying a locale.
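The replacement half of that step can be sketched as follows. This is an illustrative Python version of the logic, not the XSLT itself. It assumes that sch:include has already been expanded, that a diagnostic carries xml:lang directly on the element, and that the first referenced diagnostic matching the locale wins. The names `expand_diagnostics` and `ns` are hypothetical.

```python
import copy
import xml.etree.ElementTree as ET

ISO_SCH = "http://purl.oclc.org/dsdl/schematron"
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def expand_diagnostics(schema_xml, locale, ns=ISO_SCH):
    """Inline the diagnostic for `locale` into each assert/report."""
    schema = ET.fromstring(schema_xml)
    diagnostics = {d.get("id"): d for d in schema.iter(f"{{{ns}}}diagnostic")}
    wanted = (f"{{{ns}}}assert", f"{{{ns}}}report")
    for check in [e for e in schema.iter() if e.tag in wanted]:
        # Drop the reference list; the chosen text is inlined instead.
        refs = check.attrib.pop("diagnostics", "").split()
        chosen = next(
            (diagnostics[r] for r in refs
             if r in diagnostics and diagnostics[r].get(XML_LANG) == locale),
            None,
        )
        if chosen is None:
            continue  # no diagnostic for this locale: keep the original text
        for child in list(check):
            check.remove(child)
        check.text = chosen.text
        for child in chosen:
            check.append(copy.deepcopy(child))  # e.g. sch:value-of
    # The diagnostics block is no longer referenced, so remove it.
    for block in schema.findall(f"{{{ns}}}diagnostics"):
        schema.remove(block)
    return schema
```

Passing `ns="http://www.ascc.net/xml/schematron"` would apply the same logic to the Schematron 1.5 files. Since the generated schema contains no diagnostics at all, it should not depend on how the validator handles them.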