Validate content of <dc:language>

jon-moreira commented 8 years ago

epubcheck doesn't check the value of dc:language!

Every metadata section must include at least one language element with a value conforming to [RFC5646].

The following example shows a Publication that is in U.S. English.
<metadata xmlns:dc="http://purl.org/dc/elements/1.1/">
    …
    <dc:language>en-US</dc:language>
    …
</metadata>

The content.opf of my EPUB after export from Adobe InDesign:

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<package xmlns="http://www.idpf.org/2007/opf" xmlns:dc="http://purl.org/dc/elements/1.1/" unique-identifier="bookid" version="2.0">
    <metadata>
        <meta name="generator" content="Adobe InDesign"/>
        <meta name="cover" content="xxx-cover.jpg"/>
        <dc:title>xxxx</dc:title>
        <dc:creator>xxx</dc:creator>
        <dc:subject></dc:subject>
        <dc:description>xxx</dc:description>
        <dc:publisher>Editorial Presença</dc:publisher>
        <dc:date>2016-02-11</dc:date>
        <dc:source></dc:source>
        <dc:relation></dc:relation>
        <dc:coverage></dc:coverage>
        <dc:rights></dc:rights>
        <dc:language>en-US-POSIX</dc:language>
        <dc:language>pt-BR</dc:language>
        <dc:identifier id="bookid">xxx</dc:identifier>
    </metadata>

<dc:language>en-US-POSIX</dc:language> is not a valid value, yet epubcheck does not report it.

epubcheck output:

java -jar epubcheck.jar xxx.epub 
Validating using EPUB version 2.0.1 rules.
No errors or warnings detected.
epubcheck completed

tofi86 commented 7 years ago

While at first glance this looks easy to implement, it gets harder when you look at the RFC 5646 spec itself and not only at the EPUB example: https://tools.ietf.org/html/rfc5646#appendix-A

Possibly allowed language tags:

de (German)
en-US (English as used in the United States)
zh-Hans (Chinese written using the Simplified Chinese script)
zh-cmn-Hans-CN (Chinese, Mandarin, Simplified script, as used in China)
sl-rozaj (Resian dialect of Slovenian)
de-CH-1901 (German as used in Switzerland using the 1901 variant [orthography])
hy-Latn-IT-arevela (Eastern Armenian written in Latin script, as used in Italy)
az-Arab-x-AZE-derbend (private use subtags)

To be honest: that's a validation nightmare! And I don't see a quick way to build a validation engine for that...

In fact, it could even be that your example en-US-POSIX is a valid RFC 5646 language tag, although it doesn't make sense to us right now...
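
To make the well-formedness part concrete, here is a minimal Python sketch, assuming a simplified transcription of the RFC 5646 `langtag` grammar. The names `LANGTAG` and `parse_langtag` are made up for illustration and are not part of epubcheck. Grandfathered tags and private-use-only tags are left out, and duplicate variants or singletons are not rejected.

```python
import re

# Simplified transcription of the RFC 5646 "langtag" production.
# Assumption: grandfathered and private-use-only ("x-...") tags are omitted.
LANGTAG = re.compile(
    r"""
    (?P<language>[a-z]{2,3}(?:-[a-z]{3}){0,3}   # 2-3 letters plus optional extlangs
               |[a-z]{4,8})                     # longer reserved/registered forms
    (?:-(?P<script>[a-z]{4}))?                  # script, e.g. Hans, Latn
    (?:-(?P<region>[a-z]{2}|[0-9]{3}))?         # region, e.g. US, 419
    (?P<variants>(?:-(?:[a-z0-9]{5,8}|[0-9][a-z0-9]{3}))*)
    (?P<extensions>(?:-[0-9a-wy-z](?:-[a-z0-9]{2,8})+)*)
    (?:-(?P<privateuse>x(?:-[a-z0-9]{1,8})+))?
    """,
    re.VERBOSE | re.IGNORECASE,
)


def parse_langtag(tag):
    """Return the non-empty subtag groups, or None if the tag is not well-formed."""
    m = LANGTAG.fullmatch(tag)
    if m is None:
        return None
    # Drop unmatched groups and the leading hyphen of repeated groups.
    return {name: value.lstrip("-") for name, value in m.groupdict().items() if value}


print(parse_langtag("en-US-POSIX"))
print(parse_langtag("az-Arab-x-AZE-derbend"))
print(parse_langtag("en_US"))
```

Under this grammar en-US-POSIX parses as language `en`, region `US` and a five-letter variant `POSIX`, so it is at least well-formed. Whether `POSIX` is a registered variant is a separate question that only the IANA registry can answer.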

Removing this from the "Next" milestone for the moment...

Note to self: IANA Language Subtag Registry: http://www.iana.org/assignments/language-subtag-registry/language-subtag-registry

murata2makoto commented 7 years ago

Does the simple type xsd:language address this problem?

tofi86 commented 7 years ago

Looking at the examples at http://www.datypic.com/sc/xsd/t-xsd_language.html, this indeed seems a good way to go! I had only looked at this from a Java perspective, not from the schema validation point of view...

However, when looking at the specs, EPUB->OPF->DublinCore requires RFC 5646, which obsoletes the RFC that XML Schema refers to, right? So the Dublin Core metadata may allow more valid language codes than XML Schema can validate, although I don't have an example of that.

Still, if @mattgarrish, as our spec guru, agrees, I would give this a go and change the schema datatype to xsd:language.

mattgarrish commented 7 years ago

The schemas already enforce xsd:language constraints:

opf.dc.language = element dc:language { opf.id.attr? & datatype.languagecode }

datatype.languagecode = datatype.BCP47
datatype.BCP47 = xsd:language { pattern = "[a-zA-Z]{1,8}(-[a-zA-Z0-9]{1,8})*" }

But that just enforces the lexical constraint without trying to verify the validity of the segments. The request, as I understand it, is to go further and validate the segments.
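
As a rough illustration of how little the lexical constraint rules out, the pattern from the schema can be tried on its own in Python. This is only a sketch: the helper name `passes_schema_pattern` is hypothetical, and `re.fullmatch` is used because XML Schema patterns are implicitly anchored at both ends.

```python
import re

# Pattern copied verbatim from datatype.BCP47 in the schema excerpt above.
SCHEMA_PATTERN = re.compile(r"[a-zA-Z]{1,8}(-[a-zA-Z0-9]{1,8})*")


def passes_schema_pattern(value):
    """True if the whole value matches the schema's lexical pattern."""
    return SCHEMA_PATTERN.fullmatch(value) is not None


for candidate in ["pt-BR", "en-US-POSIX", "a-1", "en_US", "portuguese-br"]:
    print(candidate, passes_schema_pattern(candidate))
```

Both pt-BR and en-US-POSIX pass, and so does a string like a-1, which RFC 5646 would not accept as a language tag. Only values that break the basic shape, such as en_US (underscore) or portuguese-br (first segment longer than eight letters), are rejected.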

It would be great if that were done, but it seems like no small task and a perpetual moving target.

murata2makoto commented 7 years ago

It would be nice if meaningless tags such as en-US-POSIX were detected. But if some programming (as opposed to schema hacking) is required, I am not sure this is important enough.

tofi86 commented 6 years ago

Update: @kalaspuffar started working on this in PR #807. Review of the PR is welcome.

rdeltour commented 6 years ago

Unless we check the IANA registry, I don't think there's much we can do here beyond the lexical check performed by the schema.

xfq commented 4 years ago

Yes, checking if language tags are valid requires access to or a copy of the registry.
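
A minimal sketch of what such a registry lookup could look like, in Python. Everything here is an assumption for illustration: `SAMPLE_REGISTRY` is a hand-picked, abridged sample in the registry's record format (records separated by `%%`), not the real file, and `load_subtags` and `unknown_subtags` are hypothetical helpers. Folded continuation lines, subtag ranges, extlangs, extensions and private-use subtags are all ignored.

```python
# Abridged, hand-picked records in the registry's text format.
# Assumption: a real check would load the full IANA registry file instead.
SAMPLE_REGISTRY = """\
%%
Type: language
Subtag: en
Description: English
%%
Type: language
Subtag: pt
Description: Portuguese
%%
Type: region
Subtag: US
Description: United States
%%
Type: region
Subtag: BR
Description: Brazil
"""


def load_subtags(registry_text):
    """Collect (type, subtag) pairs from '%%'-separated registry records."""
    known = set()
    for record in registry_text.split("%%"):
        fields = dict(
            line.split(": ", 1) for line in record.strip().splitlines() if ": " in line
        )
        if "Type" in fields and "Subtag" in fields:
            known.add((fields["Type"], fields["Subtag"].lower()))
    return known


def unknown_subtags(tag, known):
    """List the (type, subtag) pairs of a well-formed tag that are not in `known`.

    Simplification: the subtag type is guessed from position and shape only.
    """
    missing = []
    for position, subtag in enumerate(tag.lower().split("-")):
        if position == 0:
            kind = "language"
        elif len(subtag) == 4 and subtag.isalpha():
            kind = "script"
        elif (len(subtag) == 2 and subtag.isalpha()) or (len(subtag) == 3 and subtag.isdigit()):
            kind = "region"
        else:
            kind = "variant"
        if (kind, subtag) not in known:
            missing.append((kind, subtag))
    return missing


known = load_subtags(SAMPLE_REGISTRY)
print(unknown_subtags("pt-BR", known))
print(unknown_subtags("en-US-POSIX", known))
```

Against this sample, pt-BR has no unknown subtags, while en-US-POSIX is flagged for its variant. The sample contains no variants at all, so the result only shows the mechanism. A real verdict needs the full registry, which also has to be kept up to date.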

I haven't checked EPUB 3.2, but the EPUB 3.0 spec text quoted in the first comment doesn't say whether the language tag must be well-formed or valid. The LTLI document from the W3C i18n WG contains some guidance on this.

mattgarrish commented 4 years ago

We had a long discussion about well-formed vs. valid for web publications, and the resulting consensus was that there is little value in enforcing validity. Reading systems will react or not based on whether they recognize the language, so ensuring the general pattern is followed is all that is necessary. This really should be clarified in the EPUB spec.

w3c / epubcheck

Validate content of <dc:language> #702