oasis-tcs / xliff-omos-jliff

OASIS XLIFF OMOS TC: JSON serialization of the XLIFF Abstract Object Model
https://github.com/oasis-tcs/xliff-omos-jliff

XML Invalid Characters #50

Open philinthecloud opened 2 years ago

philinthecloud commented 2 years ago

In the August 28th meeting we debated the use within JSON of control characters such as \u0000-\u0008, which are invalid in XML 1.0 (XML 1.1 permits U+0001 to U+0008, but only as character references, and still excludes U+0000).

This issue is by way of collecting some knowledge and examples to help the discussion.

Two Stack Overflow posts contain some information on XML invalid characters and valid JSON characters.

Roundtripping

If I convert the JLIFF document below to XLIFF and serialize it, I get a warning when I open the XLIFF in an XML editor. However, I could intercept control characters in my code and convert them to <cp> elements.

{
  "subgroups": [],
  "jliff": "2.1",
  "srcLang": "en-US",
  "trgLang": "de-DE",
  "files": [
    {
      "id": "file1",
      "kind": "file",
      "notes": [],
      "subfiles": [
        {
          "canResegment": "no",
          "id": "unit1",
          "kind": "unit",
          "its_locQualityIssuesArray": [],
          "notes": [],
          "subunits": [
            {
              "canResegment": "no",
              "kind": "segment",
              "source": [
                {
                  "text": "\u0007 Target 1"
                }
              ],
              "target": [
                {
                  "text": "Target 1"
                }
              ]
            }
          ]
        }
      ]
    }
  ]
}
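
A minimal sketch of the interception mentioned above, assuming Python and its standard ElementTree module. The helper name text_to_xliff is hypothetical, the XLIFF namespace is omitted for brevity, and only plain text is handled (no other inline elements). Characters outside the XML 1.0 Char production are replaced with cp elements; everything else is copied through.

```python
import re
import xml.etree.ElementTree as ET

# Complement of the XML 1.0 "Char" production: anything matched here
# cannot appear literally in an XML 1.0 document.
XML10_INVALID = re.compile(
    "[^\t\n\r\u0020-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]"
)

def text_to_xliff(tag: str, text: str) -> ET.Element:
    """Build a <source>/<target> element (hypothetical helper), replacing
    each XML-invalid character with <cp hex="XXXX"/>."""
    elem = ET.Element(tag)
    elem.text = ""
    last_cp = None  # most recently appended <cp>, if any
    pos = 0
    for match in XML10_INVALID.finditer(text):
        chunk = text[pos:match.start()]
        if last_cp is None:
            elem.text += chunk      # text before the first <cp>
        else:
            last_cp.tail = chunk    # text between two <cp> elements
        last_cp = ET.SubElement(elem, "cp", hex=format(ord(match.group()), "04X"))
        pos = match.end()
    rest = text[pos:]
    if last_cp is None:
        elem.text += rest
    else:
        last_cp.tail = rest
    return elem

# The source text from the JLIFF document above.
source = text_to_xliff("source", "\u0007 Target 1")
print(ET.tostring(source, encoding="unicode"))
```

The serialized element no longer contains the raw BEL character, so an XML 1.0 parser should accept it.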

DavidF commented 2 years ago

@philinthecloud I agree (and it should be part of the roundtripping appendix) that XML-illegal characters MUST be intercepted and replaced with <cp> when converting to XLIFF. The issue is that we naively assumed that XML-illegal characters are simply legal as literals in JSON, so we haven't specified a mechanism to wrap XML illegals in JLIFF. I think we need to figure out whether there is a subset of XML illegals that are JSON-legal, and which XML illegals (expressed as <cp> in XLIFF or LDML) MUST be replaced with \u-prefixed hex codes when converting to JSON/JLIFF. We then need to describe this as the canonical behavior, at least in the roundtrip appendix.
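
One way to explore that question is to enumerate the code points directly. The sketch below is only an illustration under stated assumptions: it uses the XML 1.0 Char production for "XML legal" and the RFC 8259 string grammar for "must be escaped in JSON" (quotation mark, reverse solidus, and U+0000 to U+001F). Surrogate code points are left out because they cannot be encoded on their own in UTF-8. All function names are hypothetical.

```python
def is_xml10_char(cp: int) -> bool:
    """True if the code point matches the XML 1.0 Char production."""
    return (
        cp in (0x9, 0xA, 0xD)
        or 0x20 <= cp <= 0xD7FF
        or 0xE000 <= cp <= 0xFFFD
        or 0x10000 <= cp <= 0x10FFFF
    )

def json_requires_escape(cp: int) -> bool:
    """True if RFC 8259 forbids the code point as a literal inside a string."""
    return cp <= 0x1F or cp in (0x22, 0x5C)

def is_surrogate(cp: int) -> bool:
    return 0xD800 <= cp <= 0xDFFF

# XML-illegal code points, ignoring surrogates.
xml_illegal = [
    cp for cp in range(0x110000)
    if not is_xml10_char(cp) and not is_surrogate(cp)
]
# Split them by whether JSON allows them as literals.
must_escape = [cp for cp in xml_illegal if json_requires_escape(cp)]
literal_ok = [cp for cp in xml_illegal if not json_requires_escape(cp)]

print("XML-illegal, must be \\u-escaped in JSON:",
      " ".join(f"U+{cp:04X}" for cp in must_escape))
print("XML-illegal, allowed as JSON literals:",
      " ".join(f"U+{cp:04X}" for cp in literal_ok))
```

Under these assumptions the first group is the C0 controls other than tab, line feed, and carriage return, and the second group is just U+FFFE and U+FFFF.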

genivia-inc commented 2 years ago

There are no limits on the characters allowed in JSON property names and strings. Any character is valid, since special characters can be escaped (e.g. \b) and others can be represented by \u codes when not written directly in the UTF-encoded text. Invalid Unicode characters, such as the surrogates U+D800 to U+DFFF, should be avoided. I don't see the issue. The issue is with XLIFF: proper translation from JLIFF to XLIFF requires <cp> in XLIFF.
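
As a quick illustration of the escaping described here, the sketch below uses Python's standard json module, which escapes control characters on output and restores them on input. The has_lone_surrogate helper is hypothetical and only shows one way to flag the surrogate code points that should be avoided.

```python
import json

def has_lone_surrogate(text: str) -> bool:
    """Flag strings containing surrogate code points U+D800 to U+DFFF
    (hypothetical helper; Python strings can hold them unpaired)."""
    return any(0xD800 <= ord(ch) <= 0xDFFF for ch in text)

original = {"text": "\u0007 Target 1"}

# json.dumps writes the control character as a \u escape...
encoded = json.dumps(original)
print(encoded)

# ...and json.loads turns the escape back into the same character.
assert json.loads(encoded) == original

# Characters with a short escape use it instead of the \u form.
print(json.dumps("\b"))

# A lone surrogate is flagged; a supplementary-plane character is not.
print(has_lone_surrogate("\ud800"), has_lone_surrogate("\U0001F600"))
```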

genivia-inc commented 2 years ago

The latest commit shows how to naturally represent "illegal characters" in JSON. Because JSON has no illegal characters, there are no issues and there is no confusion about how this mapping should work in practice.

The "Roundtripping XLIFF <cp>" example:

XLIFF:

<unit id="1">
  <segment>
    <source>Ctrl+C=<cp hex="0003"/></source>
  </segment>
</unit>

JLIFF:

"subunits": [
  {
    "id": "1",
    "kind": "segment",
    "source": [
      { "text": "Ctrl+C=\u0003" }
    ]
  }
]
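
The XLIFF-to-JLIFF direction of this example can be sketched as follows. This is not the TC's reference implementation. It assumes Python's standard ElementTree and json modules, omits the XLIFF namespace, and handles only text and cp inside source (other inline elements and attributes are not mapped). The helper name inline_text is hypothetical.

```python
import json
import xml.etree.ElementTree as ET

XLIFF_UNIT = """
<unit id="1">
  <segment>
    <source>Ctrl+C=<cp hex="0003"/></source>
  </segment>
</unit>
"""

def inline_text(elem: ET.Element) -> str:
    """Flatten text content, turning each <cp hex="..."/> into its character
    (hypothetical helper; other inline elements are skipped)."""
    parts = [elem.text or ""]
    for child in elem:
        if child.tag == "cp":
            parts.append(chr(int(child.get("hex"), 16)))
        parts.append(child.tail or "")
    return "".join(parts)

unit = ET.fromstring(XLIFF_UNIT)
subunits = [
    {"kind": "segment", "source": [{"text": inline_text(seg.find("source"))}]}
    for seg in unit.findall("segment")
]

# json.dumps emits the control character as a \u escape.
print(json.dumps({"subunits": subunits}, indent=2))
```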