ivoa-std / VOTable

VOTable Format Definition
Update VOTable to handle UTF-8 #55

Open Zarquan opened 9 months ago

Zarquan commented 9 months ago

As part of the group looking at updating our standards to be compatible with 2020 technologies, I propose that we update the VOTable standard to handle the full UTF-8 character set.

Issue DALI#33 is looking at adding support for xtype="json".

If we do adopt this new xtype, it allows a client to create a VOTable column with datatype="unicodeChar", arraysize="*", xtype="json".
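
For illustration, a FIELD declaration carrying these three attributes could be generated as below. This is only a sketch: the column name `payload` is a hypothetical placeholder, and xtype="json" is still a proposal under DALI#33, not an adopted value.

```python
import xml.etree.ElementTree as ET

# "payload" is a made-up column name; the other three attributes
# are the ones named in the proposal above.
field = ET.Element(
    "FIELD",
    {
        "name": "payload",
        "datatype": "unicodeChar",
        "arraysize": "*",
        "xtype": "json",
    },
)

# Serialise the element to a text string so it can be inspected.
field_xml = ET.tostring(field, encoding="unicode")
print(field_xml)
```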

This implies that the client can populate this column with ANY valid JSON document, including JSON content that contains non-ASCII characters encoded as UTF-8, and upload it to a TAP service.

Using the current VOTable standard, some of the UTF-8 characters may end up being truncated to fit into the UCS-2 character set, which is not the expected behaviour.
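
A small sketch of why this matters: a unicodeChar cell holds two-byte code units, so only characters in the Basic Multilingual Plane (U+0000 through U+FFFF) fit in one unit. The helper name `fits_in_ucs2` and the sample strings are illustrative assumptions; what a given VOTable library actually does with an out-of-range character is implementation dependent.

```python
def fits_in_ucs2(text: str) -> bool:
    """Return True if every character is in the Basic Multilingual Plane."""
    return all(ord(ch) <= 0xFFFF for ch in text)


bmp_text = "\u00c5ngstr\u00f6m"   # "Angstrom" with A-ring and o-umlaut, all in the BMP
astral_text = "\U0001F52D"        # TELESCOPE (U+1F52D), outside the BMP

print(fits_in_ucs2(bmp_text))
print(fits_in_ucs2(astral_text))

# In UTF-16 the astral character needs a surrogate pair, i.e. two
# 16-bit code units, so it cannot be stored as one 2-byte UCS-2 value.
astral_utf16 = astral_text.encode("utf-16-be")
print(len(astral_utf16))
```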

To resolve this:

  1. Any changes to the DALI documents that propose xtype="json" MUST include a caveat in the text that explicitly restricts the JSON content to the UCS-2 character set.
  2. We work to develop a new version of the VOTable standard which includes support for the full UTF-8 character set.

Zarquan commented 9 months ago

(1) is deliberately awkward.

Making the restriction glaringly obvious in the DALI document prevents us from endorsing xtype="json" in DALI, promising to fix VOTable at a later date, and then never doing anything about it.

msdemlei commented 9 months ago

On Fri, Dec 15, 2023 at 08:03:40AM -0800, Zarquan wrote:

> Using the current VOTable standard, some of the UTF-8 characters may end up being truncated to fit into the UCS-2 character set, which is not the expected behaviour.

First off, I'd truly like to get rid of VOTable's UCS-2 legacy (it's been obsolete for ages), too. But given that unicodeChar is a bit of an oddity, I don't think we want to do a non-compatible (major-version-pushing) VOTable change just because of this.

But then we shouldn't be using VOTable Unicode encodings for JSON anyway.

> To resolve this:
>
>   1. Any changes to the DALI documents that propose xtype="json" MUST include a caveat in the text that explicitly restricts the JSON content to the UCS-2 character set.

No, we should say that people should use char and avoid unicodeChar for JSON (I'd probably even forbid unicodeChar). JSON is designed so that it can work with pure ASCII, and we should make that a must so that we don't paint ourselves into the ugly UCS-2 corner of unicodeChar. RFC 8259 says:

> Any character may be escaped. If the character is in the Basic Multilingual Plane (U+0000 through U+FFFF), then it may be represented as a six-character sequence: a reverse solidus, followed by the lowercase letter u, followed by four hexadecimal digits that encode the character's code point. The hexadecimal letters A through F can be uppercase or lowercase. So, for example, a string containing only a single reverse solidus character may be represented as "\u005C".
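
As a rough sketch of what "pure ASCII JSON" looks like in practice: Python's standard `json.dumps` escapes every non-ASCII character by default (`ensure_ascii=True`). Characters outside the Basic Multilingual Plane are written as a UTF-16 surrogate pair of two `\uXXXX` escapes, which RFC 8259 also allows. The sample document and its keys are made up for illustration.

```python
import json

# Hypothetical sample document with one BMP string and one non-BMP character.
doc = {
    "name": "\u00c5ngstr\u00f6m",   # A-ring and o-umlaut, both in the BMP
    "icon": "\U0001F52D",            # TELESCOPE (U+1F52D), outside the BMP
}

# ensure_ascii=True is the default: all non-ASCII characters are escaped.
ascii_json = json.dumps(doc)
print(ascii_json)

# The serialised text is plain ASCII, so it fits a VOTable char[] column,
# and parsing it restores the original Unicode strings.
print(ascii_json.isascii())
round_tripped = json.loads(ascii_json)
print(round_tripped == doc)
```

The escaping happens at the JSON level only, so nothing is lost: any conforming JSON parser turns the escapes back into the original characters.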

>   2. We work to develop a new version of the VOTable standard which includes support for the full UTF-8 character set.

I won't stop you, but note that in the entire metadata part you can already use whatever Unicode you want; it's just in unicodeChar FIELD data that you're not allowed to (and that you can't in BINARY(2)).

If you ask me: We should just allow UTF-8 in char[] BINARY2 fields, use native encoding in char[] TABLEDATA and deprecate unicodeChar (and BINARY, but that's tangential).
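
One consequence of putting UTF-8 into char[] is that the number of bytes no longer equals the number of characters. The sketch below only illustrates that difference; whether arraysize would count bytes or characters under such a change is an assumption left open here, not something this thread settles.

```python
# Sample text: 8 letters, a space, and one non-BMP character.
text = "\u00c5ngstr\u00f6m \U0001F52D"

utf8_bytes = text.encode("utf-8")

char_count = len(text)        # Unicode code points
byte_count = len(utf8_bytes)  # bytes a UTF-8 encoded char[] cell would hold

print(char_count, byte_count)

# Decoding the bytes gives back the original text unchanged.
decoded = utf8_bytes.decode("utf-8")
print(decoded == text)
```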