ivoa-std / VOTable

VOTable Format Definition

Update text to explicitly state white space is preserved in strings #54

Closed Zarquan closed 8 months ago

Zarquan commented 8 months ago

As part of the group looking at updating our standards to be compatible with 2020 technologies, I propose that we update the VOTable standard to preserve spaces in char[*] and unicodeChar[*] columns.

Partly because this allows us to handle columns with xtype="yaml", but more importantly, the fact that VOTable does not preserve spaces in strings is a side effect of an old XML serialization schema, and should not be part of the standard.

I think it is a reasonable expectation that a client can create a VOTable with a unicodeChar[*] column, send it to a TAP upload, use it in a JOIN query, and then download the results, and that the resulting unicodeChar[*] column MUST contain the same string of characters as the original, including the white space.
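As a rough illustration of that expectation, here is a minimal sketch that round-trips a single cell value through XML serialization and parsing using only the Python standard library. It covers the XML layer only and stands in for the full upload/JOIN/download cycle; it does not talk to a TAP service. The helper name `roundtrip_cell` is hypothetical.

```python
import xml.etree.ElementTree as ET


def roundtrip_cell(value: str) -> str:
    """Serialize a string as a <TD> cell, parse it back, and return the text.

    Hypothetical helper for illustration; it models only the XML layer.
    """
    td = ET.Element("TD")
    td.text = value
    serialized = ET.tostring(td, encoding="unicode")
    # An empty element parses back with text None, so map that to "".
    return ET.fromstring(serialized).text or ""


original = "  two leading spaces, one trailing "
result = roundtrip_cell(original)
print(repr(result))
```

If white space is preserved, `result` compares equal to `original`, character for character.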

msdemlei commented 8 months ago

On Fri, Dec 15, 2023 at 07:45:49AM -0800, Zarquan wrote:

As part of the group looking at updating our standards to be compatible with 2020 technologies, I propose that we update the VOTable standard to preserve spaces in char[*] and unicodeChar[*] columns.

Uh, since I'm probably to blame for this bug, let me put in a disclaimer: As far as the standard goes, we do preserve whitespace in all our encodings.

Whitespace might be mangled if external XML tools (e.g., a non-XSD-aware pretty-printer) are run on the TABLEDATA VOTables. People of course shouldn't do that.

But since of course we can't keep them from doing that, I'd still advocate that whitespace-robustness would be a bonus when we design new stuff.

Anyway: I think this bug can be closed as false-alarm; at least the thing I was mentioning over at DALI isn't something we can fix in VOTable.

Zarquan commented 8 months ago

It looks like the relevant part of the current specification is at the end of Section 5.1, TABLEDATA Serialization:

... while for numeric data types the amount of white spaces does not matter (...), the white space is significant for "char" or "unicodeChar" datatypes, and for instance <TD>Apple</TD> and <TD> Apple</TD> are not identical.

I think this does say that white space should be preserved in the XML serialization, but to a non-technical reader it isn't that clear.

It would be clearer if we had a separate section for white-space that explicitly said that white space is preserved in "char" and "unicodeChar" datatypes in all serializations.
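To make the quoted rule concrete, here is a small sketch that parses a stripped-down TABLEDATA fragment with Python's standard XML parser. The fragment is an assumption for illustration: it omits the VOTable namespace and the FIELD declarations, and treats the first cell as a "char" value and the second as a numeric value.

```python
import xml.etree.ElementTree as ET

# Simplified fragment: no namespace, no FIELD metadata (assumption for brevity).
FRAGMENT = """<TABLEDATA>
  <TR><TD>Apple</TD><TD> 1.5 </TD></TR>
  <TR><TD> Apple</TD><TD>1.5</TD></TR>
</TABLEDATA>"""

rows = [
    [td.text for td in tr.findall("TD")]
    for tr in ET.fromstring(FRAGMENT).findall("TR")
]

# String cells are taken verbatim, so the leading space is significant.
names = [row[0] for row in rows]
# Numeric cells are converted, so surrounding white space does not matter.
numbers = [float(row[1]) for row in rows]

print(names[0] == names[1])      # string cells differ
print(numbers[0] == numbers[1])  # numeric cells agree
```

The parser hands back the cell text unchanged; whether white space matters is decided by how each datatype is interpreted afterwards.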

Zarquan commented 8 months ago

Digging deeper, white space is preserved in the XML serialization because the XML schema contains the following:

<xs:complexType name="Td">
  <xs:simpleContent>
    <xs:extension base="xs:string">
      .... 
      .... 
    </xs:extension>
  </xs:simpleContent>
</xs:complexType>

Which is technically correct, but not particularly easy to find.
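For readers less familiar with XSD, the reason `base="xs:string"` matters is that `xs:string` has a whiteSpace facet of `preserve`, whereas `xs:normalizedString` uses `replace` and `xs:token` uses `collapse`. The sketch below emulates those three behaviours in plain Python so the difference is visible; the function name `apply_whitespace_facet` is hypothetical and this is not a schema validator.

```python
import re


def apply_whitespace_facet(value: str, facet: str) -> str:
    """Emulate the XSD whiteSpace facet on a lexical value (illustration only)."""
    if facet == "preserve":
        # xs:string: the value is left untouched.
        return value
    # "replace": each tab, line feed, and carriage return becomes a space.
    replaced = re.sub(r"[\t\n\r]", " ", value)
    if facet == "replace":
        return replaced
    if facet == "collapse":
        # "collapse": after replacement, runs of spaces shrink to one space
        # and leading/trailing spaces are removed.
        return re.sub(r" +", " ", replaced).strip(" ")
    raise ValueError(f"unknown whiteSpace facet: {facet!r}")


cell = " Apple\t pie "
for facet in ("preserve", "replace", "collapse"):
    print(facet, repr(apply_whitespace_facet(cell, facet)))
```

Had the schema derived `Td` from `xs:token` instead, a schema-aware processor would be entitled to treat ` Apple` and `Apple` as the same value.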

Zarquan commented 8 months ago

Changed the title to reflect the fact that I didn't know that spaces are preserved.

Zarquan commented 8 months ago

Markus is right - my bad.