Unicode Separated Values support

domel commented 4 months ago

I propose an enhancement to how quoted triples/triple terms are handled in our data representation, specifically their integration into the CSV and TSV formats. As you are aware, both CSV and TSV are flat data formats that support the nested nature of quoted triples/triple terms poorly. This limitation makes it hard to represent hierarchical data structures in tabular form, which is a crucial requirement for various data exchange and processing scenarios.

Quoted triples/triple terms, given their nested character, require a more flexible and inherently hierarchical format to be represented efficiently in a tabular manner. To address this, I propose the adoption of Unicode Separated Values (USV) as a new result format for handling such cases.

Unicode Separated Values (USV) is a data format (currently described in an IETF Internet-Draft, not yet an RFC) designed for exchanging and converting data between various spreadsheet programs, databases, and streaming data services. The key advantage of USV over traditional flat formats like CSV and TSV is its ability to define groups, which can more naturally represent the structure of quoted triples/triple terms within a tabular context. This would make data representation and exchange more intuitive, especially in applications involving complex graph-based data structures.
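As a rough illustration of the grouping idea, here is a minimal Python sketch that ends each cell with a unit separator, each row with a record separator, and each group of rows with a group separator. It assumes the separators are the Unicode control pictures U+241F, U+241E and U+241D, which is my reading of the current draft; the function names are hypothetical, and escaping of separators inside cells is left out.

```python
# Assumed separator code points (Unicode control pictures).
# Check the current draft before relying on these values.
US = "\u241F"  # unit separator: ends a cell
RS = "\u241E"  # record separator: ends a row
GS = "\u241D"  # group separator: ends a group of rows


def encode_groups(groups):
    """Encode a list of groups; each group is a list of rows of string cells."""
    out = []
    for rows in groups:
        for cells in rows:
            # Every cell is terminated by US, every row by RS.
            out.append("".join(cell + US for cell in cells) + RS)
        out.append(GS)
    return "".join(out)


def decode_groups(text):
    """Inverse of encode_groups, for text with no separators inside cells."""
    groups = []
    # Each separator terminates its item, so the final split piece is empty.
    for group in text.split(GS)[:-1]:
        rows = [record.split(US)[:-1] for record in group.split(RS)[:-1]]
        groups.append(rows)
    return groups


if __name__ == "__main__":
    data = [
        [["http://example/alice", "http://example/knows", "http://example/bob"]],
        [["Alice"]],
    ]
    assert decode_groups(encode_groups(data)) == data
```

A quoted triple could, for instance, be carried as its own group of three units, but how groups should map onto result cells is exactly what would need specifying.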

See also:

Spec
Docs

domel commented 4 months ago

For example:

x<US>quoted<ESC>
"Alice"<US><RS>http://example/alice<US>http://example/knows<US>http://example/bob<RS><ESC>
"Bob"<US><RS>http://example/bob<US>http://example/knows<US>http://example/alice<RS><ESC>
"Carol"<US><RS>http://example/carol<US>http://example/says<US>""Hello world, my name is """"Alice"""".""<RS><ESC>

afs commented 4 months ago

It looks like a useful format for machine-to-machine transmission of results where significant literal strings are involved. However, it's still a draft. What is the uptake?

An approach could be to define an abstraction of resultset-vars-rows-cells and then map this to CSV, TSV, USV and other realisation formats (not affecting JSON and XML, because they exist already and compatibility matters).

This would be a framework for binary forms, e.g. protobuf, which is significantly faster to process because strings carry length indicators.
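A minimal Python sketch of that abstraction, under stated assumptions: `ResultSet`, `to_delimited` and `to_usv` are hypothetical names, cells are treated as already-serialized strings, and the term-encoding rules of the actual SPARQL CSV and TSV result formats are not modelled. The USV separators are assumed to be U+241F and U+241E.

```python
import csv
import io
from dataclasses import dataclass
from typing import List


@dataclass
class ResultSet:
    """Abstract result set: variable names plus rows of string cells."""
    variables: List[str]
    rows: List[List[str]]


def to_delimited(result, delimiter):
    """Realise the result set as delimiter-separated text (e.g. ',' or a tab)."""
    buf = io.StringIO()
    writer = csv.writer(buf, delimiter=delimiter, lineterminator="\n")
    writer.writerow(result.variables)
    writer.writerows(result.rows)
    return buf.getvalue()


def to_usv(result, unit="\u241F", record="\u241E"):
    """Realise the same result set with assumed USV separators (no escaping)."""
    lines = [result.variables] + result.rows
    return "".join("".join(cell + unit for cell in cells) + record for cells in lines)


if __name__ == "__main__":
    result = ResultSet(["x", "y"], [["a", "b,c"]])
    print(to_delimited(result, ","))
    print(to_delimited(result, "\t"))
    print(to_usv(result))
```

The point of the sketch is only that each concrete format becomes one small function over the same vars-rows-cells structure; a binary realisation would be another such function.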

domel commented 4 months ago

The concept you've described, particularly focusing on the abstraction of resultset-vars-rows-cells for mapping to various realization formats like CSV, TSV, and USV, and its potential application to binary forms such as protobuf, presents a forward-thinking approach to data serialization and interchange. The idea of establishing a high-level abstraction for data representation that could seamlessly adapt to multiple formats is indeed promising.

domel commented 4 months ago

USV seems to be quite popular for its age, as do its related variants such as ASCII Separated Values (ASV), a.k.a. DEL (Delimited ASCII).

w3c / sparql-results-csv-tsv

Unicode Separated Values support #31