w3c / sparql-dev

SPARQL dev Community Group
https://w3c.github.io/sparql-dev/

Binary encodings for SPARQL results #82

Open afs opened 5 years ago

afs commented 5 years ago

Binary encodings of formats with a significant proportion of strings are considerably faster for protocol transfer, especially on reading.

This feature would provide new formats for the content types used with SPARQL, using a +binary suffix on the MIME type, where +binary stands for one of a small number of documented encoding choices:

application/sparql-results+binary
application/n-triples+binary
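
For illustration, a client would negotiate for a binary result format the same way it negotiates today, with a text type as fallback; compression stays orthogonal and is negotiated via Accept-Encoding. A minimal sketch in Java (the endpoint URL and the +binary media type are assumptions from this proposal, not registered types):

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class NegotiateBinary {
    public static void main(String[] args) throws Exception {
        HttpRequest req = HttpRequest.newBuilder(
                URI.create("http://example.org/sparql?query=SELECT%20%2A%20%7B%3Fs%20%3Fp%20%3Fo%7D"))
            // Prefer the hypothetical binary type; fall back to JSON results.
            .header("Accept",
                "application/sparql-results+binary, application/sparql-results+json;q=0.5")
            // Compression is orthogonal to the result format.
            .header("Accept-Encoding", "gzip")
            .build();
        HttpResponse<byte[]> resp = HttpClient.newHttpClient()
            .send(req, HttpResponse.BodyHandlers.ofByteArray());
        // Dispatch on whichever type the server actually chose.
        System.out.println(resp.headers().firstValue("Content-Type").orElse(""));
    }
}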

Deciding on one choice is a thankless task. Inventing a new format is hard and time-consuming, and maintaining one is a burden.

Therefore, some common existing formats should be used, with SPARQL results described in each of these formats.

Incomplete list of possibilities (major systems only):

Why?

Binary formats are more efficient at scale.

Compression should not be bundled into this feature. Compression can slow down results transfer, and HTTP already has mechanisms for asking for compression (Accept-Encoding / Content-Encoding, as in the sketch above).

Previous work

Jena uses Apache Thrift (because of its coverage of many programming languages). Speed experiments did not show a significant speed difference between the candidate encodings.

https://jena.apache.org/documentation/io/rdf-binary.html
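
For reference, RDF Binary is just another RIOT syntax in Jena, so reading and writing it goes through the usual RDFDataMgr calls; a small round-trip sketch (file names are placeholders):

import java.io.FileOutputStream;
import java.io.OutputStream;
import org.apache.jena.rdf.model.Model;
import org.apache.jena.riot.Lang;
import org.apache.jena.riot.RDFDataMgr;
import org.apache.jena.riot.RDFFormat;

public class TrdfRoundTrip {
    public static void main(String[] args) throws Exception {
        // Parse Turtle, then write RDF Binary (Thrift-encoded).
        Model model = RDFDataMgr.loadModel("data.ttl");
        try (OutputStream out = new FileOutputStream("data.trdf")) {
            RDFDataMgr.write(out, model, RDFFormat.RDF_THRIFT);
        }
        // Read it back, stating the language explicitly.
        Model copy = RDFDataMgr.loadModel("data.trdf", Lang.RDFTHRIFT);
        System.out.println("Round-tripped " + copy.size() + " triples");
    }
}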

Considerations for backward compatibility

None, if content negotiation is correctly performed.

"Good practice" should be to provide the existing text-base types as well.

afs commented 5 years ago

In my experiments, speed-ups of 2x-4x are common. This seems to be due to better string encoding, as "length,bytes", rather than the per-character processing a parser must do for a text format.
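
Schematically (an illustration of the idea, not Thrift's exact wire format - Thrift's compact protocol uses a varint length), a length-prefixed string is one bulk read plus one decode, whereas a text parser must scan character by character for delimiters and escapes:

import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;

public class LengthPrefixed {
    // "length,bytes": a single bulk read, then one UTF-8 decode. No scanning
    // for a closing quote and no unescaping of \uXXXX sequences.
    static String readString(DataInputStream in) throws IOException {
        int len = in.readInt();
        byte[] buf = new byte[len];
        in.readFully(buf);
        return new String(buf, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(bytes);
        byte[] s = "http://example.org/resource".getBytes(StandardCharsets.UTF_8);
        out.writeInt(s.length);
        out.write(s);
        DataInputStream in = new DataInputStream(new ByteArrayInputStream(bytes.toByteArray()));
        System.out.println(readString(in));
    }
}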

akuckartz commented 5 years ago

What exactly is the use case?

afs commented 5 years ago

@akuckartz - efficiency of large results.

jindrichmynarz commented 5 years ago

@afs, I'd also be interested in learning more about the efficiency of Thrift encoding for RDF. Is there an overview of the efficiency gains this encoding achieves in Jena?

I played with Apache Jena's riot implementation of Thrift:

du -h test.ttl
# 9.6M  test.ttl

riot --time --output TRDF test.ttl > test.trdf
# test.ttl :  2.70 sec : 153,676 Triples : 56,938.13 per second

du -h test.trdf
# 21M   test.trdf

time gzip test.ttl
# real  0m2.347s
# user  0m0.092s
# sys   0m0.007s

du -h test.ttl.gz
# 344K  test.ttl.gz

riot --time --sink test.trdf
# test.trdf :  0.80 sec : 153,676 Triples : 191,139.30 per second

riot --time --sink test.ttl.gz
# test.ttl.gz :  1.24 sec : 153,676 Triples : 123,732.69 per second

What this simplistic exploration suggests to me is that although Thrift is somewhat faster to parse than a gzipped text format (0.80 sec vs. 1.24 sec), it requires significantly more bandwidth to transfer (21M vs. 344K). Could this indicate that Thrift is a good fit for cases with low network latency but expensive CPU time?

afs commented 5 years ago

Hi @jindrichmynarz - it looks like 150k triples isn't enough to show the effects, because of the Java warmup costs. There are two ways to get figures: either a small Java program that calls riot.main several times, taking the figures from the last run so that class loading and JIT compilation are excluded from the timed run (see the sketch below), or larger data to amortize the class-loading and JIT costs. Using larger data is easier. The comment below has figures from using 25M triples - the largest file is the fastest.
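
Something like this for the first approach (a sketch; riotcmd.riot is assumed as the command-line entry point from the jena-cmds module - check the class name for the Jena version in use):

// Run the parse several times inside one JVM and use only the last
// timing, so class loading and JIT compilation are excluded.
public class WarmParse {
    public static void main(String[] args) {
        for (int i = 0; i < 5; i++) {
            riotcmd.riot.main("--time", "--sink", "test.trdf");
        }
    }
}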

For returning SPARQL results, compression would be "one-time", so in many cases it increases the overall time. Latency is less of a factor than bandwidth. If the system is in a datacenter, as compared to the public internet, the bandwidth is obviously better.

Writing is not faster. Work is done during writing to make reading efficient, and (mainly due to API issues) that can add work on the write path. That said, writing can be done on a separate thread, as sketched below.
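
A sketch of that last point, moving the encoding work off the request thread with a pipe (generic Java, not Jena's actual server code):

import java.io.IOException;
import java.io.PipedInputStream;
import java.io.PipedOutputStream;
import org.apache.jena.rdf.model.Model;
import org.apache.jena.riot.RDFDataMgr;
import org.apache.jena.riot.RDFFormat;

public class ThreadedWrite {
    // Returns a stream the HTTP layer can send to the client while a
    // background thread produces the binary encoding.
    static PipedInputStream writeAsync(Model model) throws IOException {
        PipedOutputStream out = new PipedOutputStream();
        PipedInputStream in = new PipedInputStream(out);
        Thread writer = new Thread(() -> {
            try (out) {
                RDFDataMgr.write(out, model, RDFFormat.RDF_THRIFT);
            } catch (IOException e) {
                e.printStackTrace();
            }
        });
        writer.start();
        return in;
    }
}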

afs commented 5 years ago

Parsing BSBM : 25 million N-triples

System

Intel® Core™ i7-4770 CPU @ 3.40GHz × 8
32G RAM; about 20G free RAM for file system cache. Ubuntu 19.04
SSD SATA-3

OpenJDK 64-Bit Server VM (build 11.0.3+7-Ubuntu-1ubuntu1, mixed mode, sharing)
Apache Jena 3.11.0

Summary

File             | File size | Rate
-----------------|-----------|---------
bsbm-25m.nt      | 6.1G      | 251 kTPS
bsbm-25m.nt.gz   | 660M      | 232 kTPS
bsbm-25m.trdf    | 6.3G      | 977 kTPS
bsbm-25m.trdf.gz | 684M      | 757 kTPS

kTPS = thousands of triples per second. Rates are taken from the second run in the figures below.

riot

riot --sink does the full parse, creating all the RDF terms and triples, which are then passed to a null output stream.

Results

Run once per file:

riot --time --sink bsbm-25m.nt
bsbm-25m.nt : 102.43 sec : 24,997,044 Triples : 244,052.17 per second

riot --time --sink bsbm-25m.nt.gz
bsbm-25m.nt.gz : 107.80 sec : 24,997,044 Triples : 231,892.13 per second

riot --time --sink bsbm-25m.trdf
bsbm-25m.trdf : 26.91 sec : 24,997,044 Triples : 928,981.86 per second

riot --time --sink bsbm-25m.trdf.gz
bsbm-25m.trdf.gz : 32.67 sec : 24,997,044 Triples : 765,067.30 per second

Run twice per file (disk-cache warming); the second figure is the more stable result:

riot --time --sink bsbm-25m.nt
bsbm-25m.nt : 99.46 sec : 24,997,044 Triples : 251,335.19 per second
riot --time --sink bsbm-25m.nt
bsbm-25m.nt : 99.29 sec : 24,997,044 Triples : 251,762.99 per second

riot --time --sink bsbm-25m.nt.gz
bsbm-25m.nt.gz : 108.44 sec : 24,997,044 Triples : 230,506.47 per second
riot --time --sink bsbm-25m.nt.gz
bsbm-25m.nt.gz : 107.55 sec : 24,997,044 Triples : 232,424.70 per second

riot --time --sink bsbm-25m.trdf
bsbm-25m.trdf : 26.04 sec : 24,997,044 Triples : 959,800.49 per second
riot --time --sink bsbm-25m.trdf
bsbm-25m.trdf : 25.58 sec : 24,997,044 Triples : 977,325.10 per second

riot --time --sink bsbm-25m.trdf.gz ; riot --time --sink bsbm-25m.trdf.gz
bsbm-25m.trdf.gz : 33.01 sec : 24,997,044 Triples : 757,187.90 per second
riot --time --sink bsbm-25m.trdf.gz ; riot --time --sink bsbm-25m.trdf.gz
bsbm-25m.trdf.gz : 33.00 sec : 24,997,044 Triples : 757,417.33 per second

Separate "gzip -d":

gzip -d < bsbm-25m.trdf.gz | riot --time --sink --syntax trdf --
- : 30.50 sec : 24,997,044 Triples : 819,602.09 per second

gzip -d < bsbm-25m.nt.gz | riot --time --sink --syntax nt --
- : 101.90 sec : 24,997,044 Triples : 245,316.78 per second

Notes

Compression

Default gzip settings.

time gzip < bsbm-25m.trdf > bsbm-25m.trdf.gz
real   1m27.955s  => 24,997,044 triples / 88.0 sec ≈ 284 kTPS
user   1m26.893s
sys    0m1.028s

jindrichmynarz commented 5 years ago

There's also a binary RDF format implemented by RDF4J.

ebremer commented 5 years ago

I would also add HDT (http://www.rdfhdt.org/) and FlatBuffers (https://google.github.io/flatbuffers/).

VladimirAlexiev commented 5 years ago

How about EXI, which is a bit-coded XML variant, related to CoAP for constrained devices? I think it's used in some Web of Things developments.

rvesse commented 5 years ago

Also might want to consider Apache Arrow

abrokenjester commented 4 years ago

In addition to the RDF4J Binary RDF format, there is also an RDF4J binary format for query results.

ericprud commented 4 years ago

@jeenbroekstra , I got a 404 on that first link, but RDF4J Binary RDF format (https://rdf4j.org/documentation/reference/rdf4j-binary/) worked.

amivanoff commented 1 year ago

> Also might want to consider Apache Arrow

It is also worth mentioning Arrow Flight SQL, an RPC protocol for SQL databases with a strong emphasis on zero-copy data, using Arrow Flight RPC as the data transport and Apache Arrow ADBC on the client side with a columnar API.