Open afs opened 5 years ago
In my experiments speed ups of x2-x4 are common. This seems to be due to the better string encoding as "length,bytes" rather than per-character processing of a text format in a parser.
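The "length,bytes" point can be sketched in Python (illustrative only, not Jena's actual code): a length-prefixed decoder slices the buffer in one step, while a text parser must inspect every character to find the closing quote and handle escapes.

```python
# Illustrative sketch: why "length,bytes" beats per-character scanning.

def read_length_prefixed(buf, pos):
    """Read a string encoded as <length><bytes> starting at pos."""
    length = buf[pos]                      # 1-byte length for simplicity
    start = pos + 1
    return buf[start:start + length].decode("utf-8"), start + length

def read_quoted(text, pos):
    """Read a double-quoted string by scanning character by character."""
    assert text[pos] == '"'
    out = []
    i = pos + 1
    while text[i] != '"':                  # must inspect every character
        if text[i] == '\\':                # and handle escape sequences
            i += 1
        out.append(text[i])
        i += 1
    return "".join(out), i + 1

binary = bytes([5]) + b"hello" + bytes([5]) + b"world"
s1, next_pos = read_length_prefixed(binary, 0)
s2, _ = read_length_prefixed(binary, next_pos)
print(s1, s2)                              # hello world
```

The binary path is a single slice plus a decode; the text path is a per-character loop, which is where the parsing cost difference comes from.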
What exactly is the use case?
@akuckartz - efficiency of large results.
@afs, I'd be also interested in learning more about the efficiency of Thrift encoding for RDF. Is there any overview of the efficiency gains achieved by this encoding in Jena?
I played with Apache Jena's riot implementation of Thrift:
```
du -h test.ttl
# 9.6M test.ttl
riot --time --output TRDF test.ttl > test.trdf
# test.ttl : 2.70 sec : 153,676 Triples : 56,938.13 per second
du -h test.trdf
# 21M test.trdf
time gzip test.ttl
# real 0m2.347s
# user 0m0.092s
# sys 0m0.007s
du -h test.ttl.gz
# 344K test.ttl.gz
riot --time --sink test.trdf
# test.trdf : 0.80 sec : 153,676 Triples : 191,139.30 per second
riot --time --sink test.ttl.gz
# test.ttl.gz : 1.24 sec : 153,676 Triples : 123,732.69 per second
```
What this simplistic exploration suggests to me is that although Thrift is a bit faster to parse than a GZipped format (0.80 sec vs. 1.24 sec), it requires significantly more bandwidth to transfer (21M vs. 344K). Could this indicate that Thrift is a good fit for cases where bandwidth is plentiful but CPU time is expensive?
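One way to frame the trade-off is a back-of-envelope model of total time = transfer time + parse time, using the sizes and parse times above; the bandwidth figures here are assumptions for illustration, not measurements.

```python
# Back-of-envelope: total time = transfer time + parse time.
# Sizes and parse times are the measurements above; the bandwidths
# are assumptions for illustration.
MB = 1024 * 1024
sizes = {"trdf": 21 * MB, "ttl.gz": 344 * 1024}   # bytes
parse = {"trdf": 0.80, "ttl.gz": 1.24}            # seconds

for bandwidth_mbps in (10, 1000):                 # slow link vs. fast LAN
    bytes_per_sec = bandwidth_mbps * 1e6 / 8
    for fmt in ("trdf", "ttl.gz"):
        total = sizes[fmt] / bytes_per_sec + parse[fmt]
        print(f"{bandwidth_mbps:>4} Mb/s {fmt:>7}: {total:6.2f} s")
```

Under these assumptions the gzipped text wins on the slow link (transfer dominates) and Thrift wins on the fast one (parsing dominates), which matches the intuition in the question.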
Hi @jindrichmynarz - it looks like 150k triples isn't enough to show the effects, because of the Java warmup costs. There are two ways to get figures: either a small Java program that calls riot.main several times, taking the figures from the last run so that class loading and JIT compilation are not included in the timing, or larger data to amortize the class loading and JIT costs. Using larger data is easier. The comment below has figures from using 25m triples - the largest file is fastest.
For returning SPARQL results, compression would be "one time", so in many cases it increases the overall time. Latency is less of a factor than bandwidth. If the system is in a datacenter, as compared to the public internet, obviously the bandwidth is better.
Writing is not faster. Work is done during writing to make reading efficient, and (mainly due to API issues) this can involve extra work on the write path. That said, writing can be done on a separate thread.
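A minimal sketch of that hand-off (hypothetical names, not Jena's API): the producer creates triples while a background thread does the writing, with a bounded queue between them so the producer is not blocked by the write path.

```python
# Sketch: move the (slower) write work onto a background thread,
# with a bounded queue as the hand-off. Names are illustrative.
import queue
import threading

q = queue.Queue(maxsize=1024)
out = []                          # stand-in for the output stream

def writer():
    while True:
        item = q.get()
        if item is None:          # sentinel: end of stream
            break
        out.append(item)          # stand-in for encoding + writing bytes

t = threading.Thread(target=writer)
t.start()
for triple in [("s", "p", i) for i in range(3)]:
    q.put(triple)                 # producer parses / creates triples
q.put(None)
t.join()
print(out)
```

The bounded queue provides back-pressure: if the writer falls behind, the producer blocks on `put` rather than buffering unboundedly.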
Intel® Core™ i7-4770 CPU @ 3.40GHz × 8
32G RAM; about 20G free RAM for file system cache.
Ubuntu 19.04
SSD SATA-3
OpenJDK 64-Bit Server VM (build 11.0.3+7-Ubuntu-1ubuntu1, mixed mode, sharing)
Apache Jena 3.11.0
File | File Size | Rate |
---|---|---|
bsbm-25m.nt | 6.1G | 251 kTPS |
bsbm-25m.nt.gz | 660M | 232 kTPS |
bsbm-25m.trdf | 6.3G | 977 kTPS |
bsbm-25m.trdf.gz | 684M | 757 kTPS |
kTPS = thousands of triples per second. Second run, from the figures below.
`riot --sink` parses the input and creates all the RDF terms and triples, which are then passed to a null output stream.
Run once per file:
```
riot --time --sink bsbm-25m.nt
bsbm-25m.nt : 102.43 sec : 24,997,044 Triples : 244,052.17 per second
riot --time --sink bsbm-25m.nt.gz
bsbm-25m.nt.gz : 107.80 sec : 24,997,044 Triples : 231,892.13 per second
riot --time --sink bsbm-25m.trdf
bsbm-25m.trdf : 26.91 sec : 24,997,044 Triples : 928,981.86 per second
riot --time --sink bsbm-25m.trdf.gz
bsbm-25m.trdf.gz : 32.67 sec : 24,997,044 Triples : 765,067.30 per second
```
Run twice per file, to warm the disk cache; the second figure is the more stable result:
```
riot --time --sink bsbm-25m.nt
bsbm-25m.nt : 99.46 sec : 24,997,044 Triples : 251,335.19 per second
riot --time --sink bsbm-25m.nt
bsbm-25m.nt : 99.29 sec : 24,997,044 Triples : 251,762.99 per second
riot --time --sink bsbm-25m.nt.gz
bsbm-25m.nt.gz : 108.44 sec : 24,997,044 Triples : 230,506.47 per second
riot --time --sink bsbm-25m.nt.gz
bsbm-25m.nt.gz : 107.55 sec : 24,997,044 Triples : 232,424.70 per second
riot --time --sink bsbm-25m.trdf
bsbm-25m.trdf : 26.04 sec : 24,997,044 Triples : 959,800.49 per second
riot --time --sink bsbm-25m.trdf
bsbm-25m.trdf : 25.58 sec : 24,997,044 Triples : 977,325.10 per second
riot --time --sink bsbm-25m.trdf.gz ; riot --time --sink bsbm-25m.trdf.gz
bsbm-25m.trdf.gz : 33.01 sec : 24,997,044 Triples : 757,187.90 per second
riot --time --sink bsbm-25m.trdf.gz ; riot --time --sink bsbm-25m.trdf.gz
bsbm-25m.trdf.gz : 33.00 sec : 24,997,044 Triples : 757,417.33 per second
```
Separate `gzip -d`:
```
gzip -d < bsbm-25m.trdf.gz | riot --time --sink --syntax trdf --
- : 30.50 sec : 24,997,044 Triples : 819,602.09 per second
gzip -d < bsbm-25m.nt.gz | riot --time --sink --syntax nt --
- : 101.90 sec : 24,997,044 Triples : 245,316.78 per second
```
Default gzip settings.
```
time gzip < bsbm-25m.trdf > bsbm-25m.trdf.gz
real 1m27.955s => 284 kTPS
user 1m26.893s
sys 0m1.028s
```
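As a sanity check, the quoted compression rate follows directly from triples divided by wall-clock time:

```python
# 24,997,044 triples compressed in 87.955 s (the "real" time above).
triples = 24_997_044
seconds = 87.955                  # 1m27.955s
rate_ktps = triples / seconds / 1000
print(f"{rate_ktps:.0f} kTPS")    # prints: 284 kTPS
```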
There's also a binary RDF format implemented by RDF4J.
I would also add http://www.rdfhdt.org/ and Flatbuffers https://google.github.io/flatbuffers/
How about EXI which is a bit coded XML variant, related to COAP for constrained devices? I think it's used in some Web of Things developments.
Also might want to consider Apache Arrow
In addition to the RDF4J Binary RDF format, there is also an RDF4J binary format for query results.
@jeenbroekstra , I got a 404 on that first link, but RDF4J Binary RDF format (https://rdf4j.org/documentation/reference/rdf4j-binary/) worked.
> Also might want to consider Apache Arrow
It is also worth mentioning Arrow Flight SQL - an RPC protocol for SQL databases with a strong emphasis on zero-copy data, with Arrow Flight RPC as the data transport and Apache Arrow ADBC on the client side with a columnar API.
Binary encodings of formats with a significant proportion of strings are considerably faster for protocol transfer, especially on reading.
This feature would provide new formats for content types used with SPARQL, using `+binary` on the MIME type, where `+binary` is one of a small number of documented choices. Deciding on one choice is a thankless task. Inventing a new one is both hard and time-consuming, and maintaining one is a burden. Therefore some common existing formats should be used, and SPARQL results described in each of these formats.
Incomplete list of possibilities (major systems only):
Why?
Binary formats are more efficient at scale.
Compression should not be bundled in with this feature. Compression can slow down results transfer and HTTP already has mechanisms to ask for compression.
Previous work
Jena uses Apache Thrift (because of coverage in many programming languages). Speed experiments did not show a significant speed difference.
https://jena.apache.org/documentation/io/rdf-binary.html
Considerations for backward compatibility
None, if content negotiation is correctly performed.
"Good practice" should be to provide the existing text-based types as well.
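That good-practice point can be sketched as simple server-side content negotiation (the type names and matching logic here are illustrative, not a specification): offer a binary results type first, and fall back to an existing text-based type when the client does not accept it.

```python
# Hedged sketch of server-side content negotiation: prefer a binary
# results type when the client accepts it, else fall back to a
# text-based type. Type names are illustrative.

OFFERED = [
    "application/sparql-results+thrift",   # binary type (example)
    "application/sparql-results+json",     # existing text-based type
]

def negotiate(accept_header):
    """Return the first offered type the client accepts, else None."""
    accepted = [part.split(";")[0].strip() for part in accept_header.split(",")]
    for offered in OFFERED:                # OFFERED is in server preference order
        if offered in accepted or "*/*" in accepted:
            return offered
    return None

print(negotiate("application/sparql-results+thrift"))  # binary type chosen
print(negotiate("application/sparql-results+json"))    # text fallback
print(negotiate("*/*"))                                # server preference wins
```

A real implementation would also honour `q` weights; this sketch only shows why correctly-performed negotiation makes the new types backward compatible.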