This is a follow-up to what we discussed at the last Nano Session.
The idea is to use Jelly, a high-performance RDF serialization format and streaming protocol, for communication between microservices in the next generation of the Nanopub infrastructure.
Jelly can act as a simple serialization format (like N-Triples or Turtle), turning a bunch of triples/quads into bytes. But it can also serialize a stream of RDF graphs/datasets, which is where its main advantage lies. Say we have a series of nanopubs (RDF datasets) – if we serialized them as a bunch of TriG files, we would have to repeat the same prefixes, property names, classes, etc. every time. If we could avoid repeating ourselves, we could reduce the size of the serialized data and also speed things up (less work = faster). See the benchmarks below for numbers.
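To illustrate the intuition with a toy sketch (this is not Jelly's actual encoding, and the quads below are made up): if every quad spells out its full IRIs, the same long strings get serialized over and over, while a lookup table pays for each distinct term once and then references it with a small integer.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class TermTableDemo {
    // Returns {naiveBytes, tableBytes} for a toy stream of 1000 quads that
    // all reuse the same predicate and graph IRIs.
    static long[] sizes() {
        String pred = "http://www.nanopub.org/nschema#hasAssertion";
        String graph = "http://example.org/np/head";
        List<String[]> quads = new ArrayList<>();
        for (int i = 0; i < 1000; i++) {
            quads.add(new String[]{
                "http://example.org/np/" + i, pred,
                "http://example.org/np/" + i + "#assertion", graph});
        }

        long naive = 0; // every term spelled out in full, N-Quads style
        for (String[] q : quads)
            for (String t : q)
                naive += t.getBytes(StandardCharsets.UTF_8).length + 3; // '<', '>', ' '

        // Table encoding: each distinct term is transmitted once, then
        // referenced by a small integer id (real formats use 1-2 byte varints,
        // so this 4-byte reference is pessimistic).
        Map<String, Integer> table = new HashMap<>();
        long encoded = 0;
        for (String[] q : quads)
            for (String t : q) {
                if (table.putIfAbsent(t, table.size()) == null)
                    encoded += t.getBytes(StandardCharsets.UTF_8).length;
                encoded += 4; // the reference itself
            }
        return new long[]{naive, encoded};
    }

    public static void main(String[] args) {
        long[] s = sizes();
        System.out.println("naive=" + s[0] + " table=" + s[1]);
    }
}
```

Even in this crude form, the table encoding comes out roughly half the size; in a long-running stream, where the predicate/class vocabulary stabilizes quickly, the gap only grows.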
Jelly is currently implemented for Jena 5 and RDF4J 5, with full integration into the relevant I/O APIs. The implementation is open-source (Apache 2.0 license) and exhaustively tested in CI/CD. When you just want to serialize one RDF dataset, all you need to do is add a Maven dependency on Jelly and use the standard RDF4J Rio API. That's all, it just works.
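Just to sketch what that looks like in practice – a few lines around the standard Rio API. Caveat: I'm writing the Jelly import and the format constant from memory, so treat them as placeholders and check the Jelly docs for the exact names.

```java
// Minimal sketch: write one nanopub (an RDF4J Model) as Jelly via plain Rio.
// Assumes Jelly's RDF4J module is on the classpath; the import path and the
// format constant below are assumptions, not verified identifiers.
import java.io.FileOutputStream;
import java.io.OutputStream;
import org.eclipse.rdf4j.model.Model;
import org.eclipse.rdf4j.rio.Rio;
import eu.ostrzyciel.jelly.convert.rdf4j.rio.JellyFormat; // assumed package

public class WriteJelly {
    public static void write(Model nanopub) throws Exception {
        try (OutputStream out = new FileOutputStream("nanopub.jelly")) {
            // Jelly plugs in as just another RDFFormat for the standard Rio entry point
            Rio.write(nanopub, out, JellyFormat.JELLY_SMALL_STRICT); // constant name is an assumption
        }
    }
}
```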
To implement streams of RDF datasets, you can either reuse Jelly's jelly-stream module, which is based on Apache Pekko Streams, or write something on your own. jelly-stream can be trivially integrated with gRPC (the jelly-grpc module provides a ready pub/sub gRPC service), Kafka, WebSocket, MQTT, or whatever else is supported by Pekko.
Design
Okay, so where could it be used? In short – anywhere microservices talk to each other. Quick mockup:
Nanopub registry -> registry replication.
We could send new nanopubs in batches – one HTTP GET response would correspond to several new nanopubs.
Alternatively, we could maintain a real-time stream with gRPC (HTTP/2 transport) or WebSocket that would send the new nanopubs exactly at the moment they are added. This would remove polling entirely. Latencies would drop dramatically.
Nanopub registry -> nanopub query
Same mechanism as above
Nanopub registry dumps
Creating a huge nanopublication dump should be much faster with Jelly than with other formats. Loading such a dump when spinning up a new registry server will also be way faster.
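To put a rough number on the real-time stream idea above: with polling every T seconds, a new nanopub sits unnoticed for T/2 on average before the next poll picks it up, while a push stream delivers it after one network hop. A toy model of that (the 60 s poll interval and the latency figures are made-up assumptions, just for scale):

```java
import java.util.Random;

public class PollVsPush {
    // Average delay between "nanopub added" and "replica sees it" under polling.
    // The event lands at a uniformly random point in the poll interval, then
    // waits for the next poll tick plus one fetch round trip.
    static double avgPollingDelayMs(long pollIntervalMs, double rttMs, int samples, long seed) {
        Random rnd = new Random(seed);
        double total = 0;
        for (int i = 0; i < samples; i++) {
            double eventTime = rnd.nextDouble() * pollIntervalMs; // somewhere in the interval
            total += (pollIntervalMs - eventTime) + rttMs;        // wait for tick + fetch
        }
        return total / samples;
    }

    public static void main(String[] args) {
        // Poll every 60 s with a 20 ms RTT vs. a push stream with 10 ms one-way latency
        double polling = avgPollingDelayMs(60_000, 20, 100_000, 42);
        double push = 10; // push arrives after a single network hop
        System.out.printf("polling avg: %.0f ms, push: %.0f ms%n", polling, push);
    }
}
```

Under these (hypothetical) numbers the average lag drops from tens of seconds to milliseconds, which is what "removing polling entirely" buys you.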
Technical
I'd rate this as "very doable", because Nanopub Registry already uses RDF4J 5.0.2, with which Jelly is fully compatible. I've already implemented similar stuff for another RDF4J app (sorry, private code, can't share it), and it works fine with MQTT, Kafka, and long-running gRPC streams.
Benchmarks
The benchmarks on the Jelly website were conducted with a mix of 13 different datasets, one of them being a nanopublication dump. Here I post the disaggregated results for the nanopub dataset only.
The scenario considered here is "grouped RDF streaming", so transmitting a series of discrete RDF datasets. In our case, 1 dataset = 1 nanopub. The benchmarks without the network stack were repeated 15 times, first 5 runs discarded to account for JVM warmup. With network: 8 runs, first 3 discarded.
Hardware: AMD Ryzen 9 7900 (12-core, 24-thread, 5.0 GHz); 64 GB RAM (DDR5 5600 MT/s). The disk was not used during the benchmarks (all data was in memory). The throughput benchmarks are single-threaded, but the JVM was allowed to use all available cores for garbage collection, JIT compilation, and other tasks.
Software: Linux kernel 6.10.11, Oracle GraalVM 23.0.1+11.1, Apache Jena 5.2.0, Eclipse RDF4J 5.0.2, Jelly-JVM 2.2.2. Benchmark code: https://github.com/Jelly-RDF/jvm-benchmarks/tree/dd58f5de0916c1223ca115052567c1fb39f4cd62
Serialization (writing) throughput
Serializing 100k nanopubs from a series of Jena DatasetGraph / RDF4J Model objects to a null byte stream.
Higher is better.
Jena:
RDF4J:
Deserializing (parsing) throughput
Deserializing 100k nanopubs from a memory-backed byte stream to a series of Iterable[Statement], where each iterable is one RDF dataset. Constructing an RDF4J Model object is not included here, because that involves creating hashmaps and whatnot, and is not the concern of the serialization format itself.
Jena:
RDF4J:
Serialized representation size
Serializing 100k nanopubs from a series of Jena DatasetGraph / RDF4J Model objects to a byte-counting stream.
Less is better.
Streaming over the network – Kafka
One producer sending 10k nanopubs over Kafka (1 RDF dataset = 1 Kafka message) to one consumer. All software is on the same host (unlimited bandwidth). I have this benchmark only for Jena, but the results should be very similar for RDF4J.
Higher is better.
Same, but with the producer's network bandwidth limited to 100 Mbit/s and a 10 ms one-way latency. The consumer had unlimited bandwidth to the broker.
These results aren't that great, but it's possible to get more with gRPC. Also, you could use better compression than gzip.
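For a feel of how much redundancy is left on the wire: even plain gzip (the only codec used in these tests) squeezes a repetitive TriG-like payload down dramatically. A toy measurement with made-up data, JDK only – zstd is not in the JDK, but it typically reaches similar ratios at much higher speed, which is why I'd prefer it here:

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPOutputStream;

public class GzipDemo {
    // Highly repetitive TriG-like payload: the same predicate IRI on every line
    static byte[] payload() {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < 1000; i++)
            sb.append("<http://example.org/np/").append(i)
              .append("> <http://www.nanopub.org/nschema#hasAssertion> <http://example.org/np/")
              .append(i).append("#assertion> .\n");
        return sb.toString().getBytes(StandardCharsets.UTF_8);
    }

    // Size of the payload after gzip compression at default settings
    static int gzipSize(byte[] data) throws Exception {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(bos)) {
            gz.write(data);
        }
        return bos.size();
    }

    public static void main(String[] args) throws Exception {
        byte[] raw = payload();
        System.out.println("raw=" + raw.length + " gzip=" + gzipSize(raw));
    }
}
```

The ratio looks great on toy data, but the CPU cost of gzip is exactly what caps throughput on the bandwidth-limited runs above – hence the interest in a faster codec.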
Streaming over the network – gRPC
Same case as with Kafka, but over a direct point-to-point gRPC connection. There are only results for Jelly here: gRPC is more or less directly integrated with Protocol Buffers, so it would be pretty hard to plug the other formats into gRPC.
Unlimited network:
100 Mbit/s, 10 ms one-way latency:
My involvement
I would be glad to help with implementing this. I have already implemented Jelly in a few Jena/RDF4J apps, so this shouldn't be too hard.
If you want me to test Jelly with a specific network environment (bandwidth, latency), streaming protocol, or compression settings, let me know. Currently the tests only use gzip, but it's a horrible compression algorithm. I think the best combination would be zstd + gRPC.
So...
What do you think?