knowledgepixels / nanopub-registry

Consider using Jelly for inter-service communication #29

Open Ostrzyciel opened 2 hours ago

Ostrzyciel commented 2 hours ago

This is a follow-up to what we discussed at the last Nano Session.

The idea is to use Jelly, a high-performance RDF serialization format and streaming protocol, for communication between microservices in the next generation of the Nanopub infrastructure.

Jelly can act as a simple serialization format (like N-Triples or Turtle), turning a bunch of triples/quads into bytes. But it can also serialize a stream of RDF graphs/datasets, which is where its main advantage lies. Let's say we have a series of nanopubs (RDF datasets). If we serialized them as a bunch of TriG files, we would have to repeat the same prefixes, property names, classes, etc. in every file. If we did not have to repeat ourselves, we could reduce the size of the serialized data and also speed things up (less work = faster). See the benchmarks below for numbers.
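To make the repetition argument concrete, here is a toy sketch in plain Java. It is not Jelly's actual encoding (Jelly is a binary format and works differently); it only counts the bytes spent on a prefix header when it is written once per document versus once per stream. The class name, the two prefixes, and the dummy bodies are assumptions made up for illustration.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.List;

// Toy illustration only: NOT how Jelly encodes data.
final class PrefixRepetitionSketch {
    // Hypothetical header that every standalone TriG file would repeat.
    static final String PREFIXES =
        "@prefix np: <http://www.nanopub.org/nschema#> .\n"
      + "@prefix prov: <http://www.w3.org/ns/prov#> .\n";

    static int utf8Length(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    // Each document carries its own copy of the header.
    static long standaloneBytes(List<String> bodies) {
        long total = 0;
        for (String body : bodies) {
            total += utf8Length(PREFIXES) + utf8Length(body);
        }
        return total;
    }

    // The header is written once, then only the bodies follow.
    static long sharedHeaderBytes(List<String> bodies) {
        long total = utf8Length(PREFIXES);
        for (String body : bodies) {
            total += utf8Length(body);
        }
        return total;
    }

    static long savedBytes(List<String> bodies) {
        return standaloneBytes(bodies) - sharedHeaderBytes(bodies);
    }

    public static void main(String[] args) {
        // Dummy bodies standing in for three nanopubs.
        List<String> bodies = Arrays.asList("body-1\n", "body-2\n", "body-3\n");
        System.out.println("standalone: " + standaloneBytes(bodies) + " bytes");
        System.out.println("shared:     " + sharedHeaderBytes(bodies) + " bytes");
        System.out.println("saved:      " + savedBytes(bodies) + " bytes");
    }
}
```

For a stream of n documents the saving in this toy model is (n - 1) times the header size, so it grows linearly with the number of nanopubs in the stream.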

Jelly is currently implemented for Jena 5 and RDF4J 5, with full integration with the relevant I/O APIs. The implementation is open source (Apache 2.0 license) and exhaustively tested in CI/CD. When you just want to serialize one RDF dataset, all you need to do is add a Maven dependency on Jelly and use the standard RDF4J Rio API. That's all, it just works.

To implement streams of RDF datasets, you can either reuse Jelly's jelly-stream module, which is based on Apache Pekko Streams, or write something of your own. jelly-stream can be trivially integrated with gRPC (the jelly-grpc module provides a ready-made pub/sub gRPC service), Kafka, WebSocket, MQTT, or whatever else is supported by Pekko.

Design

Okay, so where could it be used? In short: anywhere there are microservices talking to each other. Quick mockup:

nanopub_jelly_mockup

Technical

I'd rate this as "very doable", because Nanopub Registry already uses RDF4J 5.0.2, with which Jelly is fully compatible. I've already implemented similar stuff for another RDF4J app (sorry, private code, can't share it), and it works fine with MQTT, Kafka, and long-running gRPC streams.

Benchmarks

The benchmarks on the Jelly website were conducted with a mix of 13 different datasets, one of which is a nanopublication dump. Here I post the disaggregated results for the nanopub dataset only.

The scenario considered here is "grouped RDF streaming", that is, transmitting a series of discrete RDF datasets. In our case, 1 dataset = 1 nanopub. The benchmarks without the network stack were repeated 15 times, with the first 5 runs discarded to account for JVM warmup. With network: 8 runs, first 3 discarded.
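The warmup-discarding step can be sketched as follows. This is a minimal illustration, not the code from the benchmark repository: the class and method names are hypothetical, and using the arithmetic mean as the aggregate is an assumption (the text above does not say how the remaining runs are summarized).

```java
import java.util.Arrays;

// Hypothetical helper; the real benchmark code may aggregate differently.
final class WarmupDiscardSketch {
    // Mean of the measurements left after dropping the first `warmupRuns` entries.
    static double meanAfterWarmup(double[] runs, int warmupRuns) {
        if (warmupRuns < 0 || warmupRuns >= runs.length) {
            throw new IllegalArgumentException("warmupRuns must be in [0, runs.length)");
        }
        return Arrays.stream(runs, warmupRuns, runs.length)
                     .average()
                     .getAsDouble(); // safe: at least one run remains
    }

    public static void main(String[] args) {
        // Made-up throughput numbers: the first two runs are slow (cold JVM).
        double[] runs = {1.0, 1.0, 10.0, 20.0, 30.0};
        System.out.println(meanAfterWarmup(runs, 2));
    }
}
```

With the setup described above, the call would be `meanAfterWarmup(runs, 5)` on 15 measurements for the non-network benchmarks and `meanAfterWarmup(runs, 3)` on 8 measurements for the network ones.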

Hardware: AMD Ryzen 9 7900 (12-core, 24-thread, 5.0 GHz); 64 GB RAM (DDR5 5600 MT/s). The disk was not used during the benchmarks (all data was in memory). The throughput benchmarks are single-threaded, but the JVM was allowed to use all available cores for garbage collection, JIT compilation, and other tasks.

Software: Linux kernel 6.10.11, Oracle GraalVM 23.0.1+11.1, Apache Jena 5.2.0, Eclipse RDF4J 5.0.2, Jelly-JVM 2.2.2. Benchmark code: https://github.com/Jelly-RDF/jvm-benchmarks/tree/dd58f5de0916c1223ca115052567c1fb39f4cd62

Serialization (writing) throughput

Serializing 100k nanopubs from a series of Jena DatasetGraph / RDF4J Model objects to a null byte stream.

Higher is better.

Jena:

Pasted image 20241119180925

RDF4J:

Pasted image 20241119180645

Deserialization (parsing) throughput

Deserializing 100k nanopubs from a memory-backed byte stream to a series of Iterable[Statement], where each iterable is one RDF dataset. Constructing an RDF4J Model object is not included here, because that involves creating hashmaps and whatnot and is not the concern of the serialization format itself.

Jena:

Pasted image 20241119181003

RDF4J:

Pasted image 20241119180742

Serialized representation size

Serializing 100k nanopubs from a series of Jena DatasetGraph / RDF4J Model objects to a byte-counting stream.

Lower is better.

Pasted image 20241119181511

Streaming over the network – Kafka

One producer sending 10k nanopubs over Kafka (1 RDF dataset = 1 Kafka message) to one consumer. All software is on the same host (unlimited bandwidth). I have this benchmark only for Jena, but the results should be very similar for RDF4J.

Higher is better.

Pasted image 20241119181840

Same, with producer network bandwidth limited to 100 Mbit/s, with a 10 ms one-way latency. The consumer had unlimited network to the broker.

Pasted image 20241119181942
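As a back-of-envelope aid for reading the bandwidth-limited case: once the link is saturated, throughput is capped by link speed divided by the serialized size of each dataset, which is why a smaller representation translates directly into more nanopubs per second. The sketch below computes that upper bound. The class name and the example sizes are assumptions, not measured values, and the bound ignores latency, protocol overhead, and compression.

```java
// Hypothetical helper for a rough upper bound; not part of the benchmark code.
final class BandwidthBoundSketch {
    // Upper bound on datasets per second for a link of the given speed.
    static double maxDatasetsPerSecond(double linkMbitPerSecond, double avgSerializedBytes) {
        if (linkMbitPerSecond <= 0 || avgSerializedBytes <= 0) {
            throw new IllegalArgumentException("arguments must be positive");
        }
        double bytesPerSecond = linkMbitPerSecond * 1_000_000.0 / 8.0;
        return bytesPerSecond / avgSerializedBytes;
    }

    public static void main(String[] args) {
        // Made-up average sizes per nanopub, for illustration only.
        double[] sizes = {500.0, 2000.0};
        for (double size : sizes) {
            System.out.println(size + " bytes/nanopub -> at most "
                + maxDatasetsPerSecond(100.0, size) + " nanopubs/s");
        }
    }
}
```

Plugging in the per-format sizes from the "Serialized representation size" section gives a ceiling for each format on the 100 Mbit/s link.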

These results aren't that great, but it's possible to get more with gRPC. Also, you could use better compression than gzip.

Streaming over the network – gRPC

Same case as with Kafka, but this is a direct point-to-point connection using gRPC. Here there are only results for Jelly, because gRPC is more or less directly integrated with Protocol Buffers, so it would be pretty hard to integrate the other formats with gRPC.

Unlimited network:

Pasted image 20241119185600

100 Mbit/s, 10 ms one-way latency:

Pasted image 20241119185606

My involvement

I would be glad to help with implementing this. I have already implemented Jelly in a few Jena/RDF4J apps, so this shouldn't be too hard.

If you want me to test Jelly with a specific network environment (bandwidth, latency), streaming protocol, or compression settings, let me know. Currently the tests only use gzip, but it's a horrible compression algorithm. I think the best combination would be zstd + gRPC.

So...

What do you think?

Ostrzyciel commented 2 hours ago

@tkuhn pinging, just in case :)