Open shoffmeister opened 2 months ago
Thanks @shoffmeister. I believe some of the overhead is from going through the DuckDB JDBC driver. I'll try to look for perf optimizations when I get a chance.
FWIW, I just noticed incorrect reporting from me: this was a single-broker, 12-partition configuration. The kwack process was limited to consuming about 120% CPU (while the broker was bored).

I'll run this against a multi-broker cluster and see whether this would scale out to 300%-ish CPU (which would then use 3 of the available 16 CPUs).
Given a three-node Kafka cluster in KRaft mode, official Apache Kafka 3.8 "native" images (i.e. GraalVM), the performance characteristics of kwack do not change.

On the kwack process, CPU maxes out at 120%. The Kafka brokers themselves are very bored. My (virtual) box has excess physical memory left.

Screenshot from running btop:

Partial screenshot from visualvm:
The current implementation does

```java
sql = "INSERT INTO '" + topic + "' VALUES (" + String.join(",", paramMarkers) + ")";
PreparedStatement stmt = stmts.computeIfAbsent(sql, s -> {
    // ... (prepares and caches one PreparedStatement per distinct INSERT)
});
```
Perhaps the Appender could be useful - see https://duckdb.org/docs/api/java.html#appender
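For illustration, a minimal sketch of what an Appender-based insert path could look like with the DuckDB JDBC driver; the table name and columns here are made up, not kwack's actual schema:

```java
import java.sql.DriverManager;
import java.sql.Statement;

import org.duckdb.DuckDBAppender;
import org.duckdb.DuckDBConnection;

// Minimal sketch of the DuckDB JDBC Appender; table name and columns
// are illustrative assumptions, not kwack's actual schema.
public class AppenderSketch {
    public static void main(String[] args) throws Exception {
        try (DuckDBConnection conn =
                (DuckDBConnection) DriverManager.getConnection("jdbc:duckdb:")) {
            try (Statement s = conn.createStatement()) {
                s.execute("CREATE TABLE records (key VARCHAR, value VARCHAR)");
            }
            // One appender per table; rows go straight into DuckDB's buffers,
            // bypassing per-row SQL parsing and parameter binding.
            try (DuckDBAppender appender =
                    conn.createAppender(DuckDBConnection.DEFAULT_SCHEMA, "records")) {
                for (int i = 0; i < 1_500_000; i++) {
                    appender.beginRow();
                    appender.append("key-" + i);
                    appender.append("value-" + i);
                    appender.endRow();
                }
            } // close() flushes any remaining buffered rows
        }
    }
}
```

The point of the Appender is that it skips the per-row INSERT machinery entirely, which is where the prepared-statement path above spends much of its time.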
I have created a very simple Python script which consumes my experimental topic (see above) with maximum concurrency. Using that script, dumping (key, value) of those 1.5 million Kafka records into DuckDB rows wholesale as raw strings takes 17 seconds.
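The experiment itself was in Python; a rough Java/JDBC equivalent of that raw-string insert, assuming standard JDBC batching in the DuckDB driver and an illustrative table layout and batch size, might look like:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.Statement;

// Rough Java/JDBC equivalent of the Python experiment: insert (key, value)
// pairs wholesale as raw strings, batching to amortize per-statement cost.
// Table layout, record contents, and batch size are assumptions.
public class RawInsertSketch {
    public static void main(String[] args) throws Exception {
        try (Connection conn = DriverManager.getConnection("jdbc:duckdb:")) {
            try (Statement s = conn.createStatement()) {
                s.execute("CREATE TABLE raw (key VARCHAR, value VARCHAR)");
            }
            try (PreparedStatement ps =
                    conn.prepareStatement("INSERT INTO raw VALUES (?, ?)")) {
                for (int i = 0; i < 1_500_000; i++) {
                    ps.setString(1, "key-" + i);
                    ps.setString(2, "{\"some_field\": " + i + "}");
                    ps.addBatch();
                    if (i % 10_000 == 0) {
                        ps.executeBatch(); // flush a batch of buffered rows
                    }
                }
                ps.executeBatch();
            }
        }
    }
}
```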
I would guess that performance functionally comparable to what kwack does could go up to 25 seconds. The increase would be due to parsing value as JSON, and pumping all record data into a fitting table structure.
Case in point: it seems as if table insert performance in DuckDB is dominated by the amount of data written. Converting to JSON and then inserting only a single field from that JSON comes in at 15 seconds, two seconds faster than the raw string insert (the raw string has about 5100 characters).
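The script did that conversion in the Python client; an alternative way to express the same single-field experiment, sketched here assuming the raw table from the previous snippet and a hypothetical field name, is to let DuckDB do the extraction:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

// Hypothetical variant of the single-field experiment: keep the raw value
// strings in DuckDB and let json_extract_string pull out one field.
// Table and field names are assumptions, not kwack's schema.
public class JsonExtractSketch {
    public static void main(String[] args) throws Exception {
        try (Connection conn = DriverManager.getConnection("jdbc:duckdb:");
             Statement s = conn.createStatement()) {
            s.execute("CREATE TABLE raw (key VARCHAR, value VARCHAR)");
            s.execute("INSERT INTO raw VALUES ('k1', '{\"some_field\": \"v1\"}')");
            // Materialize a narrow table holding only the extracted field.
            s.execute("CREATE TABLE slim AS SELECT key, "
                + "json_extract_string(value, '$.some_field') AS some_field "
                + "FROM raw");
        }
    }
}
```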
FWIW, this is meant as a very naïve sanity check on performance potential, not meant to criticize kwack.
The Appender functionality exposed by the JDBC driver is brutally fast.
Two challenges:

- support for complex types, struct et al.

Final update for now, here: I think the current public wisdom on Appender is collected around https://discord.com/channels/909674491309850675/1148659944669851849/1284527414524772384
https://sourcegraph.com/github.com/duckdb/duckdb/-/blob/test/api/capi/test_capi_data_chunk.cpp is the best documentation available, and https://github.com/Giorgi/DuckDB.NET/blob/develop/DuckDB.NET.Data/DuckDBAppender.cs / https://github.com/Giorgi/DuckDB.NET/blob/develop/DuckDB.NET.Data/Internal/Reader/StructVectorDataReader.cs show how data chunking can be done.

Fundamentally, a bit of uncharted territory :)
Thanks @shoffmeister. There is another issue which might impede progress: it seems that the kwack tests hang when upgrading to 1.1.0. I've narrowed it down to a change in https://github.com/duckdb/duckdb-java/commit/55a5d7bc57f0a6894f8bb7b31084a74c9b42a34e (the previous commit, https://github.com/duckdb/duckdb-java/commit/53fdd8396e3fbd539ee99865e4ebf912545c3d99, works fine) but have not gotten any further.
The deadlock causing the kwack tests to hang has been identified as https://github.com/duckdb/duckdb-java/issues/101
I am experimenting with a single-partition Kafka topic, on a local Kafka broker, containing synthetic test data (following a complex schema) at the scale of 1.5 million records.
I notice a large performance difference between

- using kwack to ingest from the Kafka topic, given a simple JSON Schema ("kwack"), and
- dumping the topic content to a file, followed by loading it with DuckDB's read_json_auto ("read_json_auto").

Results: the read_json_auto path is dramatically faster (at 6.6 GB RSS). Is that the "price to pay" for native Kafka interconnect?
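For context, a minimal sketch of what the "read_json_auto" side of this comparison could look like through the same JDBC driver, assuming the topic has already been dumped to a newline-delimited JSON file (records.jsonl is an assumed name):

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

// Sketch of the "read_json_auto" scenario: the Kafka topic has already been
// dumped to records.jsonl (assumed name), one JSON document per line.
public class ReadJsonAutoSketch {
    public static void main(String[] args) throws Exception {
        try (Connection conn = DriverManager.getConnection("jdbc:duckdb:");
             Statement s = conn.createStatement()) {
            // DuckDB infers the table schema from the JSON documents themselves.
            s.execute("CREATE TABLE records AS "
                + "SELECT * FROM read_json_auto('records.jsonl')");
        }
    }
}
```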