Benchmark

lweides commented 11 months ago

Setup JMH
Define test #3
Define benchmark cases:
- read
- write (initialization)
Benchmark environment
- CPU
- Disk
- RAM
Run benchmarks
Analyze benchmark results
Visualize benchmark results

Eliasrpx commented 11 months ago

optional: introduce custom metrics:

bytes read
false positives
disk space: number of fields
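
One way the "bytes read" metric could be collected is to wrap whatever stream the reader uses in a counting decorator. The sketch below uses only the JDK; `CountingInputStream` and `getBytesRead` are hypothetical names, and it assumes the column store reads through an `InputStream`, which this thread does not confirm.

```java
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;

// Hypothetical helper (not part of the project): counts the bytes
// that are actually pulled from the wrapped stream.
public class CountingInputStream extends FilterInputStream {
    private long bytesRead = 0;

    public CountingInputStream(InputStream in) {
        super(in);
    }

    @Override
    public int read() throws IOException {
        int b = super.read();
        if (b >= 0) {
            bytesRead++; // -1 signals end of stream and is not counted
        }
        return b;
    }

    @Override
    public int read(byte[] buf, int off, int len) throws IOException {
        int n = super.read(buf, off, len);
        if (n > 0) {
            bytesRead += n;
        }
        return n;
    }

    // Total number of bytes returned to the caller so far.
    // Bytes passed over with skip() are deliberately not counted.
    public long getBytesRead() {
        return bytesRead;
    }
}
```

A benchmark could report `getBytesRead()` after each run as a secondary metric next to the measured time.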

lweides commented 10 months ago

Use cases to benchmark (may be extended in the future):

filter for (possibly unique) ids, characterized by a large data set and a small result set
filter for strings with supported operations
filter by timeframe
read a small subset of available columns
read all available columns
input bytes vs. output bytes - compression
ingest data of relatively stable column set
ingest data of relatively unstable column set

All benchmarks will be performed with JMH.

Filter for ids

Measure the performance of filtering for sparsely occurring ids in seconds and bytes read (if possible).

Filter for strings

Filter for strings with the supported operations provided by our API (IS, STARTS_WITH, ENDS_WITH, CONTAINS). Measure the performance in seconds and bytes read (if possible).
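
To make the four operations concrete, here is a minimal sketch of what each one could mean for a single value. The operation names come from the comment above; the class name `StringFilterSketch`, the method signatures, and the case-sensitive semantics are assumptions and not the project's actual API.

```java
import java.util.List;
import java.util.stream.Collectors;

// Illustrative only: the real API of the column store may differ.
public class StringFilterSketch {

    public enum Op {
        IS, STARTS_WITH, ENDS_WITH, CONTAINS;

        // Assumed semantics: case-sensitive comparison against the operand.
        public boolean matches(String value, String operand) {
            switch (this) {
                case IS:
                    return value.equals(operand);
                case STARTS_WITH:
                    return value.startsWith(operand);
                case ENDS_WITH:
                    return value.endsWith(operand);
                case CONTAINS:
                    return value.contains(operand);
                default:
                    throw new AssertionError(this);
            }
        }
    }

    // Naive full scan; useful as a correctness baseline for the benchmark.
    public static List<String> filter(List<String> values, Op op, String operand) {
        return values.stream()
                .filter(v -> op.matches(v, operand))
                .collect(Collectors.toList());
    }
}
```

A full scan like this can also serve as the reference result when counting false positives of the real filter.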

Filter by timeframe

Filter on two long columns, treating them as start_time and end_time. The available filters are STARTS_IN, ENDS_IN, and OVERLAPS. Measure the performance in seconds and bytes read (if possible).
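
The three timeframe filters can be read as interval predicates. The sketch below is one plausible interpretation: the filter names are from the comment above, while the class name, the parameter names (`from`, `to`), and the inclusive bounds are assumptions.

```java
// Illustrative only: bounds are assumed to be inclusive on both ends.
public class TimeframeFilterSketch {

    public enum Op {
        STARTS_IN, ENDS_IN, OVERLAPS;

        // start/end: the record's start_time and end_time columns.
        // from/to:   the queried timeframe.
        public boolean matches(long start, long end, long from, long to) {
            switch (this) {
                case STARTS_IN:
                    return from <= start && start <= to;
                case ENDS_IN:
                    return from <= end && end <= to;
                case OVERLAPS:
                    // Two intervals overlap unless one ends before the other begins.
                    return start <= to && end >= from;
                default:
                    throw new AssertionError(this);
            }
        }
    }
}
```

Under these assumptions, a record spanning 5 to 15 queried with the timeframe 10 to 20 matches ENDS_IN and OVERLAPS but not STARTS_IN.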

Read a small subset of available columns

Read all records, but only a small subset of the available columns, and measure the performance in seconds and bytes read (if possible).

Read all available columns

Read all records with all available columns and measure the performance in seconds and bytes read (if possible).

Effectiveness of compression

Measure the ratio of input bytes vs. output bytes.
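
The ratio itself is a simple quotient. The sketch below computes it and uses GZIP from the JDK purely as a stand-in compressor so the example is runnable; the column store's own output size would take the place of `gzipSize`. All names here are hypothetical.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.zip.GZIPOutputStream;

// Illustrative only: GZIP stands in for the column store's on-disk format.
public class CompressionRatioSketch {

    // Input bytes divided by output bytes; values above 1.0 mean the data shrank.
    public static double ratio(long inputBytes, long outputBytes) {
        if (outputBytes <= 0) {
            throw new IllegalArgumentException("outputBytes must be positive");
        }
        return (double) inputBytes / outputBytes;
    }

    // Size in bytes of the GZIP-compressed input (stand-in for the written file size).
    public static long gzipSize(byte[] input) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (GZIPOutputStream gzip = new GZIPOutputStream(out)) {
            gzip.write(input);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.size();
    }
}
```

For the real benchmark, the input size would be the size of the raw dataset and the output size the total size of the files written by the store.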

Ingest data of relatively stable column set

Ingest a dataset where the majority of records consist of the same columns. Measure both ingest performance in seconds and output bytes.

Ingest data of relatively unstable column set

Ingest a dataset where the majority of records consist of different columns, or where many columns have no data (null values). Measure both ingest performance in seconds and output bytes.
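
Synthetic input for the two ingest cases could be produced by one generator with a single knob. In the sketch below, every column is present in a record with probability `presence`: a value near 1.0 approximates a stable column set, a low value an unstable one with many missing (null) fields. The class name, the record representation as a `Map`, and the parameter names are assumptions for illustration only.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Random;

// Illustrative only: records are modeled as column-name -> value maps.
public class ColumnSetGeneratorSketch {

    // presence: probability in [0, 1] that a given column has a value in a record.
    // seed:     fixed seed so that benchmark runs see identical data.
    public static List<Map<String, Long>> generate(int records, int columns,
                                                   double presence, long seed) {
        Random random = new Random(seed);
        List<Map<String, Long>> result = new ArrayList<>(records);
        for (int r = 0; r < records; r++) {
            Map<String, Long> record = new LinkedHashMap<>();
            for (int c = 0; c < columns; c++) {
                if (random.nextDouble() < presence) {
                    record.put("col" + c, random.nextLong());
                }
            }
            result.add(record);
        }
        return result;
    }
}
```

Using a fixed seed keeps the stable and unstable datasets reproducible across benchmark runs, so differences in ingest time and output bytes come from the column layout rather than from the data.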

lweides commented 10 months ago

Examples for the use-cases described above:

Filter for ids: collect and build a trace from individual spans
Filter for strings: search for exceptions in logs
Filter by timeframe: collect spans / traces / logs from a specific timeframe
Read a small subset of available columns: Read a subset of fields of a span / a trace
Read all available columns: Read all fields of a span / a trace
Effectiveness of compression: cost of storage
Ingest data of relatively stable column set: ingest of logs / spans from a single technology
Ingest data of relatively unstable column set: ingest spans from multiple technologies / JSON

Eliasrpx commented 8 months ago

Elias:

filter for strings with supported operations
Read all available columns
Ingest data of relatively unstable column set

lweides / column-store

Benchmark #1