softdevteam / k2

Benchmark runner [Unmaintained]

Results Storage #2

Open vext01 opened 5 years ago

vext01 commented 5 years ago

Yesterday @ltratt and I talked about how to store our results. We came to the rough conclusion that because the exact structure of the benchmark identifiers and the data are not known ahead of time, a NoSQL database might be the way to go.

I've spent some time reviewing the NoSQL ecosystem. Here are my findings.

[Note that we are only talking about results storage here. Laurie and I discussed yesterday: in K2 we don't want to have to inspect the results to know what to run next, so we'd need some kind of manifest file, perhaps like the one Krun uses. This would be a separate store, though it might be a bonus if it could live in the same database as the results.]

Our Requirements

We want the following things from our database:

Evaluation

Now let's look at off-the-shelf software that comes close.

It didn't take long for me to realise that our requirements are incredibly niche. For example:

To spare your sanity, I'm only going to list the four that best fit our requirements. [I must have looked at more than 20 systems; most are dead in the water for our purposes.]

For each I'm only listing the cons, i.e. why we probably can't use it.

MongoDB: https://www.mongodb.com/

Unqlite: https://unqlite.org/

Ejdb (embeddable JSON database): https://ejdb.org/

SQLite: https://sqlite.org/index.html

A bit more about SQLite

SQLite was recently extended to allow querying of JSON-structured data.

SQLite is a superb, tiny, ubiquitous, embeddable, and concurrent database system released into the public domain. Now that it supports JSON querying, it's an attractive option.

The (potential) problem is that whereas many document-oriented DBs store JSON documents as BSON internally for speed, SQLite does not. However, they say:

The json1 extension does not (currently) support a binary encoding of JSON. Experiments have been unable to find a binary encoding that is significantly smaller or faster than a plain text encoding. (The present implementation parses JSON text at over 1 GB/s.) All json1 functions currently throw an error if any of their arguments are BLOBs because BLOBs are reserved for a future enhancement in which BLOBs will store the binary encoding for JSON.

Intrigued by the idea that JSON parsing could be as fast as decoding a binary format, I started wondering why loading our Krun results takes so long.

I took one of our large results files and timed loading it into memory using Python 2.7. First I loaded the results file from Krun's native .json.bz2 format, then I decompressed the file and loaded the resulting plain .json file (times in seconds):

From compressed JSON  : 15.268
From uncompressed JSON: 3.365

So nearly 80% of the load time is bzip2 decompression. Further, I expect SQLite will parse JSON faster than Python 2.7 does, but I'm not certain.
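For the record, the measurement was roughly this shape (a sketch; the filenames and the use of time.time() are illustrative, not the exact script I ran):

#!/usr/bin/env python2.7

import bz2
import json
import time

# Load directly from Krun's native compressed format.
t0 = time.time()
with bz2.BZ2File("results.json.bz2") as f:
    json.load(f)
print("From compressed JSON  : %.3f" % (time.time() - t0))

# Load the same data after decompressing it up front (bunzip2 -k).
t0 = time.time()
with open("results.json") as f:
    json.load(f)
print("From uncompressed JSON: %.3f" % (time.time() - t0))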

I wonder if we'd get away with using SQLite? I'd like to hear your opinions, and if you think it's worthwhile, I can write a script to generate an SQLite database similar to what we expect to deal with, and then benchmark some queries.

vext01 commented 5 years ago

To give an idea of how you'd query JSON in SQLite:

If we have this schema:

sqlite> .schema 
CREATE TABLE data (json text);

And we've loaded it with example data:

sqlite> select * from data;
{"bench_path": "/benchmarks/fasta", "args": ["python3.7", "${bench_path}"], "pexec": 0, "ipi_wcts": []}
{"bench_path": "/benchmarks/fasta", "args": ["python3.7", "${bench_path}"], "pexec": 1, "ipi_wcts": []}
{"bench_path": "/benchmarks/btrees", "args": ["python3.7", "${bench_path}"], "pexec": 0, "ipi_wcts": []}
{"bench_path": "/benchmarks/btrees", "args": ["python3.7", "${bench_path}"], "pexec": 1, "ipi_wcts": []}
{"bench_path": "/benchmarks/fasta", "args": ["pypy", "${bench_path}"], "pexec": 0, "ipi_wcts": []}
{"bench_path": "/benchmarks/fasta", "args": ["pypy", "${bench_path}"], "pexec": 1, "ipi_wcts": []}

[That structure isn't necessarily what I'm suggesting, but it'll do for example purposes].

[Ignore the empty results lists. Imagine they were populated].
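For completeness, here's how the example data could have been loaded (a sketch; example.db is a hypothetical filename):

import json
import sqlite3

conn = sqlite3.connect("example.db")
conn.execute("CREATE TABLE data (json text)")

# One JSON document per row, matching the dumps above.
records = [
    {"bench_path": "/benchmarks/fasta",
     "args": ["python3.7", "${bench_path}"], "pexec": 0, "ipi_wcts": []},
    {"bench_path": "/benchmarks/fasta",
     "args": ["python3.7", "${bench_path}"], "pexec": 1, "ipi_wcts": []},
    # ... and so on for the other configurations.
]
conn.executemany("INSERT INTO data VALUES (?)",
                 [(json.dumps(r),) for r in records])
conn.commit()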

Get all results for the fasta benchmark:

sqlite> select * from data where json_extract(json, '$.bench_path') like '%fasta%';
{"bench_path": "/benchmarks/fasta", "args": ["python3.7", "${bench_path}"], "pexec": 0, "ipi_wcts": []}
{"bench_path": "/benchmarks/fasta", "args": ["python3.7", "${bench_path}"], "pexec": 1, "ipi_wcts": []}
{"bench_path": "/benchmarks/fasta", "args": ["pypy", "${bench_path}"], "pexec": 0, "ipi_wcts": []}
{"bench_path": "/benchmarks/fasta", "args": ["pypy", "${bench_path}"], "pexec": 1, "ipi_wcts": []}

Get the first process execution of each benchmark:

sqlite> select * from data where json_extract(json, '$.pexec') = 0;
{"bench_path": "/benchmarks/fasta", "args": ["python3.7", "${bench_path}"], "pexec": 0, "ipi_wcts": []}
{"bench_path": "/benchmarks/btrees", "args": ["python3.7", "${bench_path}"], "pexec": 0, "ipi_wcts": []}
{"bench_path": "/benchmarks/fasta", "args": ["pypy", "${bench_path}"], "pexec": 0, "ipi_wcts": []}

gabi-250 commented 5 years ago

Hi,

Thank you for taking the time to write down such a detailed description of your findings!

  • Concurrent access from different processes. Because in some cases it's OK to run benchmarks concurrently.

Maybe this is a bit of a silly question, but wouldn't running benchmarks concurrently affect the measurements?

Ejdb (embeddable JSON database): https://ejdb.org/

  • Not clear if concurrent access from processes is possible.
  • Didn't build for me on either Linux or OpenBSD. Build system also fetches moving targets.
  • Beta quality.

I've managed to build Ejdb, although the instructions on the official page are missing a step. However, I've never used it before, so I'm neutral about it. There's a crate that provides bindings for it, so perhaps it's a (sort of) valid option.

SQLite: https://sqlite.org/index.html

  • Not clear if it will perform well. See below.

SQLite looks promising, and it is probably better than rolling our own storage system anyway. Since it looks like our best option, I think it makes sense to try it out.

ltratt commented 5 years ago

Maybe this is a bit of a silly question, but wouldn't running benchmarks concurrently affect the measurements?

It's not a silly question -- Edd and I had the same debate the other day! What we realised is that sometimes you don't care about wall-clock time. For example, let's say you're only interested in benchmarking memory consumption: you might be quite willing to assume that other processes don't affect any given benchmark, so you can run as many of them as you think you have RAM to do so.

vext01 commented 5 years ago

I've managed to build Ejdb, although the instructions on the official page are missing a step. However, I've never used it before, so I'm neutral about it. There's a crate that provides bindings for it, so perhaps it's a (sort of) valid option.

Yeah, the Rust bindings look well-maintained, but the build system seems crusty and the software itself is beta-quality by its own admission.

SQLite looks promising, and it is probably better than rolling our own storage system anyway. Since it looks like our best option, I think it makes sense to try it out.

Agreed. SQLite is ubiquitous and has proven itself over the last decade or so. It's also widely packaged, trivial to build, and the binary format is future-proofed by SQL dumps (the .dump command in the REPL).

vext01 commented 5 years ago

Where do you stand on this @ltratt?

gabi-250 commented 5 years ago

It's not a silly question -- Edd and I had the same debate the other day! What we realised is that sometimes you don't care about wall-clock time. For example, let's say you're only interested in benchmarking memory consumption: you might be quite willing to assume that other processes don't affect any given benchmark, so you can run as many of them as you think you have RAM to do so.

Interesting! I guess this means we need to store additional information for each benchmark (whether it only measures memory consumption, etc).

vext01 commented 5 years ago

Interesting! I guess this means we need to store additional information for each benchmark (whether it only measures memory consumption, etc).

Or have something like a -j flag which states how many concurrent benchmarks you are willing to allow. This would of course conflict with reboot mode.

But we are perhaps getting ahead of ourselves :) As long as the storage format doesn't block the possibility of concurrent benchmarking from distinct processes (although we might use threads, depending on our design choices), then I'm OK.
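For what it's worth, SQLite's write-ahead-log (WAL) journal mode is designed for this kind of thing: one writer at a time, with readers unblocked. A minimal sketch of what a concurrently-writing pexec might do (the table and record shape are just the ones from my earlier example):

import json
import sqlite3

# Each benchmark process opens its own connection; SQLite serialises
# the writes. WAL mode lets readers proceed while a writer is active.
conn = sqlite3.connect("results.db", timeout=60)  # wait on lock contention
conn.execute("PRAGMA journal_mode=WAL")

record = {"bench_path": "/benchmarks/fasta", "lang": "python3.7",
          "pexec": 0, "ipi_wcts": [0.49, 0.48]}
conn.execute("INSERT INTO data VALUES (?)", (json.dumps(record),))
conn.commit()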

gabi-250 commented 5 years ago

Agreed, perhaps it's a bit early to worry about this!

ltratt commented 5 years ago

Where do you stand on this @ltratt?

If the two of you think you have a solution which satisfies our goals (concurrent access etc.), then I say let's go with it. I don't know enough about the various solutions to have much of an opinion at this point.

vext01 commented 5 years ago

I don't know enough about the various solutions to have much of an opinion at this point.

OK. Note that this is all new to me too! At least we can probably change the storage backend at a later time without too much hassle.

To investigate our performance concerns, why don't I write a script to convert one of our large Krun results files into an equivalent JSON document database in SQLite, and then see: a) how large the file is, b) how long it takes to load all the data into RAM, and c) how long it takes to pluck just one arbitrary pexec's data?
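For concreteness, the conversion script would be roughly this shape (a sketch: I'm assuming Krun's wallclock_times layout, i.e. a dict mapping 'bench:VM:variant' keys to per-pexec lists of iteration times, and a one-record-per-pexec schema):

#!/usr/bin/env python3

import bz2
import json
import sqlite3
import sys

def convert(krun_file, db_file):
    with bz2.open(krun_file) as f:
        results = json.load(f)

    conn = sqlite3.connect(db_file)
    conn.execute("CREATE TABLE data (json TEXT)")

    # One row per process execution.
    rows = []
    for key, pexecs in results["wallclock_times"].items():
        bench, vm, _variant = key.split(":")  # variant ignored in this sketch
        for idx, wcts in enumerate(pexecs):
            rows.append((json.dumps({"bench": bench, "lang": vm,
                                     "pexec": idx, "ipi_wcts": wcts}),))
    conn.executemany("INSERT INTO data VALUES (?)", rows)
    conn.commit()

convert(sys.argv[1], sys.argv[2])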

ltratt commented 5 years ago

If that's not too difficult, then it sounds like a good idea to me.

BTW, one thing just occurred to me. Whatever we do, we need to separate out running benchmarks from recording stuff about them. In other words, if I use k2 to run benchmarks, I probably don't want it loading in an SQLite database before I start benchmarking. I wonder if this suggests we want to explicitly bring back the idea of a manifest file in some way?

vext01 commented 5 years ago

Yes, I agree. I mentioned this earlier:

[Note that we are only talking about results storage here. Laurie and I discussed yesterday: in K2 we don't want to have to inspect the results to know what to run next, so we'd need some kind of manifest file, perhaps like the one Krun uses. This would be a separate store, though it might be a bonus if it could live in the same database as the results.]

Note that an SQLite database probably isn't loaded into memory in its entirety at open time. Then again, we don't know exactly what it does. If we write our own manifest format, then we have full control.

vext01 commented 5 years ago

This post suggests that SQLite loads data on demand, but there is a cache. Since the cache fill will vary over time, I suggest we don't use SQLite for manifests.
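If we do roll our own, even a flat text file would give us full control. A hypothetical sketch (not Krun's actual manifest format): one line per scheduled pexec, with a status flag that is flipped in place as executions complete:

# Hypothetical manifest, one line per scheduled pexec:
#   O fasta:PyPy:0      <- outstanding
#   D fasta:PyPy:0      <- done
# Reading it never pulls more than one line at a time into memory.

def next_outstanding(path):
    """Return the key of the first outstanding pexec, or None."""
    with open(path) as f:
        for line in f:
            status, key = line.rstrip("\n").split(" ", 1)
            if status == "O":
                return key
    return None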

vext01 commented 5 years ago

In the spirit of separating configuration from results, I don't think the results storage should contain too much configuration information.

For example, earlier I had records like:

{"bench_path": "/benchmarks/fasta", "args": ["python3.7", "${bench_path}"], "pexec": 0, "ipi_wcts": []}

The VM arguments are not necessary for the results file. So we'd probably prefer something like:

{"bench_path": "/benchmarks/fasta", "lang": "python3.7, "pexec": 0, "ipi_wcts": []}

[Some part of me thinks the path to the benchmark is an irrelevant detail too, but we've opted to use the benchmark path as an identifier.]

Another thing that's playing on my mind: I used one record per pexec. Another option would be to have one record for all pexecs of a given configuration. The former is better from a concurrency POV; the latter would make for a smaller database.

The latter would look like:

{"bench_path": "/benchmarks/fasta", "lang": "python3.7, "pexecs": [[...], [...], [...], ...]}
vext01 commented 5 years ago

If that's not too difficult, then it sounds like a good idea to me.

OK, I've had a play. This is looking pretty promising!

Here's my database schema:

sqlite> .schema
CREATE TABLE data (json TEXT);

Each record is one process execution, like this:

sqlite> select * from data limit 1;
{"bench": "spectralnorm", "ipi_wcts": [0.490794, 0.478115, 0.478015, 0.478041, ...], "pexec": 0, "lang": "HotSpot"}

I've imported one machine's worth of data from the warmup experiment into such a database:

sqlite> select count(*) from data;
1350

Database size

$ du -h results.db
27M     results.db

This could be smaller if we used one record for all pexecs of a given configuration: much of the size is probably repetition of the field names in the JSON. SQLite does not compress its contents.

If we use one record for all pexecs, then we make the database less concurrency-friendly.

Measuring speed

Here's a little Python script to query the data. It loads all of the data for the matching "filters" into a list before printing the length of the list:

#!/usr/bin/env python3

import sqlite3
import sys

def main(db, filters):
    curs = db.cursor()
    rs = []

    # `filters` is a flat list of alternating key/value pairs from the
    # command line, e.g. ["lang", "PyPy", "bench", "fasta"].
    sql = "SELECT * FROM data"
    wheres = []
    if filters:
        for i in range(len(filters) // 2):
            # Turn each key into a JSON path for json_extract().
            filters[i * 2] = "$.%s" % filters[i * 2]
            wheres.append("json_extract(json, ?) = ?")
        sql += " WHERE %s" % " AND ".join(wheres)

    print(sql, filters)
    for row in curs.execute(sql, filters):
        rs.append(row)

    print(len(rs))

db = sqlite3.connect('results.db')
main(db, sys.argv[1:])

(Need to find a better way to do parameterised queries when you don't know how many WHERE clauses you might need)
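One possible tidy-up (a sketch, same behaviour): pair up the key/value arguments first, then derive the WHERE clause and parameters from the pairs:

def build_query(filters):
    # `filters` is the flat [key1, val1, key2, val2, ...] list from argv.
    pairs = list(zip(filters[::2], filters[1::2]))
    sql = "SELECT * FROM data"
    params = []
    if pairs:
        sql += " WHERE " + " AND ".join(
            ["json_extract(json, ?) = ?"] * len(pairs))
        for key, val in pairs:
            params.extend(["$.%s" % key, val])
    return sql, params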

The timings are from bencher13, which is loaded, but good enough for rough figures.

Load all data

$ multitime -n10 ./load.py
SELECT * FROM data []
1350
...
===> multitime results
1: ./load.py
            Mean        Std.Dev.    Min         Median      Max
real        0.051       0.003       0.041       0.051       0.054
user        0.033       0.005       0.024       0.032       0.044
sys         0.016       0.004       0.004       0.016       0.020

Load all PyPy data

$ multitime -n10 ./load.py lang PyPy
SELECT * FROM data WHERE json_extract(json, ?) = ? ['$.lang', 'PyPy']
180
...
===> multitime results
1: ./load.py lang PyPy
            Mean        Std.Dev.    Min         Median      Max
real        0.077       0.001       0.076       0.077       0.079
user        0.065       0.004       0.056       0.066       0.072
sys         0.010       0.004       0.004       0.008       0.020

Load all PyPy data for the fasta benchmark

$ multitime -n10 ./load.py lang PyPy bench fasta
SELECT * FROM data WHERE json_extract(json, ?) = ? AND json_extract(json, ?) = ? ['$.lang', 'PyPy', '$.bench', 'fasta']
30
...
===> multitime results
1: ./load.py lang PyPy bench fasta
            Mean        Std.Dev.    Min         Median      Max
real        0.082       0.001       0.081       0.082       0.083
user        0.072       0.007       0.056       0.072       0.080
sys         0.008       0.007       0.000       0.008       0.024

Caching

I was worried that in queries like:

SELECT * FROM data WHERE json_extract(json, '$.lang') = 'PyPy' AND json_extract(json, '$.bench') = 'fasta';

we'd end up parsing the JSON twice, once for each json_extract(). Fortunately, SQLite caches the parsed JSON. From the SQLite sources:

static void jsonExtractFunc(...){
  ...
  p = jsonParseCached(ctx, argv, ctx);
  ...
/*
** Obtain a complete parse of the JSON found in the first argument
** of the argv array.  Use the sqlite3_get_auxdata() cache for this
** parse if it is available.  If the cache is not available or if it
** is no longer valid, parse the JSON again and return the new parse,
** and also register the new parse so that it will be available for
** future sqlite3_get_auxdata() calls.
*/
static JsonParse *jsonParseCached(
  sqlite3_context *pCtx,
  sqlite3_value **argv,
  sqlite3_context *pErrCtx
) {
...

Conventional Fields

I wonder if fields that we always expect to be present by convention should get their own columns in the database schema? This would make the database a bit smaller and faster. Then again, maybe performance isn't important given how fast this already is...
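Concretely, the hybrid might look something like this (a sketch; the promoted columns are just the conventional fields from my examples above):

import sqlite3

conn = sqlite3.connect("results.db")
# Promote the conventional fields to real columns (compact, indexable)
# and keep the rest of the record as a JSON document.
conn.execute("""
    CREATE TABLE data (
        bench TEXT,
        lang  TEXT,
        pexec INTEGER,
        json  TEXT
    )""")
# Typical queries filter on bench/lang, so index them.
conn.execute("CREATE INDEX data_bench_lang_idx ON data (bench, lang)")
conn.commit()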

gabi-250 commented 5 years ago

Thanks for trying this out!

$ du -h results.db
27M     results.db

This could be smaller if we used one record for all pexecs of a given configuration: much of the size is probably repetition of the field names in the JSON. SQLite does not compress its contents.

How large is the equivalent results file generated by krun?

gabi-250 commented 5 years ago

I wonder if fields that we always expect to be present by convention should get their own columns in the database schema? This would make the database a bit smaller and faster. Then again, maybe performance isn't important given how fast this already is...

Other than the bench, pexec, and lang, what else do we always expect to be present? I suppose we can't assume that ipi_wcts (or any other measurement) is always present, since users can choose which measurements to include in the results, right?

vext01 commented 5 years ago

How large is the equivalent results file generated by krun?

Compressed: 103MiB
Uncompressed: 510MiB

But bear in mind that the Krun results contain a lot more stuff (core cycles, aperf, mperf, dmesg, audit, etc.), so it's not a fair comparison.

Other than the bench, pexec, and lang, what else do we always expect to be present?

We mustn't forget a machine field. This should always be present.

I suppose we can't assume that ipi_wcts (or any other measurement) is always present, since users can choose which measurements to include in the results, right?

Agreed, I think ;)