Closed rishiloyola closed 7 years ago
Benchmark
Time taken to insert 1000 documents: 5 nanoseconds
Time taken for a database lookup: 1 nanosecond
File details:
benchmark.sh - script to measure the time intervals
write.go - generates 1000 documents for the benchmarking
Rishi, I should be more explicit with this PR. The test you did is fine, but not really representative of real life. You generated a "tiny" database with single dataset/file names. Instead we need to test a large database with unique combinations of datasets/blocks/files. Here is what we need:
organize a loop over O(1K) datasets
Such a loop will produce O(1K) unique datasets, where each dataset will have 1K unique blocks and each block will have 1K unique files. The total number of files will be 1K × 1K × 1K, etc. Then you may index the database on dataset, block, and file columns. Check out its growth.
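The nested generation loop can be sketched in Go along these lines (the naming scheme below is made up purely for illustration and does not match transfer2go's actual dataset/block/file conventions; counts are parameters so you can start small):

```go
package main

import "fmt"

// genNames produces nDatasets unique datasets, each with nBlocks unique
// blocks, and each block with nFiles unique files, calling emit for every
// (dataset, block, file) triple. Names here are hypothetical placeholders.
func genNames(nDatasets, nBlocks, nFiles int, emit func(dataset, block, file string)) {
	for d := 0; d < nDatasets; d++ {
		dataset := fmt.Sprintf("/dataset-%04d", d)
		for b := 0; b < nBlocks; b++ {
			block := fmt.Sprintf("%s#block-%04d", dataset, b)
			for f := 0; f < nFiles; f++ {
				file := fmt.Sprintf("%s/file-%04d.root", block, f)
				emit(dataset, block, file)
			}
		}
	}
}

func main() {
	total := 0
	genNames(10, 10, 10, func(d, b, f string) { total++ })
	// with 1K at each level this becomes 1K x 1K x 1K file records
	fmt.Println("generated", total, "file records")
}
```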
Best, Valentin.
I tried to generate and store 1K × 1K × 1K file records, but my system cannot handle that much data. It's almost 25 GB.
Rishi, are you limited by disk space or CPU usage?
In any case, you can reduce the number of blocks and files to 100 each; if that still does not fit on your system, reduce the number of datasets to 100 as well. That gives you 100 × 100 × 100.
It is taking more than half an hour to generate this data, and whenever I try to load it into SQLite my system hangs. I think this is related to CPU usage; I have an old Mac.
You should try bulk inserts. For instance, generate 100 dataset names and keep them around in a list. From them, generate either a single SQL statement via UNION, or multiple INSERT statements within a single transaction, e.g. https://stackoverflow.com/questions/1609637/is-it-possible-to-insert-multiple-rows-at-a-time-in-an-sqlite-database or similar discussions.
Then you can do the same for blocks: generate 100 blocks for a single dataset and insert that chunk. Repeat for every dataset.
The benchmark for 100 × 100 × 100 records with individual inserts:
Data insert:
Real time: 99m 12.790s
User time: 2m 3.293s
System time: 9m 57.832s
CPU time: 12m 0s
data.sql size: 114.1 MB
data.db size: 107.4 MB
Query:
Q1: select * from files where id=82056036;
CPU Time: 0.007s
Once we modify the schema we will also need to verify and adjust the code to reflect these changes, see https://github.com/vkuznet/transfer2go/blob/master/core/catalog.go#L125
insert_datasets.sql needs neither datasetid nor blockid, while insert_blocks.sql does need datasetid, similar to how it's done in insert_files.sql.
@vkuznet In the schema we have datasetid as well as blockid, while in the insertion template we don't have blockid or datasetid. Do you want to make changes to the insertion template?
https://github.com/vkuznet/transfer2go/blob/master/static/sql/sqlite3/insert_datasets.sql
The new benchmark to insert data:
User Time: 2m 6.781s
Sys Time: 9m 52.826s
CPU Time: 11m 59s
data.sql size: 135.4 MB
data.db size: 150.3 MB
Benchmark of queries:
User Time: 0m 0.002s
Sys Time: 0m 0.004s
CPU Time: 0m 0.006s
@vkuznet How do you want to benchmark queries?
@vkuznet Check out the last two commits.
The code now looks fine, but we need to test random queries. To do that, you can fetch all datasets via a "select * from datasets" query, then pick a random one and measure the time to find the files for that random dataset.
@vkuznet That is what I am doing.
You need to capture the results of the select datasets query, e.g.
datasets=$(sqlite3 testdb "select * from datasets");
and then you can pick a dataset from $datasets. For instance, you can organize a for loop, and for a given dataset from $datasets you can execute the block or file queries.
rishiloyola commented on this pull request.
@@ -4,10 +4,9 @@ echo "Time taken to insert the data"
 time sqlite3 test.db < data.sql
 echo "Time taken to query"
-time sqlite3 test.db "select * from datasets where id=59;" >/dev/null 2>&1 \
-&& sqlite3 test.db "select * from blocks where datasetid=59;" >/dev/null 2>&1 \
-&& sqlite3 test.db "select * from blocks where datasetid=59 AND id=59066;" >/dev/null 2>&1 \
-&& sqlite3 test.db "select * from blocks where datasetid=59 AND id=59033;" >/dev/null 2>&1 \
-&& sqlite3 test.db "select * from files where datasetid=59 AND id=59033 AND id=59033004;" >/dev/null 2>&1 \
-&& sqlite3 test.db "select * from files where datasetid=59 AND id=59033 AND id=59033019;" >/dev/null 2>&1 \
-&& sqlite3 test.db "select * from files where lfn=\"/1GN/XBy/iFl-/1GN/XBy/iFl#XCAo-y6F4v.root\";" >/dev/null 2>&1
+time sqlite3 test.db << EOF >/dev/null 2>&1
+_datasets = select * from datasets;
+_blocks = select * from blocks where datasetid=(select id from _datasets where dataset="/41Z/6Ik/KAy");
How can I generate random datasets? Do you want me to write a random generator function for this?
No, these lines show the queries we'll use to insert data into the db. And for the tests we care mostly about look-up of data.
Are you trying to benchmark these queries? Link: https://github.com/vkuznet/transfer2go/blob/master/core/catalog.go#L125-L172
New benchmark for queries:
Real Time: 0m 2.374s
User Time: 0m 1.385s
Sys Time: 0m 0.459s
CPU Time: 0m 1.7s
@vkuznet You can now merge this.
Reference issue #14