Closed rishiloyola closed 7 years ago
Benchmark
Time taken to insert 1000 documents: 5 nanoseconds
Time taken for a database lookup: 1 nanosecond
File details:
benchmark.sh - script to measure the time intervals
write.go - generates 1000 documents for the benchmarking
Rishi, I should be more explicit with this PR. The test you did is fine, but not really representative of real life. You generated a "tiny" database with single dataset/file names. Instead we need to test a large database with unique combinations of datasets/blocks/files. Here is what we need:
organize a loop over O(1K) datasets
Such a loop will produce O(1K) unique datasets, where each dataset will have 1K unique blocks and each block will have 1K unique files. The total number of files will be 1K × 1K × 1K, etc. Then you may index the database on dataset, block, and file columns. Check out its growth.
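The nested generation loop can be sketched in Go along these lines (the naming scheme below is made up purely for illustration and does not match transfer2go's actual dataset/block/file conventions; counts are parameters so you can start small):

```go
package main

import "fmt"

// genNames produces nDatasets unique datasets, each with nBlocks unique
// blocks, and each block with nFiles unique files, calling emit for every
// (dataset, block, file) triple. Names here are hypothetical placeholders.
func genNames(nDatasets, nBlocks, nFiles int, emit func(dataset, block, file string)) {
	for d := 0; d < nDatasets; d++ {
		dataset := fmt.Sprintf("/dataset-%04d", d)
		for b := 0; b < nBlocks; b++ {
			block := fmt.Sprintf("%s#block-%04d", dataset, b)
			for f := 0; f < nFiles; f++ {
				file := fmt.Sprintf("%s/file-%04d.root", block, f)
				emit(dataset, block, file)
			}
		}
	}
}

func main() {
	total := 0
	genNames(10, 10, 10, func(d, b, f string) { total++ })
	// with 1K at each level this becomes 1K x 1K x 1K file records
	fmt.Println("generated", total, "file records")
}
```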
Best, Valentin.
I tried to generate and store 1K × 1K × 1K file records, but my system cannot handle that much data. It's almost 25 GB.
Rishi, are you limited by disk space or CPU usage?
In any case, you can reduce the number of blocks and files to 100 each; if that still does not fit on your system, reduce the number of datasets to 100 as well. That gives you 100 × 100 × 100.
It is taking more than half an hour to generate this data, and whenever I try to load it into SQLite my system hangs. I think this is related to CPU usage; I have an old Mac.
You should try bulk inserts. For instance, generate 100 dataset names and keep them around in a list. From them, generate either a single SQL statement via UNION, or multiple INSERT statements within a single transaction, e.g. https://stackoverflow.com/questions/1609637/is-it-possible-to-insert-multiple-rows-at-a-time-in-an-sqlite-database or similar discussions.
Then you can do the same for blocks: generate 100 blocks for a single dataset and insert that chunk. Repeat for every dataset.
The benchmark for 100 × 100 × 100 records with individual inserts:
Data insert:
Real time: 99m 12.790s
User time: 2m 3.293s
System time: 9m 57.832s
CPU time: 12m 0s
data.sql size: 114.1 MB
data.db size: 107.4 MB
Query:
Q1: select * from files where id=82056036;
CPU Time: 0.007s
Once we modify the schema we will also need to verify and adjust the code to reflect these changes, see https://github.com/vkuznet/transfer2go/blob/master/core/catalog.go#L125
insert_datasets.sql needs neither datasetid nor blockid, while insert_blocks.sql does need datasetid, similar to how it's done in insert_files.sql.
@vkuznet In the schema we have datasetid as well as blockid, while in the insertion template we don't have blockid or datasetid. Do you want to make changes to the insertion template?
https://github.com/vkuznet/transfer2go/blob/master/static/sql/sqlite3/insert_datasets.sql
The new benchmark to insert data:
User Time: 2m 6.781s
Sys Time: 9m 52.826s
CPU Time: 11m 59s
data.sql size: 135.4 MB
data.db size: 150.3 MB
Benchmark of queries:
User Time: 0m 0.002s
Sys Time: 0m 0.004s
CPU Time: 0m 0.006s
@vkuznet How do you want to benchmark queries?
@vkuznet Check out the last two commits.
The code now looks fine, but we need to test random queries. To do that, you can fetch all datasets via a "select * from datasets" query, then pick a random one and measure the time to find the files for that random dataset.
@vkuznet That is what I am doing.
You need to capture the results of the select datasets query, e.g.
datasets=$(sqlite3 testdb "select * from datasets");
and then you can pick a dataset from $datasets. For instance, you can organize a for loop, and for a given dataset from $datasets you can execute the block or file queries.
rishiloyola commented on this pull request.
@@ -4,10 +4,9 @@ echo "Time taken to insert the data"
 time sqlite3 test.db < data.sql
 echo "Time taken to query"
-time sqlite3 test.db "select * from datasets where id=59;" >/dev/null 2>&1 \
-&& sqlite3 test.db "select * from blocks where datasetid=59;" >/dev/null 2>&1 \
-&& sqlite3 test.db "select * from blocks where datasetid=59 AND id=59066;" >/dev/null 2>&1 \
-&& sqlite3 test.db "select * from blocks where datasetid=59 AND id=59033;" >/dev/null 2>&1 \
-&& sqlite3 test.db "select * from files where datasetid=59 AND id=59033 AND id=59033004;" >/dev/null 2>&1 \
-&& sqlite3 test.db "select * from files where datasetid=59 AND id=59033 AND id=59033019;" >/dev/null 2>&1 \
-&& sqlite3 test.db "select * from files where lfn=\"/1GN/XBy/iFl-/1GN/XBy/iFl#XCAo-y6F4v.root\";" >/dev/null 2>&1
+time sqlite3 test.db << EOF >/dev/null 2>&1
+_datasets = select * from datasets;
+_blocks = select * from blocks where datasetid=(select id from _datasets where dataset="/41Z/6Ik/KAy");
How can I generate random datasets? Do you want me to write a random generator function for this?
No, these lines show the queries we'll use to insert data into the db. And for the tests we care mostly about look-up of data.
Are you trying to benchmark these queries? Link: https://github.com/vkuznet/transfer2go/blob/master/core/catalog.go#L125-L172
New benchmark for queries:
Real Time: 0m 2.374s
User Time: 0m 1.385s
Sys Time: 0m 0.459s
CPU Time: 0m 1.7s
@vkuznet You can now merge this.
Reference issue #14