Reduce CDB header size (2048 bytes -> 1024 bytes) - Githubissues

radareorg / sdb

Simple and fast string based key-value database with support for arrays and json

https://www.radare.org/

MIT License

217 stars 62 forks source link

Reduce CDB header size (2048 bytes -> 1024 bytes) #92

Closed deffi420 closed 8 years ago

deffi420 commented 8 years ago

The trick is to compute the subtable length n_i, 0 <= i <= 255, using the subtable pointers pi and p{i+1}. Observe that

    n_i = (p_{i+1} - p_i) / L, where L is the slot length.

So let's not waste 1024 bytes of space to encode n_i.

radare commented 8 years ago

screen shot 2016-01-23 at 23 38 01

this change makes sdb a bit slower

radare commented 8 years ago

in percentages:

database seems to be 15% smaller
speed is 1% slower

radare commented 8 years ago

For some cases it will be faster to precalculate all the n_is and keep it in memory until the sdb instance is gone. But this seems to be a fair exchange between dbsize and performance

deffi420 commented 8 years ago

2016-01-24-014702_908x767_scrot The running times are pretty much the same for me (CFLAGS="-O2"). Nonetheless, I'll try to optimize the n_i computation tomorrow.

radare commented 8 years ago

For 100K rows , the new implementation takes 0.370s vs 0.447s to generate the database. so it means that syncing is faster, which is very important operation because its never updating the disk file. (18% faster)

and with 100K rows, the database is much smaller. 4.1MB vs 3.3MB.

We are missing a test for retrieving, i’ll add it now, so i think the benchmarks clearly show that this thing must be merged :)

Send those optimizations in a separate PR.

On 24 Jan 2016, at 00:51, deffi420 notifications@github.com wrote:

https://cloud.githubusercontent.com/assets/5674089/12533570/5c7bf15a-c23c-11e5-820e-e9f53d093eaa.png The running times are pretty much the same for me (CFLAGS="-O2"). Nonetheless, I'll try to optimize the n_i computation tomorrow.

— Reply to this email directly or view it on GitHub https://github.com/radare/sdb/pull/92#issuecomment-174235228.

radare commented 8 years ago

It’s obvious, but worthless to say that this change makes databases built with the previous code are no longer accessible. this will break for example the information in projects in r2.

as long as sdb deosnt puts any header in the database, there’s no way to specify a version number or filetype and we cant handle this. But i think we can go forward, and just warn users about this breakage, but i dont think its gonna be a huge problem

On 24 Jan 2016, at 00:51, deffi420 notifications@github.com wrote:

https://cloud.githubusercontent.com/assets/5674089/12533570/5c7bf15a-c23c-11e5-820e-e9f53d093eaa.png The running times are pretty much the same for me (CFLAGS="-O2"). Nonetheless, I'll try to optimize the n_i computation tomorrow.

— Reply to this email directly or view it on GitHub https://github.com/radare/sdb/pull/92#issuecomment-174235228.

deffi420 commented 8 years ago

I'm afraid there's something wrong. The database should be 1024 bytes smaller, not 15%...

2016-01-24-022217_1359x767_scrot

For some reason the new CDB hashmap is 100% full (i.e., collisions occur all the time). We need some empty slots to speed up lookups.

deffi420 commented 8 years ago

Ok, the problem is here https://github.com/radare/sdb/commit/90a6981271b611dc880ed06d4f913e04476af285#diff-707f52f9224c9ce39cb59909122d1581R150

count is the number of reserved slots in a subtable, and len is the total length of the subtable. We do not want these to be equal because then there is no empty slots and hash collisions occur 100% of the time. So we should revert the change back to: len = count << 1.

My bad...