Currently we hard-code 65536 as the default block size, which wastes space, especially when working with relatively low-dimensional embeddings (like 64d).
This PR introduces a `max_edges=[int]` index parameter and also estimates a reasonable default when the parameter is omitted.
The logic behind the current estimate is the following:
We don't want uncontrollable write amplification, so we should have a hard upper bound on the disk amplification caused by neighbour redundancy in every node block. The current hard limit is 50x redundancy compared to the node storage overhead (note that with neighbour compression this hard limit will automatically adjust and allow us to store more neighbours per node block).
The number of neighbours should grow with the dimension of the embedding space. I don't have references for why something like $D^{\frac{1}{C}}$ (where $D$ is the space dimensionality) must be used, but I have hand-wavy arguments like this: with $D^{\frac{1}{C}}$ neighbours for every vertex we will have $O(D)$ neighbours within a $C$-radius in our graph, and with $O(D)$ neighbours (if they are random enough) we have the ability to move in a direction that strictly improves our distance (generally, in a $D$-dimensional metric space you need at least $D+1$ neighbours (a simplex) to guarantee an improvement in some direction).
In my experiments $2\sqrt{D} \ldots 3\sqrt{D}$ shows pretty good 10-recall@10 values (90%+), so I chose $3\sqrt{D}$ as the second part of the `max_edges` estimate.
Overall, the formula looks like this:
$\text{max\_edges} = \min\left( 3\sqrt{D},\; 50\,\frac{\text{node\_overhead}(type, dims)}{\text{edge\_overhead}(type, dims)} \right)$
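As an illustrative sketch of how this estimate could be computed (the real `node_overhead`/`edge_overhead` depend on the storage format; the byte counts below are placeholder assumptions, not the actual accounting in the index code):

```python
import math

# Placeholder overhead model: a node block stores the vector itself
# plus some fixed metadata (assumed sizes, for illustration only).
def node_overhead(dims, bytes_per_component=4, header_bytes=64):
    return dims * bytes_per_component + header_bytes

def edge_overhead():
    # One uncompressed neighbour reference, assumed to be an 8-byte id.
    return 8

def default_max_edges(dims):
    # Second part of the estimate: enough neighbours for good 10-recall@10.
    recall_bound = 3 * math.sqrt(dims)
    # First part: cap redundancy at 50x the node storage overhead.
    amplification_bound = 50 * node_overhead(dims) / edge_overhead()
    return int(min(recall_bound, amplification_bound))

print(default_max_edges(64))  # low-dimensional case: the 3*sqrt(D) bound wins
```

For low-dimensional embeddings like 64d the $3\sqrt{D}$ term dominates (24 edges), while the 50x amplification cap only kicks in when node blocks are very small relative to an edge record.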
Changes
- Introduce the `max_edges` parameter in the index creation syntax
- Estimate the default `max_edges` with the formula from above
- Treat the `block_size` binary parameter literally if it is greater than 128, otherwise shift the value by `DISKANN_BLOCK_SIZE_SHIFT` to get the real block size in bytes
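A minimal sketch of the `block_size` interpretation described above. The actual value of `DISKANN_BLOCK_SIZE_SHIFT` lives in the source; 9 is only an assumption here (chosen so that the old default of 128 maps to the previous hard-coded 65536):

```python
DISKANN_BLOCK_SIZE_SHIFT = 9  # assumed value, for illustration only

def real_block_size(block_size: int) -> int:
    # Values above 128 are taken literally as a byte count;
    # smaller values are scaled up by the shift constant.
    if block_size > 128:
        return block_size
    return block_size << DISKANN_BLOCK_SIZE_SHIFT
```

With this assumed shift, `real_block_size(128)` reproduces the old 65536-byte default, while e.g. `real_block_size(4096)` is taken literally as 4096 bytes.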