seems odd that the recursion would be so deep! it feels like that's a sign something with the tree splitting is off, like for instance the splits end up being very imbalanced. is there something "weird" about your data?
@erikbern Don't know much about the algorithm beneath, sorry; I may have a look to understand a bit more why it gets that deep... The splits are very imbalanced indeed (if I can measure it via the "no hyperplane found" warnings, I'd say it's super imbalanced even ^^), but nothing "weird" about the data: they are 10M vectors produced by a neural network encoder trained for dense retrieval (i.e. retrieving results for a query by encoding the query and scanning an index of vectors for approximate nearest neighbors). I say nothing "weird" because when I do get it to work, the results are fairly good!
Another fix I did not mention would be to increase the stack size, but I'm not sure that would help much for all use cases (unless I'm the only one encountering this kind of issue).
ok, for typical datasets, the "no hyperplane found" should be very rare, so i think something might be off with the data. do you have a lot of degenerate data? as in, lots of vectors that are identical? just thinking if there's something obviously "weird"
One thought would be that making this condition slightly more liberal should resolve the issue: https://github.com/spotify/annoy/blob/master/src/annoylib.h#L1213
eg instead of checking if either subset is empty, maybe check for say up to 98-2% imbalance or similar. The max recursion depth would then be roughly -log(n)/log(0.98) which for n=10M would imply about 800, which I think should be well within the max stack size
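as a quick sanity check of that bound (a standalone sketch, nothing from annoy itself): if every split keeps at most 98% of the items on one side, the largest subset after d levels has at most n * 0.98^d items, so the depth is bounded by log(n) / -log(0.98):

#include <cmath>
#include <cstdio>

int main() {
  // depth d at which n * 0.98^d drops below 1, i.e. d = log(n) / -log(0.98)
  const double sizes[] = {1e6, 1e7, 1e8};
  for (double n : sizes)
    std::printf("n = %.0e -> max depth ~ %.0f\n", n, std::ceil(std::log(n) / -std::log(0.98)));
  return 0;
}

for n = 1e7 that comes out to ~800, as above.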
Hi @erikbern, sorry for not answering before.
Currently the _make_tree() function is called roughly 4,000 times with a 10M dataset like mine. I attached a recursion.log in which I printed the recursion depth along with the "no hyperplane found" messages (this is pass 0 up until the crash). I now have datasets that make this recursion crash even when changing stack objects to pointers. I guess a lot is being pushed onto each stack frame and it's definitely more than can be handled.

I do agree that changing https://github.com/spotify/annoy/blob/master/src/annoylib.h#L1213 could ensure that splits are not too uneven and reduce the recursion depth. I can give it a try and let you know. Hopefully this won't hurt accuracy too much: I don't know how much this randomness matters compared to when the split boundary is obtained from statistics in the create_split() function... 🤔
I still find it quite odd that your dataset is so hard to split. Any way you can share the data?
Yes sure: it's roughly 8.8M vectors of size 256 taken from the output of a neural net I've coded and trained for "dense" retrieval (inspired by this paper, though slightly adapted). I've stored them in 18 TF_Records (I don't know if you use TensorFlow, but if not I can surely give them to you in a friendlier format). Each of those 18 files is roughly 500 MB. Tell me how you want me to transfer them to you.
Here's a Python script that can help you play around:
import os

import tensorflow as tf
import annoy


def read_encoded_passages(passage_encoded_file: str) -> tf.data.Dataset:
    """
    Reads a TF_Record file that was stored such that each record contains a dictionary:
    > {
    >     'vector': vector of the passage,
    >     'identifier': id of the passage
    > }
    """
    def _extract_fn(record: str):
        sample = tf.io.parse_single_example(
            record,
            {
                'vector': tf.io.FixedLenSequenceFeature([], tf.float32, allow_missing=True),
                'identifier': tf.io.FixedLenSequenceFeature([], tf.int64, allow_missing=True)
            })
        return {
            'vector': sample['vector'],
            'identifier': tf.cast(sample['identifier'], tf.int32)
        }

    return tf.data.TFRecordDataset(
        [passage_encoded_file]
    ).map(
        _extract_fn,
        num_parallel_calls=tf.data.experimental.AUTOTUNE
    ).prefetch(
        tf.data.experimental.AUTOTUNE
    )


def main(encoded_passages_files: list,
         nb_trees: int = 50):
    """
    Builds an ANN index listing all the data in the provided file paths.
    """
    ann_index = annoy.AnnoyIndex(256, 'dot')
    ann_index.verbose(True)
    for encoded_passages_file in encoded_passages_files:
        print('Indexing passage vectors %s... ' % encoded_passages_file)
        passages_dataset = read_encoded_passages(encoded_passages_file)
        for passage in passages_dataset:
            passage_id = passage['identifier'].numpy().tolist()[0]
            passage_vector = passage['vector'].numpy().tolist()
            ann_index.add_item(passage_id, passage_vector)
    # Build once all items have been added (no items can be added after build()).
    ann_index.build(nb_trees)


if __name__ == '__main__':
    DATA_PATH = '/path/to/directory/with/all/18/files/inside'
    main(
        encoded_passages_files=[os.path.join(DATA_PATH, 'passages.tfrecord_%d.encoded' % i) for i in range(0, 18)]
    )
By the way, I did replace the condition requiring at least one element on each side with:
// Returns true when a new random split should be drawn, i.e. when either side
// is empty or the split is more imbalanced than 98%/2%.
bool _draw_random_split(const vector<S>& left_indices, const vector<S>& right_indices) {
  if (left_indices.empty() || right_indices.empty())
    return true;
  float ls = (float)left_indices.size();
  float rs = (float)right_indices.size();
  float f = ls / (ls + rs);
  return f < 0.02 || f > 0.98;
}

S _make_tree(const vector<S>& indices, bool is_root) {
  ...
  while (_draw_random_split(children_indices[0], children_indices[1])) {
    ....
  }
}
and it works like a charm for building. I haven't yet tested, on a dataset I know well, whether the search accuracy is impacted, though I don't think it should be. Thanks for the pointers @erikbern.
it would be amazing if you can send me a pull request for the code snippet you just shared!
the dataset sounds huge – maybe a bit hard to send to me. if you can do some basic sanity check that would be interesting. like do you have a lot of vectors that are identical? (or even more degenerate – maybe they are all the zero vector or something like that?)
Will push a PR today (this part is much easier to isolate than the ftruncate issue 😉).
As for the dataset, it's been trained to have a very particular topology: there's a huge cluster of roughly half the dataset that is isolated (a bunch of garbage that the model hasn't been trained to retrieve at all). The rest is grouped by semantic similarity. Doing t-SNE visualization does not provide much more insight, sorry...
FWIW, I have a similar dataset, and was seeing super deep stacks (~700) and long processing times with the latest HEAD (specifically after 6088276a7a677ef1ea61ab886baa3919a258656a reverted the _draw_random_split patch). With the _draw_random_split patch, things are much improved.

Version: 1.16.3.
Large datasets may crash silently with a stack overflow (quite hard to catch as an end-user). I can provide one such dataset, composed of 10M vectors of size 256, that crashes on my 64-bit Windows machine.
Actually there are several parts of the code that use stack allocation and may cause this issue:

- _make_tree() is called recursively and a pointer to a Node (namely m in the code) is stack allocated. On large datasets I found this function gets called up to a thousand times.
- create_split() (for me it was the DotProduct implementation) stack-allocates two pointers to Node<S, T>. There's actually a "deeper" issue with that part: the function is marked for being inlined by the compiler... So it can get pretty messy. I know standard compilers would never inline a function that does stack allocation on their own, but I'm not sure what happens when you explicitly ask them to 🤔 There are a few non-conclusive threads on StackOverflow about this, e.g. here and also in the 4th answer here... My point is: this function is called within the recursion of _make_tree(), so the lifetime of those inlined stack objects can become very long, and I don't think stack allocation here is very wise. I don't know how big they are (I haven't deep-dived into the algorithm yet), but the number of them (again, up to thousands) surely blows the stack.

I can fix it quite easily by replacing alloca() with standard heap allocation and freeing the memory at the end of the functions (so that only pointers, 32 or 64 bits, sit on the stack, which mitigates the recursive part...). Another fix (that I haven't tried) would be to keep stack allocation for the two pointers in create_split() but not inline the function (so that the memory gets freed when the function exits). Yet in _make_tree() it would be safer to use heap allocation I guess (again, I don't know the size of that structure, but stack allocation in a recursive function is always bug-prone if we can't control how deep the recursion will be).
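For illustration, the kind of change I have in mind looks roughly like this; just a sketch with made-up names (create_split_like, node_size), not the actual annoylib.h code:

#include <cstdlib>

// Sketch: instead of alloca()-ing node buffers in functions that sit on a deep
// recursion path, allocate them on the heap and free them before returning, so
// each stack frame only holds a pointer.
void create_split_like(std::size_t node_size) {
  // before: Node* p = (Node*)alloca(node_size);  // lives in this stack frame
  void* p = std::malloc(node_size);  // heap: only the pointer is on the stack
  void* q = std::malloc(node_size);
  // ... compute the split using p and q ...
  std::free(p);
  std::free(q);
}

int main() {
  create_split_like(sizeof(double) * 256);  // e.g. a 256-dim node payload
  return 0;
}

Wrapping those allocations in something RAII-style (e.g. std::unique_ptr with a custom deleter) would also guarantee the memory is freed on early returns.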