google-research / google-research

Google Research
https://research.google
Apache License 2.0

[ScaNN] Choosing a value for training_sample_size #622

Open JonasTriki opened 3 years ago

JonasTriki commented 3 years ago

Hi there!

How should one go about selecting training_sample_size for the tree() and score_ah() methods of the ScannBuilder class? The hyperparameter is not mentioned in the algorithms section. Should one leave it at the default (e.g. 100000) or stick to a value similar to the one in the example notebook (e.g. 250000)? Does it depend on the dataset? In my case, I would like to build a ScaNN index on word embeddings with ~4M rows and 300 features.

Thanks in advance.

Audida commented 3 years ago

Hi, have you figured it out in the meantime? I have the same question.

JonasTriki commented 3 years ago

> Hi, have you figured it out in the meantime? I have the same question.

Nope, I have just left it as 250000 for the time being!

cramraj8 commented 3 years ago

If we are building the index offline, shouldn't the training sample cover all of the samples being indexed?
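
For anyone comparing the values mentioned in this thread, here is a minimal sketch of what each one means for a ~4M-row dataset. It does not say which value is best, since that would need benchmarking recall and build time on your own data. `training_sample_plan` and `build_searcher` are hypothetical helper names. Capping the sample at the dataset size and using sqrt(N) as a starting point for num_leaves are assumptions, not something confirmed in this thread. The builder chain follows the pattern from the ScaNN README example and only runs if the `scann` package is installed.

```python
import math


def training_sample_plan(num_rows, requested_sample_size):
    """Summarise how a requested training_sample_size relates to the dataset."""
    # Assumption: a training sample larger than the dataset is capped at num_rows.
    effective = min(requested_sample_size, num_rows)
    return {
        "training_sample_size": effective,
        "fraction_of_dataset": effective / num_rows,
        # Assumption: sqrt(N) is only a rough starting point for num_leaves.
        "num_leaves_starting_point": round(math.sqrt(num_rows)),
    }


def build_searcher(dataset, training_sample_size, num_leaves):
    """Hypothetical wrapper around the builder chain; needs `scann` installed."""
    import scann  # imported lazily so the helper above runs without ScaNN

    return (
        scann.scann_ops_pybind.builder(dataset, 10, "dot_product")
        .tree(
            num_leaves=num_leaves,
            num_leaves_to_search=100,
            training_sample_size=training_sample_size,
        )
        .score_ah(
            2,
            anisotropic_quantization_threshold=0.2,
            training_sample_size=training_sample_size,
        )
        .reorder(100)
        .build()
    )


if __name__ == "__main__":
    # Compare the default, the notebook value, and "train on everything".
    for requested in (100_000, 250_000, 4_000_000):
        print(training_sample_plan(4_000_000, requested))
```

With 4M rows, 250000 is a 6.25% sample and 100000 is 2.5%. Passing the full row count makes training cover every indexed sample, as suggested above, at the cost of a longer build.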