terrierteam / pyterrier_colbert

83 stars 35 forks source link

Number of partitions is fixed to 100 #66

Open afurkank opened 9 months ago

afurkank commented 9 months ago

Hi, thanks for this great repo!

In indexing.py, the number of partitions is set to 100 here.

Since this condition will always be false, the index will always consist of 100 partitions.

Is this the intended behavior? Would that affect the retrieval effectiveness?

cmacdonald commented 9 months ago

Hi @afurkank

Thanks for the comments and noting this. Partitions code is a bit of a mess.

I think we were trying to reproduce the logic at https://github.com/cmacdonald/ColBERT/blob/v0.2/colbert/index_faiss.py#L31 which is however initialised to None at: https://github.com/cmacdonald/ColBERT/blob/v0.2/colbert/utils/parser.py#L80

The problem is how the Factory class knows what the partitions setting should be: https://github.com/terrierteam/pyterrier_colbert/blob/main/pyterrier_colbert/ranking.py#L498 (I wanted to avoid having index properties files, as upstream ColBERT doesnt have them)

The easiest hack would be to look for invfqp.*.faiss in https://github.com/terrierteam/pyterrier_colbert/blob/main/pyterrier_colbert/ranking.py#L612-L629 and open the first one, warning if more than one is found.

Craig

afurkank commented 9 months ago

Thanks for the quick response.

The problem is how the Factory class knows what the partitions setting should be: https://github.com/terrierteam/pyterrier_colbert/blob/main/pyterrier_colbert/ranking.py#L498 (I wanted to avoid having index properties files, as upstream ColBERT doesnt have them)

So this shouldn't affect the scores if I understood it correctly?

afurkank commented 9 months ago

So I just did a comparison with the Vaswani dataset between indexing with number of partitions fixed to 100 and when number of partitions is 1 << math.ceil(math.log2(8 * math.sqrt(num_embeddings))). It appears there is very little difference.

For example, when number of partitions is fixed to 100, nDCG@10 for the Vaswani dataset is 0.426272. When number of partitions is 1 << math.ceil(math.log2(8 * math.sqrt(num_embeddings))), nDCG@10 for the same dataset is 0.425488.

cmacdonald commented 9 months ago

The Faiss ANN stage is only for identifying candidates. The 2nd stage reranking process will hide much of the difference.

Vaswani is small enough that probably enough candidate documents would be identified for each query. Even at ColBERT it might be enough. Num partitions would also have a efficiency impact.

Do you have colbert index for msmarco? it would be reasonably straightforward to built faiss indices with both 100 and the default value.

afurkank commented 9 months ago

I do not have the index for msmarco unfortunately. I don't have a PC with enough compute power to index that big of a dataset.

I could open a pull request for fixing the issue of number of partitions being fixed to 100 and the issue of not accepting different faiss indexes with names including the partition number(such as invfqp.*.faiss, as you mentioned earlier) if it would help.

cmacdonald commented 9 months ago

PR would help massively, thank you @afurkank!. We'll just not merge it till we have checked the effectiveness numbers.