stanford-futuredata / ColBERT

ColBERT: state-of-the-art neural search (SIGIR'20, TACL'21, NeurIPS'21, NAACL'22, CIKM'22, ACL'23, EMNLP'23)

RuntimeError: quantile() input tensor is too large #299

Closed: liuqi6777 closed this issue 5 months ago

liuqi6777 commented 5 months ago

Hello, I am trying to index some documents using a ColBERT model I finetuned myself. For my own reasons, I set the output dimension of the embeddings to 768 instead of the original 128, but I get this error when I run the indexing code:

Traceback (most recent call last):
  File "/home/qiliu/miniconda3/envs/colbert/lib/python3.8/multiprocessing/process.py", line 315, in _bootstrap
    self.run()
  File "/home/qiliu/miniconda3/envs/colbert/lib/python3.8/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/home/qiliu/workspace/ColBERT/colbert/infra/launcher.py", line 134, in setup_new_process
    return_val = callee(config, *args)
  File "/home/qiliu/workspace/ColBERT/colbert/indexing/collection_indexer.py", line 33, in encode
    encoder.run(shared_lists)
  File "/home/qiliu/workspace/ColBERT/colbert/indexing/collection_indexer.py", line 68, in run
    self.train(shared_lists) # Trains centroids from selected passages
  File "/home/qiliu/workspace/ColBERT/colbert/indexing/collection_indexer.py", line 237, in train
    bucket_cutoffs, bucket_weights, avg_residual = self._compute_avg_residual(centroids, heldout)
  File "/home/qiliu/workspace/ColBERT/colbert/indexing/collection_indexer.py", line 331, in _compute_avg_residual
    bucket_cutoffs = heldout_avg_residual.float().quantile(bucket_cutoffs_quantiles)
RuntimeError: quantile() input tensor is too large
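
For context, this message seems to come from a hard limit inside torch.quantile itself rather than from ColBERT; a minimal sketch of that limit, assuming the roughly 2**24-element cap present in PyTorch releases of this period:

import torch

# Illustrative only: torch.quantile rejects inputs above an internal size
# limit (about 2**24, i.e. ~16.7M elements, in PyTorch versions of this era).
q = torch.tensor([0.25, 0.5, 0.75])

small = torch.rand(1_000_000)    # well under the limit: works
print(small.quantile(q))

big = torch.rand(20_000_000)     # over the limit: reproduces the error above
try:
    print(big.quantile(q))
except RuntimeError as e:
    print(e)                     # quantile() input tensor is too large

With a 768-dimensional checkpoint, the flattened residual tensor quantiled at collection_indexer.py line 331 holds six times as many elements as with 128 dimensions, which is plausibly what pushes it over that limit.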

Here is my indexing code:

import argparse
from colbert.infra import Run, RunConfig, ColBERTConfig
from colbert import Indexer

if __name__=='__main__':
    parser = argparse.ArgumentParser()
    parser.add_argument('--checkpoint', type=str, required=True)
    parser.add_argument('--experiment', type=str, required=True)
    parser.add_argument('--index_name', type=str, required=True)
    parser.add_argument('--collection', type=str, required=True)
    args = parser.parse_args()

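    # Single-process indexing run (nranks=1); nbits=2 stores each residual dimension in 2 bits.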
    with Run().context(RunConfig(nranks=1, experiment=args.experiment)):
        config = ColBERTConfig(nbits=2, root="experiments")
        indexer = Indexer(checkpoint=args.checkpoint, config=config)
        indexer.index(name=args.index_name, collection=args.collection)

I can index the documents correctly when using the colbertv2.0 checkpoint, so I suspect the embedding output dimension is the reason for this error. I have tried setting nbits to larger values like 4 or 8, but that didn't help. How can I solve this problem?

Thanks in advance!

okhat commented 5 months ago

This error comes from PyTorch: the input tensor is too large. A bit strange. Is your dataset extremely large, like over 100M docs?

liuqi6777 commented 5 months ago

> This error comes from PyTorch: the input tensor is too large. A bit strange. Is your dataset extremely large, like over 100M docs?

It has 170k documents, which doesn't seem very large.

okhat commented 5 months ago

Okay, I suggest googling that error to understand why PyTorch doesn't like this.

Alternatively, try indexing with the official checkpoint. In general, 768 is way too large for these vectors; you should use 64, 128, or 256 (a power of two in this range).
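
If keeping the 768-dimensional checkpoint is unavoidable, one possible local workaround (a sketch only, not ColBERT code, and assuming the failure really is torch.quantile's element cap) is to subsample the flattened heldout residuals before the quantile call from the traceback. The tensor names and shapes below are stand-ins for the ones built in _compute_avg_residual:

import torch

# Stand-ins for the tensors from _compute_avg_residual; the shapes are only
# illustrative (a hypothetical heldout sample with a 768-dim checkpoint).
heldout_avg_residual = torch.rand(50_000, 768)
bucket_cutoffs_quantiles = torch.tensor([0.25, 0.5, 0.75])

# Keep the quantile input under torch.quantile's size limit (~2**24 elements)
# by random subsampling; quantiles of a large subsample track the full tensor closely.
MAX_QUANTILE_ELEMS = 16_000_000
flat = heldout_avg_residual.float().flatten()
if flat.numel() > MAX_QUANTILE_ELEMS:
    idx = torch.randperm(flat.numel())[:MAX_QUANTILE_ELEMS]
    flat = flat[idx]

bucket_cutoffs = flat.quantile(bucket_cutoffs_quantiles)
print(bucket_cutoffs)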

liuqi6777 commented 5 months ago

Thanks for your reply! I will try to fix it based on your suggestion, and I will update this issue if there is any progress :)