stanford-futuredata / ColBERT

ColBERT: state-of-the-art neural search (SIGIR'20, TACL'21, NeurIPS'21, NAACL'22, CIKM'22, ACL'23, EMNLP'23)
MIT License

How to index large corpus which cannot be loaded into the memory? #234

Open shaoyijia opened 1 year ago

shaoyijia commented 1 year ago

Hi,

Thank you so much for sharing and maintaining the code!

I'm using the off-the-shelf ColBERT v2 for the retrieval part of my system, so I mainly call the index() API: indexer.index(name=index_name, collection=collection, overwrite=True). It works well on a small corpus, but when I try to move to a large corpus (the TSV collection is about 76 GB), the indexer fails: the process gets killed once memory consumption exceeds my server's maximum.

Is there a way to use ColBERT for a very large corpus?

I went through previous issues and found that #64 is related. However, I cannot find any information about batch retrieval in the README. Is it still supported, or did I miss something?

Thank you so much in advance!!

okhat commented 1 year ago

Do you know where it crashes exactly? How much RAM do you have?

shaoyijia commented 1 year ago

My RAM size is 216 GB. I think the problem happens in Launcher.launch(): a process is killed and the program crashes. I tracked memory usage through htop when I reran the code - the program crashed exactly when memory usage exceeded the limit.

Log from the console is as follows:

Namespace(checkpoint='./workdir/colbert_model', collection='{myfile}.tsv', doc_maxlen=140, experiment_name='stackoverflow', index_name='stackoverflow.2bits', nbits=2, nranks=1, test_query='When should I use unique_ptr in C++?')
[Aug 16, 13:18:51] #> Loading collection...
collection_path:  {myfile}.tsv
0M 1M 2M 3M 4M 5M 6M 7M 8M 9M 10M 11M 12M 13M 14M 15M 16M 17M 18M 19M 20M 21M 22M 23M 24M 25M 26M 27M 28M 29M 30M 31M 32M 33M
example passage from collection:  "Question: How to convert Decimal to Double in C#? Context of the question: <p>I want to assign the decimal variable &quot;trans&quot; to the double variable &quot;this.Opacity&quot;.</p> <pre class=""lang-cs prettyprint-override""><code>decimal trans = trackBar1.Value / 5000; this.Opacity = trans; </code></pre> <p>When I build the app it gives the following error:</p> <blockquote> <p>Cannot implicitly convert type decimal to double</p> </blockquote>  | Answer at 2008-07-31T22:17:57.883 [voting=525]: <p>An explicit cast to <code>double</code> like this isn't necessary:</p>  <pre><code>double trans = (double) trackBar1.Value / 5000.0; </code></pre>  <p>Identifying the constant as <code>5000.0</code> (or as <code>5000d</code>) is sufficient:</p>  <pre><code>double trans = trackBar1.Value / 5000.0; double trans = trackBar1.Value / 5000d; </code></pre> "

[Aug 16, 13:23:16] #> Creating directory {mydir}/stackoverflow.2bits

#> Starting...
Killed
make: *** [Makefile:124: index-domain-corpus] Error 137
(colbert) {my user info}$ Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/home/yijia/.conda/envs/colbert/lib/python3.8/multiprocessing/spawn.py", line 116, in spawn_main
    exitcode = _main(fd, parent_sentinel)
  File "/home/yijia/.conda/envs/colbert/lib/python3.8/multiprocessing/spawn.py", line 126, in _main
    self = reduction.pickle.load(from_parent)
_pickle.UnpicklingError: pickle data was truncated

okhat commented 1 year ago

Oh. This seems simple. How many processes are you launching?

Each of them loads the full TSV, so it seems like your run crashes simply from storing the strings in memory.
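
For a rough sense of that footprint, you can stream the TSV once and total the string sizes instead of loading it. Here's a minimal sketch (the helper below is illustrative, not part of ColBERT) that assumes the usual two-column pid \t passage collection format:

    # Rough sketch (not part of ColBERT): estimate the per-process RAM cost of
    # the raw passage strings by streaming the TSV instead of loading it.
    import sys

    def estimate_collection_footprint(tsv_path):
        n_passages, n_bytes = 0, 0
        with open(tsv_path, encoding="utf-8") as f:
            for line in f:
                _pid, passage = line.rstrip("\n").split("\t", 1)
                n_passages += 1
                n_bytes += sys.getsizeof(passage)  # includes Python str overhead
        return n_passages, n_bytes

    n, b = estimate_collection_footprint("collection.tsv")
    print(f"{n} passages, ~{b / 2**30:.1f} GiB of strings per process")

Multiplying the result by the number of launched processes gives the baseline memory cost before any encoding even starts.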

shaoyijia commented 1 year ago

Sorry, but how do I check how many processes I'm launching?

The main part of my code follows the readme file:

    # nranks controls how many indexing processes the launcher spawns
    with Run().context(RunConfig(nranks=args.nranks, experiment=args.experiment_name)):
        config = ColBERTConfig(doc_maxlen=args.doc_maxlen, nbits=args.nbits)

        indexer = Indexer(checkpoint=args.checkpoint, config=config)
        indexer.index(name=args.index_name, collection=collection, overwrite=True)

I already set args.nranks=1.

Btw, thank you so much for the prompt reply!!

fschlatt commented 1 year ago

My branch allows for in-memory indexing. Instead of passing a list of documents, you could probably also pass an iterator that iteratively loads documents from disk. It most likely won't work out of the box, and you will most likely have to change a bit more code in the k-means clustering part. But my branch may be a good starting point:

https://github.com/stanford-futuredata/ColBERT/pull/196
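
For illustration, the iterator idea might look something like the sketch below. The class name is hypothetical and stock ColBERT won't consume it as-is; it reopens the file on every __iter__ so that stages which scan the collection more than once (like the k-means sampling mentioned above) can rescan it:

    # Hypothetical sketch of the lazy-iterator idea; stock ColBERT expects a
    # path or an in-memory list, so this only helps on a patched branch.
    class PassageIterable:
        """Re-iterable passage stream: each __iter__ reopens the file, so
        stages that scan the collection more than once can do so."""

        def __init__(self, tsv_path):
            self.tsv_path = tsv_path

        def __iter__(self):
            with open(self.tsv_path, encoding="utf-8") as f:
                for line in f:
                    _pid, passage = line.rstrip("\n").split("\t", 1)
                    yield passage

    # Usage (assumes a branch patched to accept iterables):
    # indexer.index(name=index_name, collection=PassageIterable("collection.tsv"), overwrite=True)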

ravi-kumar-1010 commented 9 months ago

> My branch allows for in-memory indexing. Instead of passing a list of documents, you could probably also pass an iterator that iteratively loads documents from disk. It most likely won't work out of the box, and you will most likely have to change a bit more code in the k-means clustering part. But my branch may be a good starting point:
>
> #196

Hey @fschlatt

Could you please share code to build an index from a CSV file with, say, a few million rows that can't be loaded into RAM? I tried using iterators, generators, and even other data structures, but I couldn't make it run. As it stands, it only accepts a list in memory.

If you have made it run on a CSV or TXT file on disk, please share the code.

Thanks.

fschlatt commented 9 months ago

My branch doesn't support an iterator, but it adds support for in-memory collections. With some minor modifications, you should be able to get it working with an iterator: https://github.com/fschlatt/ColBERT

A good place to start is here: https://github.com/fschlatt/ColBERT/blob/541b9e73edbe61c7a86e789a87980c4f09bf6053/colbert/data/collection.py#L18

The Collection class currently supports dictionaries or loading a TSV file. If you add support for an iterator, it might already work.
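
For anyone experimenting, a sketch of what that might look like is below. The names and structure are assumptions for illustration, not the branch's actual code, and downstream code that calls len() or indexes passages by position would still need changes:

    # Hedged sketch of extending a Collection-like class to accept any iterable
    # alongside the existing dict / tsv-path cases; not the branch's real code.
    class Collection:
        def __init__(self, path=None, data=None):
            if path is not None:
                self.data = self._load_tsv(path)    # existing: load from disk
            elif isinstance(data, dict):
                self.data = list(data.values())     # existing: in-memory dict
            else:
                self.data = data                    # new: keep the iterable lazy

        def _load_tsv(self, path):
            with open(path, encoding="utf-8") as f:
                return [line.rstrip("\n").split("\t", 1)[1] for line in f]

        def __iter__(self):
            return iter(self.data)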