Open: shaoyijia opened this issue 1 year ago
Do you know where it crashes exactly? How much RAM do you have?
My RAM size is 216G. I think the problem happens in Launcher.launch(): a process is killed and the program crashes. I tracked memory usage through htop when I reran the code - the program crashed exactly when memory usage exceeded the limit. The log from the console is as follows:
Namespace(checkpoint='./workdir/colbert_model', collection='{myfile}.tsv', doc_maxlen=140, experiment_name='stackoverflow', index_name='stackoverflow.2bits', nbits=2, nranks=1, test_query='When should I use unique_ptr in C++?')
[Aug 16, 13:18:51] #> Loading collection...
collection_path: {myfile}.tsv
0M 1M 2M 3M 4M 5M 6M 7M 8M 9M 10M 11M 12M 13M 14M 15M 16M 17M 18M 19M 20M 21M 22M 23M 24M 25M 26M 27M 28M 29M 30M 31M 32M 33M
example passage from collection: "Question: How to convert Decimal to Double in C#? Context of the question: <p>I want to assign the decimal variable "trans" to the double variable "this.Opacity".</p> <pre class=""lang-cs prettyprint-override""><code>decimal trans = trackBar1.Value / 5000; this.Opacity = trans; </code></pre> <p>When I build the app it gives the following error:</p> <blockquote> <p>Cannot implicitly convert type decimal to double</p> </blockquote> | Answer at 2008-07-31T22:17:57.883 [voting=525]: <p>An explicit cast to <code>double</code> like this isn't necessary:</p> <pre><code>double trans = (double) trackBar1.Value / 5000.0; </code></pre> <p>Identifying the constant as <code>5000.0</code> (or as <code>5000d</code>) is sufficient:</p> <pre><code>double trans = trackBar1.Value / 5000.0; double trans = trackBar1.Value / 5000d; </code></pre> "
[Aug 16, 13:23:16] #> Creating directory {mydir}/stackoverflow.2bits
#> Starting...
Killed
make: *** [Makefile:124: index-domain-corpus] Error 137
(colbert) {my user info}$ Traceback (most recent call last):
File "<string>", line 1, in <module>
File "/home/yijia/.conda/envs/colbert/lib/python3.8/multiprocessing/spawn.py", line 116, in spawn_main
exitcode = _main(fd, parent_sentinel)
File "/home/yijia/.conda/envs/colbert/lib/python3.8/multiprocessing/spawn.py", line 126, in _main
self = reduction.pickle.load(from_parent)
_pickle.UnpicklingError: pickle data was truncated
Oh. This seems simple. How many processes are you launching?
Each of them loads the full tsv, so it seems like your run crashes just from storing the strings in memory.
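As a rough sanity check (hypothetical paths and numbers, not from this issue), the on-disk tsv size multiplied by the number of processes that each load it is a lower bound on RAM usage, before Python's per-string overhead:

import os

tsv_path = "collection.tsv"   # hypothetical path to the collection tsv
nranks = 1                    # number of indexing processes being launched

tsv_gb = os.path.getsize(tsv_path) / 1e9
# Each process holds the whole collection as Python str objects, which add
# per-object overhead on top of the raw bytes, so real usage is higher.
print(f"tsv on disk: {tsv_gb:.1f} GB; lower bound across processes: ~{tsv_gb * nranks:.1f} GB")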
Sorry, but how do I check how many processes I'm launching?
The main part of my code follows the readme file:
with Run().context(RunConfig(nranks=args.nranks, experiment=args.experiment_name)):
    config = ColBERTConfig(doc_maxlen=args.doc_maxlen, nbits=args.nbits)
    indexer = Indexer(checkpoint=args.checkpoint, config=config)
    indexer.index(name=args.index_name, collection=collection, overwrite=True)
I already set args.nranks=1.
Btw, thank you so much for the prompt reply!!
My branch allows for in-memory indexing. Instead of passing a list of documents, you could probably also pass an iterator that loads documents from disk one at a time. It most likely won't work off the bat, and you will most likely have to change a bit more code in the k-means clustering part, but my branch may be a good starting point.
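For reference, a minimal sketch of such an iterator, assuming the usual "pid<TAB>passage" tsv layout; whether Indexer.index() accepts it without further changes depends on the branch you are running:

# Yields passages one at a time instead of materializing the whole
# collection as a list of strings in memory.
def iter_collection(path):
    with open(path, encoding="utf-8") as f:
        for line in f:
            pid, passage = line.rstrip("\n").split("\t", 1)
            yield passage

collection = iter_collection("collection.tsv")  # hypothetical path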
Hey @fschlatt
Please share code to build an index from a csv file containing, say, a few million rows that can't be loaded into RAM. I tried using iterators, generators, and even other data structures, but I couldn't make it run - all it accepts is a list in memory.
If you have made it run on a csv or txt file on disk, please share the code.
Thanks.
My branch doesn't support an iterator, but adds support for in-memory collections. With some minor modifications, you should be able to get it working with an iterator: https://github.com/fschlatt/ColBERT
A good place to start is here: https://github.com/fschlatt/ColBERT/blob/541b9e73edbe61c7a86e789a87980c4f09bf6053/colbert/data/collection.py#L18
The Collection class currently supports dictionaries or loading a tsv file. If you add support for an iterator, it might already work.
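For illustration only, a rough sketch of what that might look like, assuming the branch keeps an upstream-style cast() entry point that dispatches on the collection's type (the actual hooks in fschlatt/ColBERT may differ):

from collections.abc import Iterator

class Collection:
    def __init__(self, path=None, data=None):
        self.path = path
        self.data = data  # a list, a dict, or a lazy iterator of passages

    @classmethod
    def cast(cls, obj):
        if isinstance(obj, str):
            return cls(path=obj)      # tsv file on disk
        if isinstance(obj, (list, dict)):
            return cls(data=obj)      # in-memory collection
        if isinstance(obj, Iterator):
            # Lazy collection: downstream code (e.g. the k-means sampling
            # step) must not assume len() or random access here.
            return cls(data=obj)
        if isinstance(obj, cls):
            return obj
        raise TypeError(f"cannot cast {type(obj)} to Collection")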
Hi,
Thank you so much for sharing and maintaining the code!
I'm using the off-the-shelf ColBERT v2 for the retrieval part of my system. So, I mainly call the index() API: indexer.index(name=index_name, collection=collection, overwrite=True). It works well on a small corpus, but when I move to a large corpus (the tsv collection is about 76G), the indexer cannot work. Specifically, the process gets killed when memory consumption exceeds the maximum memory of my server. Is there a way to use ColBERT for a very large corpus?
I went through previous issues and found that #64 is related. However, I cannot find information about batch retrieval in the readme. Is it still supported, or did I miss anything?
Thank you so much in advance!!