I am trying to fine-tune the ColBERT checkpoint on Google Colab using the following code:
from colbert.infra import Run, RunConfig, ColBERTConfig
from colbert import Trainer

checkpoint = 'colbert-ir/colbertv2.0'

if __name__ == '__main__':
    with Run().context(RunConfig(nranks=1, experiment="wtb")):
        config = ColBERTConfig(bsize=32, lr=3e-06, warmup=None, doc_maxlen=180, dim=128, nway=2, accumsteps=1, use_ib_negatives=False)
        trainer = Trainer(
            triples="/content/drive/MyDrive/QS/WTB/output.jsonl",
            queries="/content/drive/MyDrive/QS/WTB/train_queries.tsv",
            collection="/content/drive/MyDrive/QS/WTB/collection.tsv",
            config=config,
        )

        checkpoint_path = trainer.train(checkpoint=checkpoint)
        print(f"Saved checkpoint to {checkpoint_path}...")
However, the process appears to hang at `trainer.train`, with `#> Starting...` as the only output. When I forcibly stop the cell, this is the stack trace it shows:
KeyboardInterrupt Traceback (most recent call last)
<ipython-input-8-5f9b6f9a73c7> in <cell line: 19>()
26 )
27
---> 28 checkpoint_path = trainer.train(checkpoint=checkpoint)
29 print(f"Saved checkpoint to {checkpoint_path}...")
30
6 frames
/content/ColBERT/colbert/trainer.py in train(self, checkpoint)
29 launcher = Launcher(train)
30
---> 31 self._best_checkpoint_path = launcher.launch(self.config, self.triples, self.queries, self.collection)
32
33
/content/ColBERT/colbert/infra/launcher.py in launch(self, custom_config, *args)
75 # TODO: If the processes crash upon join, raise an exception and don't block on .get() below!
76 print("BB", all_procs)
---> 77 return_values = sorted([return_value_queue.get() for _ in all_procs])
78 return_values = [val for rank, val in return_values]
79
/content/ColBERT/colbert/infra/launcher.py in <listcomp>(.0)
75 # TODO: If the processes crash upon join, raise an exception and don't block on .get() below!
76 print("BB", all_procs)
---> 77 return_values = sorted([return_value_queue.get() for _ in all_procs])
78 return_values = [val for rank, val in return_values]
79
/usr/lib/python3.10/multiprocessing/queues.py in get(self, block, timeout)
101 if block and timeout is None:
102 with self._rlock:
--> 103 res = self._recv_bytes()
104 self._sem.release()
105 else:
/usr/lib/python3.10/multiprocessing/connection.py in recv_bytes(self, maxlength)
214 if maxlength is not None and maxlength < 0:
215 raise ValueError("negative maxlength")
--> 216 buf = self._recv_bytes(maxlength)
217 if buf is None:
218 self._bad_message_length()
/usr/lib/python3.10/multiprocessing/connection.py in _recv_bytes(self, maxsize)
412
413 def _recv_bytes(self, maxsize=None):
--> 414 buf = self._recv(4)
415 size, = struct.unpack("!i", buf.getvalue())
416 if size == -1:
/usr/lib/python3.10/multiprocessing/connection.py in _recv(self, size, read)
377 remaining = size
378 while remaining > 0:
--> 379 chunk = read(handle, remaining)
380 n = len(chunk)
381 if n == 0:
KeyboardInterrupt:
Is there an issue with multiprocessing on Colab? I am also training on a very small subset of the data, so data size should not be an issue.
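For reference, the trace shows the parent process blocked in return_value_queue.get(), and the TODO comment in launcher.py says that this call blocks if the child processes crash. Below is a minimal sketch of that failure mode and of one way to surface it. It is not ColBERT's code: crashing_worker and collect_results are hypothetical names, and the worker simply exits with code 1 to stand in for a training process that dies before reporting a result. I am assuming, not asserting, that something similar is happening in my run.

```python
import multiprocessing as mp
import queue


def crashing_worker(rank, result_queue):
    # Hypothetical worker: exits before putting anything on the queue,
    # standing in for a training process that dies during startup.
    raise SystemExit(1)


def collect_results(procs, result_queue, poll_seconds=0.5):
    """Return one result per process, raising if a worker died first."""
    results = []
    while len(results) < len(procs):
        try:
            # Wait with a timeout instead of blocking indefinitely.
            results.append(result_queue.get(timeout=poll_seconds))
        except queue.Empty:
            # Nothing arrived yet: check whether any worker exited abnormally.
            failed = [p for p in procs
                      if not p.is_alive() and p.exitcode not in (0, None)]
            if failed:
                raise RuntimeError(
                    f"worker exited with code {failed[0].exitcode} "
                    "before returning a result"
                )
    return sorted(results)


if __name__ == "__main__":
    ctx = mp.get_context("spawn")
    result_queue = ctx.Queue()
    procs = [ctx.Process(target=crashing_worker, args=(0, result_queue))]
    for p in procs:
        p.start()
    try:
        print(collect_results(procs, result_queue))
    except RuntimeError as err:
        print(f"Detected failure: {err}")
    finally:
        for p in procs:
            p.join()
```

With a plain get() and no timeout, the parent in this sketch would wait indefinitely, since nothing is ever put on the queue. Polling with a timeout and checking each process's exitcode turns the silent hang into an error. The sketch only treats non-zero exit codes as failures, and it is meant to be run as a script rather than pasted into a notebook cell.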