stanford-futuredata / ColBERT

ColBERT: state-of-the-art neural search (SIGIR'20, TACL'21, NeurIPS'21, NAACL'22, CIKM'22, ACL'23, EMNLP'23)

Process stuck on Launcher while training using example code #254

Open roynirmal opened 1 year ago

roynirmal commented 1 year ago

I am trying to fine-tune the ColBERT checkpoint on Google Colab using the following code:

from colbert.infra import Run, RunConfig, ColBERTConfig
from colbert import Trainer

checkpoint = 'colbert-ir/colbertv2.0'

if __name__ == '__main__':
    with Run().context(RunConfig(nranks=1, experiment="wtb")):
        config = ColBERTConfig(bsize=32, lr=3e-06, warmup=None, doc_maxlen=180, dim=128, nway=2, accumsteps=1, use_ib_negatives=False)
        trainer = Trainer(
            triples="/content/drive/MyDrive/QS/WTB/output.jsonl",
            queries="/content/drive/MyDrive/QS/WTB/train_queries.tsv",
            collection="/content/drive/MyDrive/QS/WTB/collection.tsv",
            config=config,
        )

        checkpoint_path = trainer.train(checkpoint=checkpoint)
        print(f"Saved checkpoint to {checkpoint_path}...")

However, the process seems stuck at trainer.train, with the only output being #> Starting.... On forcefully stopping execution of the cell, this is the traceback it shows:

KeyboardInterrupt                         Traceback (most recent call last)
<ipython-input-8-5f9b6f9a73c7> in <cell line: 19>()
     26     )
     27 
---> 28     checkpoint_path = trainer.train(checkpoint=checkpoint)
     29     print(f"Saved checkpoint to {checkpoint_path}...")
     30 

/content/ColBERT/colbert/trainer.py in train(self, checkpoint)
     29         launcher = Launcher(train)
     30 
---> 31         self._best_checkpoint_path = launcher.launch(self.config, self.triples, self.queries, self.collection)
     32 
     33 

/content/ColBERT/colbert/infra/launcher.py in launch(self, custom_config, *args)
     75         # TODO: If the processes crash upon join, raise an exception and don't block on .get() below!
     76         print("BB", all_procs)
---> 77         return_values = sorted([return_value_queue.get() for _ in all_procs])
     78         return_values = [val for rank, val in return_values]
     79 

/content/ColBERT/colbert/infra/launcher.py in <listcomp>(.0)
     75         # TODO: If the processes crash upon join, raise an exception and don't block on .get() below!
     76         print("BB", all_procs)
---> 77         return_values = sorted([return_value_queue.get() for _ in all_procs])
     78         return_values = [val for rank, val in return_values]
     79 

/usr/lib/python3.10/multiprocessing/queues.py in get(self, block, timeout)
    101         if block and timeout is None:
    102             with self._rlock:
--> 103                 res = self._recv_bytes()
    104             self._sem.release()
    105         else:

/usr/lib/python3.10/multiprocessing/connection.py in recv_bytes(self, maxlength)
    214         if maxlength is not None and maxlength < 0:
    215             raise ValueError("negative maxlength")
--> 216         buf = self._recv_bytes(maxlength)
    217         if buf is None:
    218             self._bad_message_length()

/usr/lib/python3.10/multiprocessing/connection.py in _recv_bytes(self, maxsize)
    412 
    413     def _recv_bytes(self, maxsize=None):
--> 414         buf = self._recv(4)
    415         size, = struct.unpack("!i", buf.getvalue())
    416         if size == -1:

/usr/lib/python3.10/multiprocessing/connection.py in _recv(self, size, read)
    377         remaining = size
    378         while remaining > 0:
--> 379             chunk = read(handle, remaining)
    380             n = len(chunk)
    381             if n == 0:

KeyboardInterrupt:
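
The TODO visible in launcher.py above hints at a plausible failure mode: launch() blocks on return_value_queue.get(), so if a child training process crashes before putting its result on the queue, the parent hangs forever with no error. Below is a minimal standalone sketch of that behavior (plain Python multiprocessing, not ColBERT code; the worker function and its simulated crash are hypothetical):

import multiprocessing as mp

def worker(return_value_queue):
    # Simulated crash: the child dies before it ever calls put(),
    # e.g. a CUDA/driver initialization failure inside the trainer.
    raise RuntimeError("crashed before returning a result")
    return_value_queue.put((0, "path/to/checkpoint"))  # never reached

if __name__ == '__main__':
    mp.set_start_method('spawn', force=True)
    return_value_queue = mp.Queue()
    proc = mp.Process(target=worker, args=(return_value_queue,))
    proc.start()
    # Mirrors line 77 of launcher.py: this get() blocks forever,
    # because the dead child never put anything on the queue.
    return_values = [return_value_queue.get() for _ in [proc]]
    proc.join()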

Is there an issue with multiprocessing on Colab? I am also training on a very small subset of the data, so data size should not be an issue.
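
If the root cause is a silently crashed child rather than Colab's multiprocessing itself, one possible mitigation (a sketch only; collect_results is a hypothetical helper, while all_procs and return_value_queue are the names from launcher.py in the traceback above) would be to poll the queue with a timeout and fail loudly once every child has died:

import queue

def collect_results(return_value_queue, all_procs, timeout=30):
    # Poll with a timeout instead of blocking indefinitely on get(),
    # and raise if all children exited without returning a result.
    results = []
    while len(results) < len(all_procs):
        try:
            results.append(return_value_queue.get(timeout=timeout))
        except queue.Empty:
            if not any(p.is_alive() for p in all_procs):
                raise RuntimeError("All training subprocesses exited without returning a result")
    return sorted(results)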