stanford-futuredata / ColBERT

ColBERT: state-of-the-art neural search (SIGIR'20, TACL'21, NeurIPS'21, NAACL'22, CIKM'22, ACL'23, EMNLP'23)
MIT License

Multiprocessing hangs forever on .get() with mp.Queue() (Possible Deadlock) #264

Open abhiram1809 opened 12 months ago

abhiram1809 commented 12 months ago

This bug came up when I tried to train ColBERT on a custom dataset. Training was taking forever, so I tried diagnosing the problem. It seems that ColBERT uses torch.multiprocessing to divide tasks, but whenever a task queue is created, the code gets stuck on the get() method.

# Sample code to reproduce the problem
import torch
import torch.multiprocessing as mp

try:
    mp.set_start_method('spawn', force=True)
except RuntimeError:
    print('Hello')

return_value_queue = mp.Queue()
# return_values = sorted([return_value_queue.get() for _ in all_procs])  # The code gets stuck here
print(return_value_queue.get())  # To reproduce
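
One caveat about the snippet above: nothing ever puts an item on the queue, so `get()` blocks indefinitely by design, whether or not there is a deadlock in the training code. A minimal sketch of a bounded wait that makes this visible, using the standard-library `multiprocessing` module (whose `Queue` interface `torch.multiprocessing` is documented to mirror). The helper name `get_or_none` and the one-second timeout are illustrative assumptions, not part of ColBERT:

```python
import multiprocessing as mp
import queue


def get_or_none(q, timeout=1.0):
    """Return the next item from q, or None if nothing arrives within `timeout` seconds."""
    try:
        return q.get(timeout=timeout)
    except queue.Empty:
        # multiprocessing queues raise queue.Empty when the timeout expires
        return None


if __name__ == "__main__":
    ctx = mp.get_context("spawn")
    q = ctx.Queue()
    print(get_or_none(q))  # the queue is empty, so this gives up after the timeout
    q.put("done")
    print(get_or_none(q))  # an item is available now
```

If the timed `get()` also comes back empty during real training, the more likely explanation is that no worker ever produced a result, which is worth checking before suspecting the queue itself.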

### Versions

torch version = 1.13.1+cu117

Occurs while training:

from colbert import Trainer
from colbert.infra import Run, RunConfig, ColBERTConfig

with Run().context(RunConfig(nranks=1, experiment="notebook")):
    config = ColBERTConfig(
        bsize=32,
        root="experiments",
    )
    trainer = Trainer(
        triples="triples.tsv",
        queries="queries.tsv",
        collection="collection.tsv",
        config=config,
    )
    checkpoint_path = trainer.train(checkpoint='colbert-ir/colbertv2.0')
    print(f"Saved checkpoint to {checkpoint_path}...")

okhat commented 11 months ago

Maybe an OOM error, so the child process dies?
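
If the child does die (from OOM or anything else), a bare `get()` in the parent never returns, because no result will ever arrive. One way to test this hypothesis is to poll the queue with a timeout and check the child's exit code between polls. The sketch below uses the standard-library `multiprocessing` API rather than ColBERT's launcher; `crashing_worker` and `wait_for_result` are hypothetical names introduced only for illustration:

```python
import multiprocessing as mp
import os
import queue


def crashing_worker(q):
    """Hypothetical worker that dies before reporting a result (stands in for an OOM kill)."""
    os._exit(1)


def wait_for_result(proc, q, poll_seconds=0.5):
    """Poll q for a result; raise instead of blocking forever if proc exits without one."""
    while True:
        try:
            return q.get(timeout=poll_seconds)
        except queue.Empty:
            if proc.is_alive():
                continue  # worker is still running, keep waiting
        # The worker has exited: wait once more in case its result was still in flight
        try:
            return q.get(timeout=poll_seconds)
        except queue.Empty:
            raise RuntimeError(
                f"worker exited with code {proc.exitcode} and no result"
            ) from None


if __name__ == "__main__":
    ctx = mp.get_context("spawn")
    q = ctx.Queue()
    proc = ctx.Process(target=crashing_worker, args=(q,))
    proc.start()
    try:
        print(wait_for_result(proc, q))
    except RuntimeError as err:
        print(err)
    finally:
        proc.join()
```

A negative `exitcode` means the process was terminated by a signal; on Linux, a process killed by the kernel's OOM killer receives SIGKILL and would report `-9`.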

CosimoRulli commented 3 months ago

Hey, I am facing the same problem, I guess. Did you find any workaround? @abhiram1809