stanford-futuredata / ColBERT

ColBERT: state-of-the-art neural search (SIGIR'20, TACL'21, NeurIPS'21, NAACL'22, CIKM'22, ACL'23, EMNLP'23)
MIT License

Issue: Training "resume" and "resume_optimizer" implementation was removed #307

Open eercanayar opened 4 months ago

eercanayar commented 4 months ago

Hello all,

I would like to resume training from the last checkpoint and last batch ID to handle training interruptions. I see some remnants of a possible implementation here, but they're commented out.

https://github.com/stanford-futuredata/ColBERT/blob/7be0114f00dc938aca4a3a5929bef5bbb99485e6/colbert/training/training.py#L81-L83
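For context, a resume block at that point in the training loop might look roughly like the sketch below. This is a minimal, hypothetical reconstruction assuming a standard PyTorch checkpoint dict; the function name `maybe_resume` and the checkpoint keys are my own, not ColBERT's actual API.

```python
# Hypothetical sketch (NOT ColBERT's actual code): restore model and
# optimizer state from a checkpoint dict previously written with torch.save.
import torch


def maybe_resume(model, optimizer, checkpoint_path, resume=False, resume_optimizer=False):
    """Load saved state; return the batch index to resume from (0 if starting fresh)."""
    if not resume:
        return 0

    checkpoint = torch.load(checkpoint_path, map_location='cpu')
    model.load_state_dict(checkpoint['model_state_dict'])

    # --resume restores the weights; --resume_optimizer additionally restores
    # optimizer state (e.g., AdamW moment estimates), which matters for
    # continuing training smoothly rather than restarting optimization.
    if resume_optimizer:
        optimizer.load_state_dict(checkpoint['optimizer_state_dict'])

    return checkpoint.get('batch_idx', 0)
```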

Also, #43 mentions that resume_optimizer is implemented; however, there is no other reference to the parsed argument.

grep -r "resume_optimizer" .
./colbert/utils/parser.py:        # NOTE: Providing a checkpoint is one thing, --resume is another, --resume_optimizer is yet another.
./colbert/utils/parser.py:        self.add_argument('--resume_optimizer', dest='resume_optimizer', default=False, action='store_true')

So, it seems this feature was dropped after these partial implementations. I tried to dig into this, and found that it was removed in the commit (October 13th, 2021 7:40 PM) "Initial commit with the new API and residual compression" by @okhat. Reference: https://github.com/stanford-futuredata/ColBERT/blame/7be0114f00dc938aca4a3a5929bef5bbb99485e6/colbert/training/training.py#L81-L83

Could you help me understand how I can implement resume and resume_optimizer again? That way I can handle training interruptions in my pipeline, and also contribute back to the repository with examples.
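The save-side counterpart would also be needed: writing the batch index and optimizer state alongside the model weights, and fast-forwarding the batch iterator on restart. A minimal sketch under the same assumptions as above (all names here are illustrative, not ColBERT's actual API):

```python
# Hypothetical sketch (NOT ColBERT's actual code): save a resumable
# checkpoint and skip batches that were already trained on.
import itertools
import torch


def save_checkpoint(path, batch_idx, model, optimizer):
    """Persist everything needed to resume: weights, optimizer state, position."""
    torch.save({
        'batch_idx': batch_idx,
        'model_state_dict': model.state_dict(),
        'optimizer_state_dict': optimizer.state_dict(),
    }, path)


def skip_to(batches, start_batch):
    """Fast-forward a batch iterator past the first start_batch batches."""
    return itertools.islice(batches, start_batch, None)
```

On resume, the training loop would call `skip_to(reader, start_batch)` with the batch index loaded from the checkpoint, so the data order stays consistent with the interrupted run (assuming the reader is deterministic).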