stanford-futuredata / ColBERT

ColBERT: state-of-the-art neural search (SIGIR'20, TACL'21, NeurIPS'21, NAACL'22, CIKM'22, ACL'23, EMNLP'23)
MIT License

Incremental training on ColBERT #255

Open SuradhyakshaDasa opened 1 year ago

SuradhyakshaDasa commented 1 year ago

    import time

    import torch

    from colbert import Trainer
    from colbert.infra import Run, RunConfig, ColBERTConfig

    if __name__ == '__main__':
        with Run().context(RunConfig(nranks=1, experiment="msmarco")):
            device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

            config = ColBERTConfig(
                lr=1e-5,
                root="experiments",
                # device=device,
            )

            start = time.time()
            trainer = Trainer(
                triples="QP_output-v1.jsonl",
                queries="QP_query_train.tsv",
                collection="QP_collection_train.tsv",
                config=config,
            )
            checkpoint_path = trainer.train()
            print(f'Time taken for training: {time.time() - start}')

For example, I train a model on 100K records using the code above. Then I want to use the newly trained model as the base model and train on the next 100K records. Since my dataset is very large, I would like to train incrementally like this, shard by shard.
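A minimal sketch of what I mean, assuming the triples are pre-split into shards (the shard filenames here are hypothetical) and that each round is seeded with the previous round's checkpoint through the `checkpoint` argument of `Trainer.train()`, as in the repo's README (`trainer.train(checkpoint='colbert-ir/colbertv2.0')`):

    from colbert import Trainer
    from colbert.infra import Run, RunConfig, ColBERTConfig

    # Hypothetical shards: each triples file covers the next 100K records.
    TRIPLE_SHARDS = ["QP_output-part0.jsonl", "QP_output-part1.jsonl", "QP_output-part2.jsonl"]

    if __name__ == '__main__':
        checkpoint = "bert-base-uncased"  # base model for the first round

        for shard in TRIPLE_SHARDS:
            with Run().context(RunConfig(nranks=1, experiment="msmarco")):
                config = ColBERTConfig(lr=1e-5, root="experiments")
                trainer = Trainer(
                    triples=shard,
                    queries="QP_query_train.tsv",
                    collection="QP_collection_train.tsv",
                    config=config,
                )
                # Seed this round with the checkpoint from the previous round.
                trainer.train(checkpoint=checkpoint)
                # Assumes the installed version exposes this accessor; otherwise
                # use the checkpoint path saved under experiments/msmarco/.
                checkpoint = trainer.best_checkpoint_path()
                print(f"Finished {shard}; latest checkpoint: {checkpoint}")

Note that each `train()` call starts a fresh optimizer and warmup schedule, so this would be sequential fine-tuning on successive shards rather than a true resumption of a single long run. Is this the recommended way to do it, or is there a better-supported path?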