stanford-futuredata / ColBERT

ColBERT: state-of-the-art neural search (SIGIR'20, TACL'21, NeurIPS'21, NAACL'22, CIKM'22, ACL'23, EMNLP'23)
MIT License

Regarding ColBERT Training Data #153

Open LakshKD opened 1 year ago

LakshKD commented 1 year ago

Hi Team,

I was going through the ColBERT v1 code and was able to run it on triples.train.small.tsv (from MS MARCO). I have a question about the training file format: is it possible to train the ColBERT v1 model by providing only the query and the positive document, and letting the model draw negatives from random passages in the batch? I want to train it from scratch on my own dataset and would like to supply just the query_text and positive_document_text, with the model creating the negatives during training.

Looking forward to your response. Thanks for this wonderful code and awesome paper :)
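
For reference, this is roughly the shape of the data I have versus what the trainer expects (a minimal sketch; the passages are made up, and I'm assuming the same tab-separated query/positive/negative layout as triples.train.small.tsv):

```python
# Sketch of the MS MARCO-style triples format: one training example per line,
# laid out as "query \t positive_passage \t negative_passage".
# My dataset only has the first two columns, hence the question above.
rows = [
    ("what is colbert",
     "ColBERT is a late-interaction neural retrieval model ...",   # positive
     "Stephen Colbert is an American television host ..."),        # negative
]

with open("triples.train.tsv", "w", encoding="utf-8") as f:
    for query, positive, negative in rows:
        f.write(f"{query}\t{positive}\t{negative}\n")
```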

LakshKD commented 1 year ago

I also have another question, about using a custom tokenizer:

I want to use a custom SentencePiece tokenizer, say if I want to train ColBERT on languages other than English. Is that possible?

Thanks

okhat commented 1 year ago

You do need to select the negatives before training, but we have a utility/ directory with scripts that create triples for you.

ColBERTv2 now supports any encoder, so non-English encoders will work fine. You just need to pass checkpoint='name-of-encoder' instead of bert-base-uncased.
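
For example, something along these lines (an illustrative sketch, not a tested recipe; the paths and the xlm-roberta-base encoder name are placeholders for whatever you actually use):

```python
from colbert import Trainer
from colbert.infra import Run, RunConfig, ColBERTConfig

if __name__ == "__main__":
    with Run().context(RunConfig(nranks=1, experiment="multilingual")):
        config = ColBERTConfig(bsize=32, root="./experiments")
        trainer = Trainer(
            triples="./data/triples.train.tsv",     # your own triples
            queries="./data/queries.train.tsv",
            collection="./data/collection.tsv",
            config=config,
        )
        # Pass the (non-English) encoder name here instead of bert-base-uncased.
        checkpoint_path = trainer.train(checkpoint="xlm-roberta-base")
        print(checkpoint_path)
```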

lifelongeek commented 1 year ago

@okhat Could you provide example arguments for utility/triples.py? I can see the arguments --ranking, --output, --positives, --depth, --permissive, --biased, and --seed, and it would help a lot to see an example value for each of them.
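
For context, this is my current best guess at how the arguments fit together (every path and value below is a placeholder I made up, so I may well be misusing some of them):

```python
import subprocess

# Hypothetical invocation of utility/triples.py; the argument names come from the
# script's CLI, but the values are guesses and may not match the expected formats.
subprocess.run([
    "python", "utility/triples.py",
    "--ranking", "runs/msmarco.train.rankings.tsv",  # rankings for the training queries?
    "--positives", "data/qrels.train.tsv",           # known positives per query?
    "--output", "data/triples.train.jsonl",          # where the generated triples go
    "--depth", "100",                                # how deep to sample negatives from?
    "--seed", "12345",
], check=True)
```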

salbatarni commented 5 months ago

@okhat I see a docstring in trainer.py saying: "Note that config.checkpoint is ignored. Only the supplied checkpoint here is used." So is the checkpoint you are talking about different from config.checkpoint?
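
To make the question concrete, this is roughly the situation (a sketch; the model name and paths are just examples):

```python
from colbert import Trainer
from colbert.infra import Run, RunConfig, ColBERTConfig

if __name__ == "__main__":
    with Run().context(RunConfig(nranks=1, experiment="example")):
        # config.checkpoint is set here ...
        config = ColBERTConfig(checkpoint="bert-base-multilingual-cased", bsize=32)
        trainer = Trainer(
            triples="data/triples.train.tsv",
            queries="data/queries.train.tsv",
            collection="data/collection.tsv",
            config=config,
        )
        # ... but train() also accepts its own checkpoint argument.
        # Per the docstring, only this one is used and config.checkpoint is ignored?
        trainer.train(checkpoint="bert-base-multilingual-cased")
```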