Open lboesen opened 2 years ago
Does this have something to do with the --fp16 flag when training the model?
Hi @lboesen, the ColBERT example here is currently only for training the model; it hasn't been tested for retrieval. ColBERT is a multi-vector retrieval model, so inference/search is not yet supported by Tevatron. I'd suggest following the original ColBERT repo to train the model and run search: https://github.com/stanford-futuredata/ColBERT
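To illustrate why a multi-vector model does not fit a single-vector search pipeline, here is a minimal sketch of ColBERT-style late-interaction (MaxSim) scoring. The function name `maxsim_score` and the toy embeddings are made up for illustration; this is not code from Tevatron or the ColBERT repo.

```python
import numpy as np

def maxsim_score(query_vecs: np.ndarray, doc_vecs: np.ndarray) -> float:
    """Late-interaction score: for each query token, take the similarity of
    its best-matching document token, then sum those maxima."""
    # sim[i, j] = dot product between query token i and document token j
    sim = query_vecs @ doc_vecs.T
    return float(sim.max(axis=1).sum())

# Toy example with made-up 2-dimensional token embeddings.
query = np.array([[1.0, 0.0], [0.0, 1.0]], dtype=np.float32)            # 2 query tokens
doc = np.array([[0.5, 0.0], [0.0, 2.0], [1.0, 1.0]], dtype=np.float32)  # 3 doc tokens
score = maxsim_score(query, doc)
print(score)
```

Because each passage is a matrix of token vectors instead of one vector, a plain nearest-neighbour lookup over one embedding per passage cannot compute this score directly.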
Thank you for your quick reply. Yes, I will have a look at the original ColBERT repo.
Do you by any chance know whether the ColBERT training parameters set in Tevatron give an effectiveness score equal to that of the original ColBERT, which reports MRR@10 = 36.0?
Hi,
I experienced issues when working with the colbert example. I trained the model as per: https://github.com/texttron/tevatron/tree/main/examples/colbert
I then encoded the corpus and queries:
corpus:
python -m tevatron.driver.encode \
  --output_dir=temp \
  --model_name_or_path bert-base-uncased \
  --fp16 \
  --per_device_eval_batch_size 156 \
  --p_max_len 128 \
  --dataset_name Tevatron/msmarco-passage-corpus \
  --encoded_save_path /corpus_emb_colbert/ \
  --encode_num_shard 20 \
  --encode_shard_index {s}
queries:
python -m tevatron.driver.encode \
  --output_dir=temp \
  --model_name_or_path bert-base-uncased \
  --fp16 \
  --per_device_eval_batch_size 156 \
  --encode_is_qry \
  --q_max_len 32 \
  --dataset_name Tevatron/msmarco-passage/dev \
  --encoded_save_path /queries_emb.tsv
When trying to index using:
python -m pyserini.index.lucene \
  --collection JsonVectorCollection \
  --input /model_runs/corpus_emb_colbert \
  --index /model_runs/index_colbert \
  --generator DefaultLuceneDocumentGenerator \
  --threads 12 \
  --impact --pretokenized --optimize
it failed with the following message:
2022-11-15 09:06:18,438 ERROR [pool-2-thread-1] index.IndexCollection$LocalIndexerThread (IndexCollection.java:216) - pool-2-thread-1: Unexpected Exception:
com.fasterxml.jackson.core.JsonParseException: Unexpected character ('�' (code 65533 / 0xfffd)): expected a valid value (JSON String, Number, Array, Object or token 'null', 'true' or 'false')
at [Source: (BufferedReader); line: 1, column: 2]
at com.fasterxml.jackson.core.JsonParser._constructError(JsonParser.java:2337) ~[anserini-0.15.0-fatjar.jar:?]
at com.fasterxml.jackson.core.base.ParserMinimalBase._reportError(ParserMinimalBase.java:710) ~[anserini-0.15.0-fatjar.jar:?]
at com.fasterxml.jackson.core.base.ParserMinimalBase._reportUnexpectedChar(ParserMinimalBase.java:635) ~[anserini-0.15.0-fatjar.jar:?]
at com.fasterxml.jackson.core.json.ReaderBasedJsonParser._handleOddValue(ReaderBasedJsonParser.java:1952) ~[anserini-0.15.0-fatjar.jar:?]
at com.fasterxml.jackson.core.json.ReaderBasedJsonParser.nextToken(ReaderBasedJsonParser.java:781) ~[anserini-0.15.0-fatjar.jar:?]
at com.fasterxml.jackson.databind.ObjectReader.readValues(ObjectReader.java:1874) ~[anserini-0.15.0-fatjar.jar:?]
at io.anserini.collection.JsonCollection$Segment.<init>(JsonCollection.java:107) ~[anserini-0.15.0-fatjar.jar:?]
at io.anserini.collection.JsonVectorCollection$Segment.<init>(JsonVectorCollection.java:39) ~[anserini-0.15.0-fatjar.jar:?]
at io.anserini.collection.JsonVectorCollection.createFileSegment(JsonVectorCollection.java:34) ~[anserini-0.15.0-fatjar.jar:?]
at io.anserini.index.IndexCollection$LocalIndexerThread.run(IndexCollection.java:151) [anserini-0.15.0-fatjar.jar:?]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) [?:?]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) [?:?]
at java.lang.Thread.run(Thread.java:829) [?:?]
2022-11-15 09:06:18,438 ERROR [pool-2-thread-11] index.IndexCollection$LocalIndexerThread (IndexCollection.java:216) - pool-2-thread-11: Unexpected Exception:
com.fasterxml.jackson.core.JsonParseException: Unexpected character ('�' (code 65533 / 0xfffd)): expected a valid value (JSON String, Number, Array, Object or token 'null', 'true' or 'false')
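The parser rejects the very first characters of the file (line 1, column 2), which suggests the files in the input directory are not JSON lines at all. One possible explanation, which is an assumption and not confirmed in this thread, is that the encoder writes binary (for example pickled) output. A quick check along these lines could confirm it; the helper name `first_line_is_json` is hypothetical:

```python
import json

def first_line_is_json(path: str) -> bool:
    """Return True if the first line of the file decodes as UTF-8 and parses as JSON."""
    with open(path, "rb") as f:
        first_line = f.readline()
    try:
        json.loads(first_line.decode("utf-8"))
        return True
    except (UnicodeDecodeError, json.JSONDecodeError):
        # Binary content (or an empty file) ends up here.
        return False

# Usage (path is whatever shard file the indexer was pointed at):
# print(first_line_is_json("/model_runs/corpus_emb_colbert/<some shard file>"))
```

If this returns False for the encoded shards, the files would need to be converted to the JSON format expected by JsonVectorCollection before indexing.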
I then tried using the tevatron.faiss_retriever as described in your guidelines for dense retrieval:
python -m tevatron.faiss_retriever \
  --query_reps /home/fdt672/model_runs/queries_embcolbert{train_split}/queries_emb_train_split_20.tsv \
  --passage_reps /home/fdt672/model_runs/corpus_embcolbert{train_split}/'*.jsonl' \
  --depth 100 \
  --batch_size -1 \
  --save_text \
  --save_ranking_to /home/fdt672/model_runs/rankcolbert{train_split}.txt
But it also faulted with:
Traceback (most recent call last):
  File "/home/fdt672/anaconda3/envs/myenv/lib/python3.9/runpy.py", line 197, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/home/fdt672/anaconda3/envs/myenv/lib/python3.9/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/home/fdt672/git/MT_code/Master_Thesis_temp/src/tevatron/src/tevatron/faiss_retriever/__main__.py", line 91, in <module>
    main()
  File "/home/fdt672/git/MT_code/Master_Thesis_temp/src/tevatron/src/tevatron/faiss_retriever/__main__.py", line 74, in main
    retriever.add(p_reps)
  File "/home/fdt672/git/MT_code/Master_Thesis_temp/src/tevatron/src/tevatron/faiss_retriever/retriever.py", line 16, in add
    self.index.add(p_reps)
  File "/home/fdt672/anaconda3/envs/myenv/lib/python3.9/site-packages/faiss/__init__.py", line 215, in replacement_add
    self.add_c(n, swig_ptr(x))
  File "/home/fdt672/anaconda3/envs/myenv/lib/python3.9/site-packages/faiss/swigfaiss_avx2.py", line 1618, in add
    return _swigfaiss_avx2.IndexFlatCodes_add(self, n, x)
TypeError: in method 'IndexFlatCodes_add', argument 3 of type 'float const *'
Solution: As I understand it, the issue was that the values need to be float32 rather than float16. After I made these changes to faiss_retriever/retriever.py (in the tevatron library),
tevatron.faiss_retriever worked.
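The exact changes are not shown in this post. As a rough sketch of the kind of cast being described, assuming the embeddings are held in a NumPy array, the conversion could look like this. The helper name `to_faiss_float32` and the random data are illustrative only and are not taken from tevatron:

```python
import numpy as np

def to_faiss_float32(reps: np.ndarray) -> np.ndarray:
    """Return the embeddings as a C-contiguous float32 array."""
    return np.ascontiguousarray(reps, dtype=np.float32)

# Stand-in for embeddings that were saved in half precision (e.g. with --fp16).
p_reps_fp16 = np.random.rand(4, 8).astype(np.float16)
p_reps = to_faiss_float32(p_reps_fp16)
print(p_reps.dtype, p_reps.shape)

# The converted array could then be handed to the index,
# e.g. self.index.add(p_reps) inside the retriever's add method.
```

The same cast would presumably be needed for the query representations before searching, since they were also encoded with --fp16.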
I am not sure whether this is a good solution, but it solved my current issues with the ColBERT example.
Ideally, I would like to build an index with my ColBERT model using pyserini.index.lucene. Do you have any suggestions for this?
Thanks a lot in advance :)