texttron / tevatron

Tevatron - A flexible toolkit for neural retrieval research and development.
http://tevatron.ai
Apache License 2.0

Issues indexing colbert example - pyserini.index.lucene(not solved) and tevatron.faiss_retriever(solved) #61

Open lboesen opened 1 year ago

lboesen commented 1 year ago

Hi,

I experienced issues when working with the colbert example. I trained the model as per: https://github.com/texttron/tevatron/tree/main/examples/colbert

I then encoded the corpus and queries:

corpus:

python -m tevatron.driver.encode \
  --output_dir=temp \
  --model_name_or_path bert-base-uncased \
  --fp16 \
  --per_device_eval_batch_size 156 \
  --p_max_len 128 \
  --dataset_name Tevatron/msmarco-passage-corpus \
  --encoded_save_path /corpus_emb_colbert/ \
  --encode_num_shard 20 \
  --encode_shard_index {s}

queries:

python -m tevatron.driver.encode \
  --output_dir=temp \
  --model_name_or_path bert-base-uncased \
  --fp16 \
  --per_device_eval_batch_size 156 \
  --encode_is_qry \
  --q_max_len 32 \
  --dataset_name Tevatron/msmarco-passage/dev \
  --encoded_save_path /queries_emb.tsv

When trying to index using:

python -m pyserini.index.lucene \
  --collection JsonVectorCollection \
  --input /model_runs/corpus_emb_colbert \
  --index /model_runs/index_colbert \
  --generator DefaultLuceneDocumentGenerator \
  --threads 12 \
  --impact --pretokenized --optimize

it failed with the following message:

2022-11-15 09:06:18,438 ERROR [pool-2-thread-1] index.IndexCollection$LocalIndexerThread (IndexCollection.java:216) - pool-2-thread-1: Unexpected Exception: com.fasterxml.jackson.core.JsonParseException: Unexpected character ('�' (code 65533 / 0xfffd)): expected a valid value (JSON String, Number, Array, Object or token 'null', 'true' or 'false')
 at [Source: (BufferedReader); line: 1, column: 2]
    at com.fasterxml.jackson.core.JsonParser._constructError(JsonParser.java:2337) ~[anserini-0.15.0-fatjar.jar:?]
    at com.fasterxml.jackson.core.base.ParserMinimalBase._reportError(ParserMinimalBase.java:710) ~[anserini-0.15.0-fatjar.jar:?]
    at com.fasterxml.jackson.core.base.ParserMinimalBase._reportUnexpectedChar(ParserMinimalBase.java:635) ~[anserini-0.15.0-fatjar.jar:?]
    at com.fasterxml.jackson.core.json.ReaderBasedJsonParser._handleOddValue(ReaderBasedJsonParser.java:1952) ~[anserini-0.15.0-fatjar.jar:?]
    at com.fasterxml.jackson.core.json.ReaderBasedJsonParser.nextToken(ReaderBasedJsonParser.java:781) ~[anserini-0.15.0-fatjar.jar:?]
    at com.fasterxml.jackson.databind.ObjectReader.readValues(ObjectReader.java:1874) ~[anserini-0.15.0-fatjar.jar:?]
    at io.anserini.collection.JsonCollection$Segment.<init>(JsonCollection.java:107) ~[anserini-0.15.0-fatjar.jar:?]
    at io.anserini.collection.JsonVectorCollection$Segment.<init>(JsonVectorCollection.java:39) ~[anserini-0.15.0-fatjar.jar:?]
    at io.anserini.collection.JsonVectorCollection.createFileSegment(JsonVectorCollection.java:34) ~[anserini-0.15.0-fatjar.jar:?]
    at io.anserini.index.IndexCollection$LocalIndexerThread.run(IndexCollection.java:151) [anserini-0.15.0-fatjar.jar:?]
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) [?:?]
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) [?:?]
    at java.lang.Thread.run(Thread.java:829) [?:?]
2022-11-15 09:06:18,438 ERROR [pool-2-thread-11] index.IndexCollection$LocalIndexerThread (IndexCollection.java:216) - pool-2-thread-11: Unexpected Exception: com.fasterxml.jackson.core.JsonParseException: Unexpected character ('�' (code 65533 / 0xfffd)): expected a valid value (JSON String, Number, Array, Object or token 'null', 'true' or 'false')
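
A note on reading this error: the parser rejects the very first characters of the file, which usually means the input is not JSON text at all. One possible explanation, stated here as an assumption rather than something confirmed in this thread, is that the encoder output is a binary file (for example a pickle), whose bytes a text reader would typically decode as the replacement character U+FFFD. The sketch below is a hypothetical check (`first_line_is_json` and the demo file names and fields are illustrative only) that tests whether a file's first line is valid UTF-8 JSON before handing it to an indexer.

```python
import json
import os
import pickle
import tempfile


def first_line_is_json(path: str) -> bool:
    """Return True if the first line of `path` decodes as UTF-8 and parses as JSON."""
    with open(path, "rb") as f:
        first_line = f.readline()
    try:
        json.loads(first_line.decode("utf-8"))
        return True
    except (UnicodeDecodeError, json.JSONDecodeError):
        return False


# Demo with two throwaway files: one JSON-lines file and one pickled (binary) file.
with tempfile.TemporaryDirectory() as tmp:
    json_path = os.path.join(tmp, "doc.jsonl")
    with open(json_path, "w", encoding="utf-8") as f:
        # Illustrative record only; not the exact schema any indexer requires.
        f.write(json.dumps({"id": "0", "vector": {"tok": 1}}) + "\n")

    bin_path = os.path.join(tmp, "emb.bin")
    with open(bin_path, "wb") as f:
        pickle.dump(([0.1, 0.2], ["0"]), f)

    json_ok = first_line_is_json(json_path)
    binary_ok = first_line_is_json(bin_path)

print(json_ok, binary_ok)
```

If the check fails for the encoded corpus files, the indexer is being pointed at data in a format it cannot read, independent of any fp16 setting.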

I then tried using the tevatron.faiss_retriever as described in your guidelines for dense retrieval:

python -m tevatron.faiss_retriever \
  --query_reps /home/fdt672/model_runs/queries_embcolbert{train_split}/queries_emb_train_split_20.tsv \
  --passage_reps /home/fdt672/model_runs/corpus_embcolbert{train_split}/'*.jsonl' \
  --depth 100 \
  --batch_size -1 \
  --save_text \
  --save_ranking_to /home/fdt672/model_runs/rankcolbert{train_split}.txt

But it also failed with:

Traceback (most recent call last):
  File "/home/fdt672/anaconda3/envs/myenv/lib/python3.9/runpy.py", line 197, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/home/fdt672/anaconda3/envs/myenv/lib/python3.9/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/home/fdt672/git/MT_code/Master_Thesis_temp/src/tevatron/src/tevatron/faiss_retriever/__main__.py", line 91, in <module>
    main()
  File "/home/fdt672/git/MT_code/Master_Thesis_temp/src/tevatron/src/tevatron/faiss_retriever/__main__.py", line 74, in main
    retriever.add(p_reps)
  File "/home/fdt672/git/MT_code/Master_Thesis_temp/src/tevatron/src/tevatron/faiss_retriever/retriever.py", line 16, in add
    self.index.add(p_reps)
  File "/home/fdt672/anaconda3/envs/myenv/lib/python3.9/site-packages/faiss/__init__.py", line 215, in replacement_add
    self.add_c(n, swig_ptr(x))
  File "/home/fdt672/anaconda3/envs/myenv/lib/python3.9/site-packages/faiss/swigfaiss_avx2.py", line 1618, in add
    return _swigfaiss_avx2.IndexFlatCodes_add(self, n, x)
TypeError: in method 'IndexFlatCodes_add', argument 3 of type 'float const *'

Solution: As I understand it, the issue was that the values need to be float32, not float16. So when I made these changes to faiss_retriever/retriever.py (in the tevatron library)

import faiss
import numpy as np


class BaseFaissIPRetriever:
    def __init__(self, init_reps: np.ndarray):
        # Flat inner-product index sized to the embedding dimension
        index = faiss.IndexFlatIP(init_reps.shape[1])
        self.index = index

    def add(self, p_reps: np.ndarray):
        # FAISS expects float32 input; cast the passage representations
        p_reps_float32 = p_reps.astype(np.float32)  # <------- issues with float16
        self.index.add(p_reps_float32)

    def search(self, q_reps: np.ndarray, k: int):
        # Cast the query representations to float32 as well
        q_reps_float32 = q_reps.astype(np.float32)  # <------- issues with float16
        return self.index.search(q_reps_float32, k)
    .....

the tevatron.faiss_retriever worked.

I am not sure if this is a good solution, but it solved my current issues with the colbert example (..?)
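
For reference, the same cast can be sketched as a standalone helper that uses only NumPy, so it can be tried without FAISS installed. The name `to_faiss_input` is hypothetical and not part of tevatron or faiss. It also makes the array C-contiguous, a layout the FAISS Python wrappers generally expect; treat that part as an assumption about your FAISS version.

```python
import numpy as np


def to_faiss_input(reps: np.ndarray) -> np.ndarray:
    """Return `reps` as a C-contiguous float32 array (copying only if needed)."""
    return np.ascontiguousarray(reps, dtype=np.float32)


# Representations saved with --fp16 would typically arrive as float16.
p_reps_fp16 = np.random.rand(4, 8).astype(np.float16)
p_reps = to_faiss_input(p_reps_fp16)

print(p_reps_fp16.dtype, "->", p_reps.dtype)
```

Casting float16 to float32 is lossless, so the cast itself should not change the stored values; it only changes the dtype that the index receives.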

Ideally, I would like to build an index with my colbert model using pyserini.index.lucene. Do you have any suggestions for this?

Thanks a lot in advance :)

lboesen commented 1 year ago

Does this have something to do with the --fp16 flag when training the model?

MXueguang commented 1 year ago

Hi @lboesen, The colbert example here is only for training the model right now. It hasn't been tested for retrieval. Colbert is a multi-vector retrieval model, so the inference/search is not supported by tevatron yet. I'd suggest following the original ColBERT repo to train the model and do search https://github.com/stanford-futuredata/ColBERT
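
To illustrate why a multi-vector model does not fit a single-vector index: ColBERT-style late interaction scores a query against a passage by taking, for each query token vector, its maximum similarity over all passage token vectors, then summing those maxima. The following is a minimal NumPy sketch of that scoring rule on toy data. The function name `maxsim_score` and the example vectors are illustrative only, and this is not the implementation used by tevatron or the ColBERT repo.

```python
import numpy as np


def maxsim_score(q_tok: np.ndarray, d_tok: np.ndarray) -> float:
    """Late-interaction score: sum over query tokens of the max dot product
    with any passage token.

    q_tok: (num_query_tokens, dim), d_tok: (num_passage_tokens, dim)
    """
    sim = q_tok @ d_tok.T  # (num_query_tokens, num_passage_tokens)
    return float(sim.max(axis=1).sum())


# Toy example: 2 query token vectors, 3 passage token vectors, dim = 2.
q = np.array([[1.0, 0.0], [0.0, 1.0]], dtype=np.float32)
d = np.array([[1.0, 0.0], [0.5, 0.5], [0.0, -1.0]], dtype=np.float32)

score = maxsim_score(q, d)
print(score)  # 1.0 (first query token) + 0.5 (second query token) = 1.5
```

A flat inner-product index such as the one in faiss_retriever compares one vector per query with one vector per passage, so it cannot compute this per-token maximum on its own.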

lboesen commented 1 year ago

Thank you for your quick reply and yes I will have a look at the original colbert repo.

Do you by any chance know whether the ColBERT training parameters set in Tevatron give an effectiveness score equal to that of the original ColBERT, which reports MRR@10 = 36.0?
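
For anyone comparing numbers, here is a minimal sketch of how MRR@10 can be computed from a ranking and relevance judgments. The data structures and the name `mrr_at_k` are hypothetical. Official evaluation scripts may differ in details such as which queries are averaged over, so treat this as an illustration of the metric rather than a replacement for them. If the 36.0 above is read as a percentage, it would correspond to 0.360 on this scale.

```python
from typing import Dict, List, Set


def mrr_at_k(rankings: Dict[str, List[str]],
             qrels: Dict[str, Set[str]],
             k: int = 10) -> float:
    """Mean reciprocal rank of the first relevant passage within the top k.

    Queries with no relevant passage in the top k contribute 0.
    """
    total = 0.0
    for qid, ranked in rankings.items():
        relevant = qrels.get(qid, set())
        for rank, pid in enumerate(ranked[:k], start=1):
            if pid in relevant:
                total += 1.0 / rank
                break
    return total / len(rankings) if rankings else 0.0


# Toy data: q1 hits at rank 2, q2 at rank 1, q3 has no hit.
rankings = {"q1": ["p3", "p1"], "q2": ["p9", "p8", "p7"], "q3": ["p5"]}
qrels = {"q1": {"p1"}, "q2": {"p9"}, "q3": {"p0"}}

mrr = mrr_at_k(rankings, qrels, k=10)
print(mrr)  # (1/2 + 1 + 0) / 3 = 0.5
```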