thunlp / OpenMatch

An Open-Source Package for Information Retrieval.
MIT License

How to use the API to initialize a model with .bin checkpoints #8

Closed zhenduow closed 4 years ago

zhenduow commented 4 years ago

Hi,

I am trying to train a BERT and a KNRM ranker on my own data. I can't use the inference script because my test data does not have the required score field (and some other fields). So, I am trying to use the API directly.

How can I use the API to initialize a model with my .bin checkpoints, as in the example model = om.models.Bert("allenai/scibert_scivocab_uncased")? I have tried pretrained = mycheckpoints.bin, but I see that pretrained requires a huggingface checkpoint or a config.json, and there is no config.json in my checkpoint path...

zkt12 commented 4 years ago

Hi,

Thanks for your interest in our toolkit. The score field is the retrieval score from BM25/SDM or another retrieval method. It is used for learning-to-rank methods, e.g. coor-ascent, after training neural rerankers. You can just set this field to 0 in reranking scenarios.

For checkpoint loading, we provide the -checkpoint option. The saved checkpoint needs to have the same architecture as our pre-defined model :)
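A minimal sketch of what loading a .bin checkpoint looks like under the hood, assuming the file was saved with torch.save(model.state_dict(), ...). TinyRanker here is a hypothetical stand-in for OpenMatch's pre-defined models; the point is that the architecture you instantiate must match the saved state dict:

```python
import os
import tempfile

import torch
import torch.nn as nn


class TinyRanker(nn.Module):
    """Hypothetical stand-in for a pre-defined ranker architecture."""

    def __init__(self):
        super().__init__()
        self.linear = nn.Linear(4, 1)

    def forward(self, x):
        return self.linear(x)


# Save a checkpoint to a .bin file, as a training script typically would.
ckpt_path = os.path.join(tempfile.mkdtemp(), "mycheckpoint.bin")
model = TinyRanker()
torch.save(model.state_dict(), ckpt_path)

# Re-create the same architecture, then load the weights from the .bin file.
restored = TinyRanker()
restored.load_state_dict(torch.load(ckpt_path))
```

If the architectures differ, load_state_dict raises an error about missing or unexpected keys, which is why the checkpoint must match the pre-defined model.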

zhenduow commented 4 years ago

Thank you, Toby. I checked inference.sh and saw that inference runs in batches. Does that mean the doc candidates come from the batch? If so, that does not work for my task, because my data consists of a query with a list of passages to be ranked.

Is there a way to run the reranker in "interactive mode", giving it a query and a doc list and getting back a ranked list of docs?

Thank you so much!

zkt12 commented 4 years ago

The inference.sh scores each query-doc pair and outputs the ranking score. When all pairs are scored, the ranking scores are saved to a dict that looks like this: {'q_id1': [(score1, d_id1), (score2, d_id2), ...]}. Finally, we write a ranking list to a file in TREC format; the file path is set by -res. You can also change the code in inference.py to skip the file-writing step and just use the dict.

Suppose you have a query and a list of docs: number the query and docs, then preprocess the data into our input format, where each line looks like {'query_id': q_id1, 'doc_id': d_id1, 'retrieval_score': 0, 'query': query1, 'doc': doc1}. Then inference.sh can score these query-doc pairs, batch by batch, and return the resulting ranking list.
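The preprocessing step above can be sketched like this, building one JSON line per query-doc pair with retrieval_score fixed to 0 for pure reranking. The helper name and the generated ids (q1, d1, ...) are illustrative assumptions, not part of OpenMatch:

```python
import json


def to_openmatch_jsonl(query, docs, q_id="q1"):
    """Turn one query and its candidate docs into per-line JSON records
    matching the input format described above."""
    lines = []
    for i, doc in enumerate(docs, start=1):
        lines.append(json.dumps({
            "query_id": q_id,
            "doc_id": f"d{i}",
            "retrieval_score": 0,  # set to 0 when reranking without BM25/SDM scores
            "query": query,
            "doc": doc,
        }))
    return "\n".join(lines)
```

Writing the returned string to a file gives one record per line, which is the shape the inference script expects to consume batch by batch.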

zhenduow commented 4 years ago

I see what you mean. In my case, I have N queries and each of them has k docs, so it does not work exactly with batch inference. But sure, I guess I can make k-1 empty queries to pair with each of them.

Thank you very much for your instruction!

zkt12 commented 4 years ago

We also have N queries, each with k docs, so the total number of q-d pairs is N*k. Suppose the batch size is t; then we have N*k/t batches. The neural ranker takes each batch as input and outputs a batch of ranking scores. Each q-d pair in a batch is scored independently. If we have the query_id and doc_id, we can save the ranking score of each q-d pair to a dict and write a result file. I don't know what you mean by 'it does not work exactly with batch inference'.
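The batching arithmetic above can be sketched as follows. score_fn is a stand-in for the neural ranker; since each pair is scored independently, the choice of batch size t cannot change any pair's score:

```python
def score_in_batches(pairs, t, score_fn):
    """Score a flat list of (q_id, d_id) pairs in batches of size t.

    score_fn takes a batch (list of pairs) and returns one score per pair,
    mimicking a neural ranker's forward pass. Because pairs are scored
    independently, the result is the same for any batch size.
    """
    scores = {}
    for start in range(0, len(pairs), t):
        batch = pairs[start:start + t]
        for (q_id, d_id), s in zip(batch, score_fn(batch)):
            scores[(q_id, d_id)] = s
    return scores
```

With N queries of k docs each, the flat list has N*k pairs and the loop runs ceil(N*k/t) times; no query ever needs to fit inside a single batch.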

zhenduow commented 4 years ago

> We also have N queries, each of them has k docs, so the total number of q-d pairs is N*k. Suppose the batch size is t, then we have N*k/t batches. The neural ranker takes each batch as input, and output a batch of ranking scores. Each q-d pair in a batch is measured independently. If we have the query_id and doc_id, we can save the ranking score of each q-d pair to a dict, and write a result file. I don't know what you mean 'it does not work exactly with batch inference'.

Oh, now I see. When I said 'it does not work exactly with batch inference', I meant that I thought the candidate docs were selected from within the batch (so a batch of size t would have t true (query, doc) pairs, and each query would take the t docs in the batch as candidates, yielding t*t (query, doc) pairs to score). But according to what you describe above, I was wrong.

Thank you so much for the further explanation! I will try it.