Hi,
Thanks for your interest in our toolkit. The score field is the retrieval score from BM25/SDM or another retrieval method. It is used by learning-to-rank methods, e.g. coor-ascent, after training neural rerankers. You can just set this field to 0 in reranking scenarios.
For checkpoint loading, we provide -checkpoint. The saved checkpoint needs to have the same architecture as our pre-defined model :)
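To illustrate that constraint, here is a simplified sketch in plain PyTorch (not the toolkit's actual loading code) of why the checkpoint must come from the same architecture:

```python
import torch

# Stand-in model for the ranker; the real one is whatever -checkpoint was saved from.
model = torch.nn.Linear(4, 1)
torch.save(model.state_dict(), "checkpoint.bin")

# Loading works only when parameter names and shapes match the saved state dict.
same_arch = torch.nn.Linear(4, 1)
same_arch.load_state_dict(torch.load("checkpoint.bin"))        # ok

different_arch = torch.nn.Linear(8, 1)
# different_arch.load_state_dict(torch.load("checkpoint.bin"))  # would raise a size-mismatch RuntimeError
```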
Thank you, Toby. I checked inference.sh and see that inference runs in batches. Does that mean the doc candidates come from the batch? If so, that does not work for my task, because my data consists of a query with a list of passages to be ranked.
Is there a way that I can run the reranker in "interactive mode" by giving it a query and a doc list, and get back a ranked list of docs?
Thank you so much!
The inference.sh measures each query-doc pair and outputs the ranking score. When all pairs are measured, the ranking scores are saved to a dict, which looks like this: {'q_id1': [(score1, d_id1), (score2, d_id2), ...]}. Finally we write a ranking list to a file in TREC format; the file path is set by -res. You can also change the code in inference.py to skip the file-writing step and just use the dict.
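For reference, a minimal sketch of that last step (names like scored_pairs and result.trec are only illustrative, not the actual code in inference.py):

```python
from collections import defaultdict

# Scores collected batch by batch during inference (toy example here).
scored_pairs = [('q1', 'd1', 2.3), ('q1', 'd2', 1.7), ('q2', 'd5', 0.9)]

ranking_scores = defaultdict(list)            # {'q_id': [(score, d_id), ...]}
for query_id, doc_id, score in scored_pairs:
    ranking_scores[query_id].append((score, doc_id))

# Write a TREC-format run file (the path passed via -res).
with open('result.trec', 'w') as f:
    for query_id, pairs in ranking_scores.items():
        for rank, (score, doc_id) in enumerate(sorted(pairs, reverse=True), start=1):
            f.write(f'{query_id} Q0 {doc_id} {rank} {score} reranker\n')
```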
Suppose you have a query and a list of docs. You can number the query and docs, and preprocess the data into our input format, where each line looks like: {'query_id': q_id1, 'doc_id': d_id1, 'retrieval_score': 0, 'query': query, 'doc': doc1}. Then inference.sh can measure these query-doc pairs, batch by batch, and return the resulting ranking list.
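A small sketch of that preprocessing step, assuming the input file is one JSON object per line with the fields shown above (file name and ids are placeholders):

```python
import json

query = "how to load a saved reranker checkpoint"   # your single query
docs = ["first passage ...", "second passage ...", "third passage ..."]

with open('rerank_input.jsonl', 'w') as f:
    for i, doc in enumerate(docs, start=1):
        f.write(json.dumps({
            'query_id': 'q_1',
            'doc_id': f'd_{i}',
            'retrieval_score': 0,   # unused when only reranking
            'query': query,
            'doc': doc,
        }) + '\n')
```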
I see what you mean. In my case, I have N queries and each of them has k docs. So it does not work exactly with batch inference. But sure, I guess I can make k-1 empty queries to pair with each of them.
Thank you very much for your instruction!
We also have N queries, each with k docs, so the total number of q-d pairs is N*k. Suppose the batch size is t; then we have N*k/t batches. The neural ranker takes each batch as input and outputs a batch of ranking scores. Each q-d pair in a batch is measured independently. If we have the query_id and doc_id, we can save the ranking score of each q-d pair to a dict and write a result file. I don't know what you mean by 'it does not work exactly with batch inference'.
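In rough Python, that batching looks like this (score_batch stands in for a forward pass of the ranker; this is not the actual toolkit code):

```python
# N*k query-doc pairs, scored t at a time; a batch is just a slice of pairs.
pairs = [
    ('q1', 'd1', 'query one', 'doc one'),
    ('q1', 'd2', 'query one', 'doc two'),
    ('q2', 'd3', 'query two', 'doc three'),
]

def score_batch(batch):
    return [0.0 for _ in batch]   # placeholder for the neural ranker

t = 2                              # batch size
scores = {}
for start in range(0, len(pairs), t):
    batch = pairs[start:start + t]
    for (q_id, d_id, _, _), s in zip(batch, score_batch(batch)):
        scores.setdefault(q_id, []).append((s, d_id))   # each pair is scored independently
```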
Oh, now I see. When I said 'it does not work exactly with batch inference' I meant that I thought the candidate docs were selected from the batch (so a batch of size t would have t true (query, doc) pairs, with each query taking the t docs in the batch as its candidates, making t*t (query, doc) pairs to score). But from what you describe above, I was wrong.
Thank you so much for the further explanation! I will try it.
Hi,
I am trying to train a BERT ranker and a KNRM ranker on my own data. I can't use the inference script because my test data does not have the required score field and some other fields. So I am trying to use the API directly.

How can I use the API to initialize a model from my .bin checkpoints, like in the example model = om.models.Bert("allenai/scibert_scivocab_uncased")? I have tried pretrained = mycheckpoints.bin, but I see that pretrained requires a huggingface checkpoint or a config.json. But there is no config.json in my checkpoint path...
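Concretely, what I tried looks roughly like this (paths are placeholders):

```python
import OpenMatch as om

# Works: pretrained points at a huggingface model name / directory with a config.json.
model = om.models.Bert("allenai/scibert_scivocab_uncased")

# Roughly what I attempted with my own checkpoint (a bare state-dict .bin,
# no config.json next to it), which is where it fails for me.
my_model = om.models.Bert(pretrained="path/to/mycheckpoints.bin")
```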