Multiple fields for text.scorer (besides body_attr)

terrier-org / pyterrier

A Python framework for performing information retrieval experiments, building on http://terrier.org/

https://pyterrier.readthedocs.io/

Mozilla Public License 2.0


Multiple fields for text.scorer (besides body_attr) #450

Closed albertoueda closed 2 months ago

albertoueda commented 2 months ago

Is it possible to use multiple fields in text.scorer?

Context:

I have an index with multiple fields and metadata.
I would like to rank new document content (not indexed) with BM25F over multiple fields, using the background index as mentioned in the documentation.
I can set the body_attr to one of the columns, but how should I proceed in the case of multiple fields?

If it is not possible today, maybe body_attr could accept a list, and/or be renamed to text_cols, or text_attrs.

Also, if the new document we want to rank has fields corresponding to metadata/fields available in the background index, maybe they could be matched automatically.

cmacdonald commented 2 months ago

So you are right, in that it doesn't currently, but it could be extended to do so...

But I would first ask: is there an easier way to implement it for now?

Can you write a function that tokenises queries and several fields and looks up the relevant stats in a Terrier lexicon, i.e. to calculate BM25F manually in python?
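
A minimal sketch of what such a function could look like, in plain Python. It uses the simple textbook form of BM25F (per-field length-normalised term frequencies, combined with field weights and then saturated with k1), which is not necessarily identical to Terrier's own BM25F implementation. The names bm25f_score and tokenise and all parameter names are hypothetical. The statistics (df, num_docs, avg_field_len) are passed in as plain values here; in practice they would have to be read from the background index's lexicon and collection statistics, and the tokenisation would need to match the index's own (stemming, stopwords).

```python
import math
from collections import Counter


def tokenise(text):
    # Hypothetical stand-in: lowercase + whitespace split.
    # A real version should mirror the tokeniser/stemmer used to build the index.
    return text.lower().split()


def bm25f_score(query, doc_fields, df, num_docs, avg_field_len, weights, b, k1=1.2):
    """Score one unindexed document (dict of field name -> text) against a query.

    df            : term -> document frequency in the background index
    num_docs      : number of documents in the background index
    avg_field_len : field name -> average field length (in tokens)
    weights, b    : field name -> BM25F field weight / length-normalisation parameter
    """
    field_tfs = {f: Counter(tokenise(text)) for f, text in doc_fields.items()}
    field_lens = {f: sum(counts.values()) for f, counts in field_tfs.items()}

    score = 0.0
    for term in set(tokenise(query)):
        n_t = df.get(term, 0)
        if n_t == 0:
            continue  # term unknown to the background index: no statistics to use

        # Weighted sum of length-normalised per-field term frequencies.
        pseudo_tf = 0.0
        for f, counts in field_tfs.items():
            tf = counts[term]
            if tf == 0:
                continue
            norm = 1.0 - b[f] + b[f] * field_lens[f] / avg_field_len[f]
            pseudo_tf += weights[f] * tf / norm

        # Non-negative IDF variant (an assumption; other variants exist).
        idf = math.log(1.0 + (num_docs - n_t + 0.5) / (n_t + 0.5))
        score += idf * pseudo_tf / (k1 + pseudo_tf)
    return score
```

With something like this, the field weights and b values are explicit arguments, so different field configurations can be tried without touching the index.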

albertoueda commented 2 months ago

I'm afraid I'm not very experienced with Terrier lexicons. Another option I considered is indexing the new documents (and their fields) together with the initial documents (they are not actually "new" documents, just processed versions of the ones already indexed).

In this direction, is there a way to index new documents with PyTerrier after an initial index is built? I've noticed there is incremental indexing in Terrier, but is it possible with PyTerrier indexers such as IterDictIndexer?

Should I close this issue?

cmacdonald commented 2 months ago

You can use the + operator on two Terrier indices and retrieve from the combined "virtual index". See example https://github.com/terrier-org/pyterrier/blob/master/tests/test_index_op.py#L128

One index could be your original documents, and the other index could contain your new documents.
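
A rough sketch of that idea, assuming both indices have already been built on disk. The function name combined_retriever and its arguments are hypothetical, and whether per-field statistics (as needed for BM25F) carry over to the combined index is not something this sketch checks.

```python
def combined_retriever(original_path, new_path, wmodel="BM25"):
    """Return a retriever over the virtual combination of two existing Terrier indices.

    original_path / new_path are hypothetical locations of two indices built
    beforehand, e.g. with IterDictIndexer.
    """
    # Imported lazily so this function can be defined without PyTerrier/Java present.
    import pyterrier as pt

    if not pt.started():
        pt.init()

    original = pt.IndexFactory.of(original_path)
    new = pt.IndexFactory.of(new_path)

    # The + operator yields a combined "virtual index" over both.
    combined = original + new
    return pt.BatchRetrieve(combined, wmodel=wmodel)
```

The returned retriever could then be used like any other, for example by calling its search method with a query string.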

cmacdonald commented 2 months ago

If you are happy, @albertoueda, perhaps we can close this?