terrier-org / pyterrier

A Python framework for performing information retrieval experiments, building on http://terrier.org/
https://pyterrier.readthedocs.io/
Mozilla Public License 2.0

Pyterrier support for multifield document retrieval #213

Open cemeiq opened 3 years ago

cemeiq commented 3 years ago

Does PyTerrier support multifield document retrieval? For example, a document may have several fields, such as body and conclusion, and we may want to match a query against all of the fields in each document.

cmacdonald commented 3 years ago

Yes, with Terrier it does. The background information is at https://github.com/terrier-org/terrier-core/blob/5.x/doc/configure_indexing.md#fields

IterDictIndexer is currently the PyTerrier class that most easily exposes Terrier's field indexing. See https://pyterrier.readthedocs.io/en/latest/terrier-indexing.html#iterdictindexer
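
As a rough illustration of the input such an indexer consumes, here is a minimal sketch of a generator that yields one dict per document, with a docno plus one key per field. The helper name iter_documents, the field names (body and conclusion come from the question above, title is added for illustration) and the choice to fill missing fields with empty strings are all assumptions, not part of PyTerrier.

```python
from typing import Dict, Iterable, Iterator, Sequence


def iter_documents(
    rows: Iterable[Dict[str, str]],
    fields: Sequence[str] = ("title", "body", "conclusion"),  # illustrative field names
) -> Iterator[Dict[str, str]]:
    """Yield one dict per document with a 'docno' key and one key per field.

    Hypothetical helper: it only normalises the input so that every
    document carries the same set of keys.
    """
    for i, row in enumerate(rows):
        # Fall back to a generated docno when the row does not provide one.
        doc = {"docno": row.get("docno", f"d{i}")}
        for field in fields:
            # Assumption: a missing field is represented by an empty string.
            doc[field] = row.get(field, "")
        yield doc
```

An iterator like this would then be handed to the indexer's index() method together with the field names, in the manner of the indexing call quoted later in this thread.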

Craig

cmacdonald commented 2 years ago

@cemeiq did you manage to achieve what you wanted?

bjoernengelmann commented 1 year ago

Hi @cmacdonald, today I encountered exactly the same case.

I tried to search for the term "nutrition" in the title field, using the Terrier query language (https://github.com/terrier-org/terrier-core/blob/5.x/doc/querylanguage.md).

index = pt.IndexFactory.of("/workspace/index/data.properties")
bm25 = pt.BatchRetrieve(index, wmodel='BM25')
bm25.search("title:nutrition")

I get this error:

09:03:37.687 [main] ERROR org.terrier.querying.LocalManager - Problem running Matching, returning empty result set as query 1
java.io.IOException: Unknown field TITLE - known fields are [docno, table_content, textBefore, textAfter, pageTitle, title, entities, orientation, url, header, key_col, catchall]
    at org.terrier.matching.matchops.SingleTermOp.getPostingIterator(SingleTermOp.java:120)
    at org.terrier.matching.matchops.SingleTermOp.getMatcher(SingleTermOp.java:149)
    at org.terrier.matching.PostingListManager.<init>(PostingListManager.java:304)
    at org.terrier.matching.PostingListManager.<init>(PostingListManager.java:282)
    at org.terrier.matching.daat.Full.match(Full.java:88)
    at org.terrier.querying.LocalManager$ApplyLocalMatching.process(LocalManager.java:518)
    at org.terrier.querying.LocalManager.runSearchRequest(LocalManager.java:895)

This is how I created the index:

indexer = pt.IterDictIndexer("./workspace/index", meta={'docno': 128}, threads=8)

indexref = indexer.index(iter_file(webtable_dump), fields=['docno', 'table_content', 'textBefore', 'textAfter', 'pageTitle',
            'title', 'entities', 'orientation', 'url', 'header', 'key_col', 'catchall'])

Is there any other way to search using specific fields of the index?

cmacdonald commented 1 year ago

hi @bjoernengelmann

Thanks for your report. This looks like a bug, perhaps in IterDictIndexer. I will investigate in due course.

As a workaround, can you try changing the field names to uppercase in the data.properties file of the generated index? The field names are held in a property called index.inverted.fields.names.
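
For anyone who prefers to edit the file directly, the manual change described above can be sketched in plain Python without loading the index. This is only a sketch under assumptions: it treats data.properties as simple key=value lines, the function names uppercase_field_names and patch_properties_file are hypothetical, and the ISO-8859-1 encoding is assumed because that is the usual encoding for Java .properties files. Back up the file before trying it.

```python
from pathlib import Path

# Property named in the comment above; it holds the comma-separated field names.
FIELDS_KEY = "index.inverted.fields.names"


def uppercase_field_names(properties_text: str, key: str = FIELDS_KEY) -> str:
    """Return the properties text with the value of `key` uppercased.

    Assumes plain `key=value` lines; all other lines are passed through unchanged.
    """
    out = []
    for line in properties_text.splitlines(keepends=True):
        name, sep, value = line.partition("=")
        if sep and name.strip() == key:
            # Only the value is uppercased; the key and line ending are kept.
            line = name + sep + value.upper()
        out.append(line)
    return "".join(out)


def patch_properties_file(path: str) -> None:
    """Rewrite the given data.properties file in place (hypothetical helper)."""
    p = Path(path)
    text = p.read_text(encoding="iso-8859-1")
    p.write_text(uppercase_field_names(text), encoding="iso-8859-1")
```

Calling patch_properties_file on the index's data.properties path would apply the change. The pure function is kept separate so the transformation can be checked on a string first.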

cmacdonald commented 1 year ago

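Rough sketch of code to do the same:

import pyterrier as pt  # assumes PyTerrier is already set up, as in the snippets above

index = pt.IndexFactory.of("/workspace/index/data.properties")
# cast to PropertiesIndex so the index properties can be read and modified
index = pt.cast("org.terrier.structures.PropertiesIndex", index)
fields = index.getIndexProperty("index.inverted.fields.names", None)
# uppercase the stored field names so they match the name in the error (TITLE)
index.setIndexProperty("index.inverted.fields.names", str(fields).upper())
# write the updated properties back to the index
index.flush()
index = None

bjoernengelmann commented 1 year ago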

Thank you so much, this worked for me!