Open cemeiq opened 3 years ago
Yes, with Terrier it does. The background information is at https://github.com/terrier-org/terrier-core/blob/5.x/doc/configure_indexing.md#fields
IterDictIndexer is the PyTerrier class that currently exposes Terrier's field indexing most easily. See https://pyterrier.readthedocs.io/en/latest/terrier-indexing.html#iterdictindexer
Craig
@cemeiq did you manage to achieve what you wanted?
Hi @cmacdonald, today I encountered exactly the same case.
I tried to search for the term "nutrition" in the title field, using the Terrier query language (https://github.com/terrier-org/terrier-core/blob/5.x/doc/querylanguage.md).
index = pt.IndexFactory.of("/workspace/index/data.properties")
bm25 = pt.BatchRetrieve(index, wmodel='BM25')
bm25.search("title:nutrition")
I get this error:
09:03:37.687 [main] ERROR org.terrier.querying.LocalManager - Problem running Matching, returning empty result set as query 1
java.io.IOException: Unknown field TITLE - known fields are [docno, table_content, textBefore, textAfter, pageTitle, title, entities, orientation, url, header, key_col, catchall]
at org.terrier.matching.matchops.SingleTermOp.getPostingIterator(SingleTermOp.java:120)
at org.terrier.matching.matchops.SingleTermOp.getMatcher(SingleTermOp.java:149)
at org.terrier.matching.PostingListManager.<init>(PostingListManager.java:304)
at org.terrier.matching.PostingListManager.<init>(PostingListManager.java:282)
at org.terrier.matching.daat.Full.match(Full.java:88)
at org.terrier.querying.LocalManager$ApplyLocalMatching.process(LocalManager.java:518)
at org.terrier.querying.LocalManager.runSearchRequest(LocalManager.java:895)
This is how I created the index:
indexer = pt.IterDictIndexer("./workspace/index", meta={'docno': 128}, threads=8)
indexref = indexer.index(iter_file(webtable_dump), fields=['docno', 'table_content', 'textBefore', 'textAfter', 'pageTitle',
'title', 'entities', 'orientation', 'url', 'header', 'key_col', 'catchall'])
Is there any other way to search using specific fields of the index?
Hi @bjoernengelmann,
Thanks for your report. This looks like a bug, perhaps in IterDictIndexer; I will investigate in due course.
As a workaround, can you try changing the field names to uppercase in the data.properties file of the generated index? The field names are held in a property called index.inverted.fields.names.
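Rough sketch of code to do the same:
index = pt.IndexFactory.of("/workspace/index/data.properties")
# cast to PropertiesIndex so the index properties can be read and changed
index = pt.cast("org.terrier.structures.PropertiesIndex", index)
fields = index.getIndexProperty("index.inverted.fields.names", None)
# upper-case the comma-separated list of field names
index.setIndexProperty("index.inverted.fields.names", fields.upper())
# flush so that the changed property is persisted with the index
index.flush()
index = None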
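If going through the Java index object is inconvenient, the same workaround could in principle be applied by editing data.properties as a text file. The following is only a minimal sketch, not something taken from PyTerrier itself: the helper name `uppercase_field_names` is hypothetical, and it assumes the file uses plain `key=value` lines, that the field names are ASCII, and that the index is not open elsewhere while the file is rewritten. Keeping a backup copy of data.properties first is advisable.

```python
from pathlib import Path

# Property named in the workaround above.
FIELD_NAMES_KEY = "index.inverted.fields.names"


def uppercase_field_names(properties_path):
    """Upper-case the value of index.inverted.fields.names in a properties file.

    Hypothetical helper: assumes simple `key=value` lines and ASCII field names.
    Returns the new value, or None if the property is not present
    (in which case the file is left untouched).
    """
    path = Path(properties_path)
    # latin-1 round-trips every byte, so untouched lines are preserved as-is.
    lines = path.read_text(encoding="latin-1").splitlines()
    new_value = None
    for i, line in enumerate(lines):
        stripped = line.strip()
        # Skip blank lines, comments, and lines without a key=value separator.
        if not stripped or stripped.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        if key.strip() == FIELD_NAMES_KEY:
            new_value = value.strip().upper()
            lines[i] = f"{FIELD_NAMES_KEY}={new_value}"
    if new_value is not None:
        path.write_text("\n".join(lines) + "\n", encoding="latin-1")
    return new_value
```

With the paths used in this thread, the call would be `uppercase_field_names("/workspace/index/data.properties")`, after which the index needs to be loaded again with `pt.IndexFactory.of(...)` for the change to be picked up.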
Thank you so much, this worked for me!
Does PyTerrier support multi-field document retrieval? For example, a document may have fields such as body and conclusion, and we may want to search a query across all the fields of each document.