Closed JMMackenzie closed 9 months ago
Hi both,
This is a long-standing issue in Terrier. People tend to create indices with fields, then complain it doesn't fit in memory when they aren't using a weighting model that needs fields.
This has been exacerbated by IterDictIndexer creating field indices by default (it shouldn't; we need to make a separate FieldIterDictIndexer for that, see longstanding #101). Field-based models are used rarely enough that we shouldn't change the behaviour of IterDictIndexer to make FSAFieldDocumentIndex yet.
I think the interim way forward would be for FSADocumentIndex to emit a warning if number of fields > 1. That warning should be visible in PyTerrier.
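To make the idea concrete, here is a minimal Python-side sketch of such a check. It is not the actual Terrier change: the function name `warn_if_fields_ignored` is hypothetical, and it assumes the caller has already obtained the number of fields and the value of `index.document.class` from the index.

```python
import warnings

FSA_DOC_INDEX = "org.terrier.structures.FSADocumentIndex"
FSA_FIELD_DOC_INDEX = "org.terrier.structures.FSAFieldDocumentIndex"


def warn_if_fields_ignored(num_fields: int, document_class: str) -> bool:
    """Warn when an index has more than one field but uses the plain document index.

    Hypothetical helper: returns True if a warning was emitted, False otherwise.
    """
    if num_fields > 1 and document_class == FSA_DOC_INDEX:
        warnings.warn(
            f"Index has {num_fields} fields but index.document.class is "
            f"{document_class}; field-based models such as BM25F may be slow. "
            f"Consider {FSA_FIELD_DOC_INDEX}.",
            UserWarning,
            stacklevel=2,
        )
        return True
    return False
```

Calling `warn_if_fields_ignored(2, FSA_DOC_INDEX)` emits a `UserWarning`, while an index with a single field, or one already using the field-aware class, passes silently.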
Meanwhile, the workaround here is to patch the index before use:
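import pyterrier as pt
from jnius import cast  # assumes the pyjnius-based PyTerrier bridge
index = pt.IndexFactory.of("/path/to/index")
# Cast to IndexOnDisk so index properties and the structure cache are accessible
pindex = cast("org.terrier.structures.IndexOnDisk", index)
# Switch the document index to the field-aware implementation
pindex.setIndexProperty("index.document.class", "org.terrier.structures.FSAFieldDocumentIndex")
# Evict the cached structures, then reload them with the new document index class
pindex.structureCache.remove("document")
pindex.structureCache.remove("inverted")
pindex.getIndexStructure("inverted")
pindex.getIndexStructure("document")
bm25f = pt.BatchRetrieve(pindex, "BM25F")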
Proposal for the warning here: https://github.com/terrier-org/terrier-core/compare/fields_docindex_warning?expand=1 (it's only enabled for fields > 1).
Thanks Craig, that's very helpful to know. We somehow had a configuration file with the right document class initially (and retrieval was very fast) but saw huge regressions (I think it was ~15x slower) and tracked it down to being this. Not sure how we got the "right" config in the first place, maybe it was a hangover from an older experiment with Terrier that we revived.
I think the warning is a sufficient work-around for now, agreed. Thanks for your help.
Describe the bug
When indexing with fields, the output configuration file (data.properties) seems to not include the optimization for fields that improves retrieval efficiency. CC @Watheq9
Expected to see org.terrier.structures.FSAFieldDocumentIndex but get org.terrier.structures.FSADocumentIndex
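One way to confirm which class was written is to read the properties file directly. The following is a rough sketch, assuming the index directory contains a `data.properties` file with simple `key=value` lines; the helper name `read_document_class` is hypothetical, and the parsing deliberately ignores the less common Java properties syntax (`:` separators, line continuations).

```python
from pathlib import Path
from typing import Optional


def read_document_class(index_dir: str) -> Optional[str]:
    """Return the value of index.document.class from data.properties, or None.

    Simplified parser: handles only plain `key=value` lines and skips comments.
    """
    props_path = Path(index_dir) / "data.properties"
    for raw in props_path.read_text(encoding="latin-1").splitlines():
        line = raw.strip()
        if not line or line.startswith(("#", "!")):
            continue  # skip blank lines and comments
        key, sep, value = line.partition("=")
        if sep and key.strip() == "index.document.class":
            return value.strip()
    return None
```

For example, `print(read_document_class("/path/to/index"))` would show whether the index was written with FSADocumentIndex or FSAFieldDocumentIndex.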
To Reproduce
Here is the indexing script:
More Info
Hunting around a bit in Terrier, it seems that it should automatically be outputting the correct configuration since fields > 0 (see here for example: https://github.com/terrier-org/terrier-core/blob/810d436f201ca9bc5881fa052c0b2cc9130ebc89/modules/batch-indexers/src/main/java/org/terrier/structures/indexing/DiskIndexWriter.java#L212).
Any ideas? Thanks for your help. Please move to Terrier if that's better.