terrier-org / pyterrier

A Python framework for performing information retrieval experiments, building on http://terrier.org/
https://pyterrier.readthedocs.io/
Mozilla Public License 2.0

Fielded Model configuration file issues #405

Closed JMMackenzie closed 9 months ago

JMMackenzie commented 9 months ago

Describe the bug

When indexing with fields, the output configuration file (data.properties) does not appear to include the field-specific document index class that improves retrieval efficiency. CC @Watheq9

Expected to see org.terrier.structures.FSAFieldDocumentIndex but get org.terrier.structures.FSADocumentIndex
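
One way to confirm which document index class an index was written with is to read the index.document.class entry from data.properties directly. The following is a minimal sketch, assuming the file holds plain key=value lines without escape sequences or line continuations; the helper name read_index_properties is hypothetical and not part of PyTerrier:

```python
from pathlib import Path


def read_index_properties(path):
    """Parse simple key=value lines from a Java-style .properties file.

    Hypothetical helper for illustration: it ignores backslash escapes
    and line continuations, which a full .properties parser would handle.
    """
    props = {}
    # .properties files are traditionally ISO-8859-1 encoded.
    for raw in Path(path).read_text(encoding="latin-1").splitlines():
        line = raw.strip()
        # Skip blank lines and comment lines (starting with '#' or '!').
        if not line or line[0] in "#!":
            continue
        key, sep, value = line.partition("=")
        if sep:
            props[key.strip()] = value.strip()
    return props
```

Calling `read_index_properties(index_path + "/data.properties").get("index.document.class")` on the index built by the script below should then show which of the two classes was recorded.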

To Reproduce

Here is the indexing script:

import numpy as np
import pyterrier as pt
import pandas as pd

file = "/path/to/msmarco-v1/first_100K_rows.jsonl"
TEXT = "text"
index_path = "/path/to/msmarco-v1/indexes/first_100K_index"
TITLE = "title"

if not pt.started():
    pt.init(tqdm='tqdm', version='5.7', helper_version='0.0.6')

df_col1 =  pd.read_json(file, lines=True)
df_col1['id'] = df_col1['id'].astype('str')

def get_document(df):
    for i, row in df.iterrows():
        yield {"docno": row["id"], TEXT: row[TEXT], TITLE: row[TITLE]}

# build multi-field index
iter_indexer = pt.IterDictIndexer(index_path, overwrite=True, verbose=True)
iter_indexer.setProperty("tokeniser", "EnglishTokeniser")
# perform indexing
indexref = iter_indexer.index(get_document(df_col1), fields=["text", "title"],
                              meta=['docno'])

print("Done indexing")

More Info

Hunting around a bit in Terrier, it seems that it should automatically be outputting the correct configuration since fields > 0 (see here for example: https://github.com/terrier-org/terrier-core/blob/810d436f201ca9bc5881fa052c0b2cc9130ebc89/modules/batch-indexers/src/main/java/org/terrier/structures/indexing/DiskIndexWriter.java#L212).

Any ideas? Thanks for your help. Please move to Terrier if that's better.

cmacdonald commented 9 months ago

Hi both,

This is a long-standing issue in Terrier. People tend to create indices with fields, then complain that the index doesn't fit in memory even though they aren't using a weighting model that needs fields.

This has been exacerbated by IterDictIndexer creating field indices by default (it shouldn't; we need to make a separate FieldIterDictIndexer for that, see the long-standing #101). Field-based models are used rarely enough that we shouldn't yet change the behaviour of IterDictIndexer to write FSAFieldDocumentIndex.

I think the interim way forward would be for FSADocumentIndex to emit a warning if number of fields > 1. That warning should be visible in PyTerrier.
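
Until such a warning exists in Terrier itself, a similar check can be approximated on the Python side. This is only a sketch and not the proposed Terrier change: it takes the index properties as a plain dict, the function name is hypothetical, and it assumes the number of fields is stored under the key index.inverted.fields.count (verify this against your own data.properties):

```python
import warnings

PLAIN_DOC_INDEX = "org.terrier.structures.FSADocumentIndex"


def warn_if_plain_docindex_with_fields(props):
    """Warn and return True if the index has more than one field but
    is configured with the plain (non-field) document index class."""
    # Assumption: the field count is stored under this property key.
    n_fields = int(props.get("index.inverted.fields.count", "0"))
    doc_class = props.get("index.document.class", "")
    if n_fields > 1 and doc_class == PLAIN_DOC_INDEX:
        warnings.warn(
            f"Index has {n_fields} fields but index.document.class is "
            f"{doc_class}; field-based models may be slow."
        )
        return True
    return False
```

The check deliberately stays silent for single-field or field-less indices, mirroring the "fields > 1" condition suggested above.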

Meanwhile, the workaround here is to patch the index before use:

# Open the existing index and cast it to the on-disk implementation
index = pt.IndexFactory.of("/path/to/index")
pindex = pt.cast("org.terrier.structures.IndexOnDisk", index)
# Switch the document index to the field-aware class
pindex.setIndexProperty("index.document.class", "org.terrier.structures.FSAFieldDocumentIndex")
# Drop the cached structures so they are reloaded with the new setting
pindex.structureCache.remove("document")
pindex.structureCache.remove("inverted")
pindex.getIndexStructure("inverted")
pindex.getIndexStructure("document")
bm25f = pt.BatchRetrieve(pindex, "BM25F")

cmacdonald commented 9 months ago

I think the interim way forward would be for FSADocumentIndex to emit a warning if number of fields > 1. That warning should be visible in PyTerrier.

Proposal here: https://github.com/terrier-org/terrier-core/compare/fields_docindex_warning?expand=1 It's only enabled for fields > 1.

JMMackenzie commented 9 months ago

Thanks Craig, that's very helpful to know. We somehow had a configuration file with the right document class initially (and retrieval was very fast), but later saw huge regressions (I think it was ~15x slower) and tracked them down to this. Not sure how we got the "right" config in the first place; maybe it was a hangover from an older experiment with Terrier that we revived.

I think the warning is a sufficient work-around for now, agreed. Thanks for your help.