Closed maxhenze closed 1 year ago
Hi Max
This smells like a mismatch between the PyTerrier version and the underlying Java jar files. Can you show us what PyTerrier says after pt.init()?
Hi Craig,
of course.
It reports:
PyTerrier 0.9.1 has loaded Terrier 5.7 (built by craigm on 2022-11-10 18:30) and terrier-helper 0.0.7
No etc/terrier.properties, using terrier.default.properties for bootstrap configuration.
OK, so it's not a version problem. We don't use DFIndexer much now, as its functionality is subsumed by IterDictIndexer (discussed below). Specifically, for your first example:
indexref = indexer.index(df_docs["body"])
Pretty sure this is wrong, as you need a docno.
Instead, the following works:
indexref = indexer.index(df_docs["body"], docno=df_docs['doc_id'])
I would encourage you to: (a) Use pt.IterDictIndexer, as you can also index dataframes with it. DFIndexer promotes use of corpora as dataframes, which assumes they can be held in memory. Instead, any dataframe can be converted to "iter-dict" and indexed as that:
pt.IterDictIndexer('./idi_index').index(df_docs.rename(columns={'doc_id':'docno', 'body' : 'text'}).to_dict(orient='records'))
(b) Not create a dataframe in the first place for a large collection like MSMARCO, as you can index the generator that yields the documents directly.
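For option (b), a minimal sketch of indexing straight from a generator, assuming each record is a mapping with `doc_id` and `body` keys as in the dataframe above. The helper names `to_iter_dict` and `build_index` are hypothetical, and `pyterrier` is imported lazily so the conversion step can be tried without a JVM.

```python
from typing import Any, Iterable, Iterator, Mapping


def to_iter_dict(records: Iterable[Mapping[str, Any]]) -> Iterator[dict]:
    """Lazily rename doc_id/body to the docno/text keys used for indexing."""
    for rec in records:
        yield {"docno": str(rec["doc_id"]), "text": rec["body"]}


def build_index(records, index_path="./idi_index"):
    """Index records one at a time, without building a dataframe first."""
    import pyterrier as pt  # lazy import: requires a working Java setup

    if not pt.started():
        pt.init()
    # The generator is consumed as the indexer iterates over it
    return pt.IterDictIndexer(index_path).index(to_iter_dict(records))


# Toy records standing in for a real collection (illustrative data only)
sample = [
    {"doc_id": "D1", "body": "first document"},
    {"doc_id": "D2", "body": "second document"},
]
converted = list(to_iter_dict(sample))
```

If the documents come as namedtuples rather than dicts, calling `_asdict()` on each one before passing it in should be enough for this sketch.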
PS: I agree that for both your options DFIndexer could have had better error handling.
Works like a charm. Thank you for your fast replies.
The problem is resolved, but I have a follow-up question.
Let's say I'm running an experiment like the following:
import pyterrier as pt
dataset = pt.get_dataset("trec-deep-learning-docs")
bm25 = pt.BatchRetrieve.from_dataset(dataset, "terrier_stemmed", wmodel="BM25")
pt.Experiment(
    [bm25],
    dataset.get_topics("test"),
    dataset.get_qrels("test"),
    eval_metrics=["map", "recip_rank", "ndcg_cut_10"],
    names=["BM25"],
)
This would rank 43 queries with 16K qrels against 3.2M documents. This results in the following scores:
But of course the 16K qrels don't cover all 3.2M documents, so I could do the scoring only on the documents occurring in the qrels.
This is, by the way, the problem I'm trying to solve: I want to manipulate the document text, and doing that on all 3.2M documents would be too intensive (and wasteful).
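The size of the judged subset can be checked before any indexing. A small sketch, assuming the qrels dataframe has the `qid`, `docno` and `label` columns; the helper name `judged_docnos` and the toy data are made up for illustration.

```python
import pandas as pd


def judged_docnos(qrels: pd.DataFrame) -> list:
    """Return the unique document ids that appear in the qrels, in first-seen order."""
    return qrels["docno"].drop_duplicates().tolist()


# Toy qrels (illustrative data only): D1 is judged for two queries
toy_qrels = pd.DataFrame({
    "qid": ["q1", "q1", "q2"],
    "docno": ["D1", "D2", "D1"],
    "label": [1, 0, 2],
})
subset = judged_docnos(toy_qrels)
```

On the real qrels, `len(judged_docnos(df_qrels))` would give the number of documents that actually need to be fetched and modified.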
With your mentioned approach I would thus do the following:
import ir_datasets
import pandas as pd
import pyterrier as pt

dataset_msmarco_document_trec_dl_2019_judged = ir_datasets.load("msmarco-document/trec-dl-2019/judged")
# Document store for random access by doc_id
docstore = dataset_msmarco_document_trec_dl_2019_judged.docs_store()
df_queries = pt.get_dataset("trec-deep-learning-docs").get_topics("test")
df_qrels = pt.get_dataset("trec-deep-learning-docs").get_qrels("test")
# Fetch only the documents that appear in the qrels
df_docs = pd.DataFrame(docstore.get_many_iter(df_qrels.docno.unique()))
df_docs = df_docs.rename(columns={'body':'text', 'doc_id':'docno'})
index_path = "./pd_index"
indexer = pt.IterDictIndexer(index_path)
indexref = indexer.index(df_docs.to_dict(orient="records"))
bm25 = pt.BatchRetrieve(indexref, wmodel="BM25")
pt.Experiment(
    [bm25],
    df_queries,
    df_qrels,
    eval_metrics=["map", "recip_rank", "ndcg_cut_10"],
    names=["BM25"],
)
But this results in:
What might be the part I'm missing? The stemmer and stopwords settings should be the defaults, so I didn't set them manually in the indexer.
Your post doesn't make clear what is unexpected in the results.
To debug...
You may also want to change the number of results retrieved:
index = pt.IndexFactory.of(indexref)
bm25 = pt.BatchRetrieve(index, wmodel='BM25', num_results=len(index))
Never mind. It seems like I mixed a few things up. Nevertheless, thanks for your help 👍
While reproducing the PyTerrier Indexing Notebook a lot of JVM exceptions occur in my notebook.
Specifically, I'm loading an ir_datasets dataset and trying to index it:
This results in the following error:
Additionally, if I try to use the indexer as follows:
the following error occurs:
PyTerrier is updated to the newest version. The documents I want to index have at least a length of 3 characters.
When trying to index only the first 5 documents, the problem still persists.
Because I'm following the notebook, I would expect the code to work as stated there.
Could it be a problem with my Java version or the pt.init()? Please let me know if additional information is needed.