terrier-org / pyterrier

A Python framework for performing information retrieval experiments, building on http://terrier.org/
https://pyterrier.readthedocs.io/
Mozilla Public License 2.0

DFIndexer error messages #348

Closed maxhenze closed 1 year ago

maxhenze commented 1 year ago

While reproducing the PyTerrier Indexing Notebook, a lot of JVM exceptions occur in my notebook.

Specifically, I'm loading an ir_datasets dataset and trying to index it:

import ir_datasets
import pandas as pd
import pyterrier as pt

if not pt.started():
  pt.init()

dataset_msmarco_document_trec_dl_2019_judged = ir_datasets.load("msmarco-document/trec-dl-2019/judged")
docstore = dataset_msmarco_document_trec_dl_2019_judged.docs_store()

df_queries = pd.DataFrame(dataset_msmarco_document_trec_dl_2019_judged.queries_iter())
df_qrels = pd.DataFrame(dataset_msmarco_document_trec_dl_2019_judged.qrels_iter())
df_docs = pd.DataFrame(docstore.get_many_iter(df_qrels.doc_id.unique()))

index_path = "./pd_index"
indexer = pt.DFIndexer(index_path, verbose=True)

indexref = indexer.index(df_docs["body"])

This results in the following error:


  0%|                                                                               | 0/16043 [00:00<?, ?documents/s]

---------------------------------------------------------------------------
JavaException                             Traceback (most recent call last)
Input In [43], in <cell line: 1>()
----> 1 indexref = indexer.index(df_docs["body"])

File /opt/conda/lib/python3.8/site-packages/pyterrier/index.py:667, in DFIndexer.index(self, text, *args, **kwargs)
    665     javaDocCollection = TQDMSizeCollection(javaDocCollection, len(text)) 
    666 index = self.createIndexer()
--> 667 index.index(autoclass("org.terrier.python.PTUtils").makeCollection(javaDocCollection))
    668 global lastdoc
    669 lastdoc = None

File jnius/jnius_export_class.pxi:1177, in jnius.JavaMultipleMethod.__call__()

File jnius/jnius_export_class.pxi:885, in jnius.JavaMethod.__call__()

File jnius/jnius_export_class.pxi:982, in jnius.JavaMethod.call_method()

File jnius/jnius_utils.pxi:91, in jnius.check_exception()

JavaException: JVM exception occurred: For input string: "" java.lang.NumberFormatException

Additionally, if I try to use the indexer as follows:

indexref = indexer.index(df_docs["body"], df_docs["doc_id"])

the following error occurs:

  0%|                                                                               | 0/16043 [00:00<?, ?documents/s]

---------------------------------------------------------------------------
JavaException                             Traceback (most recent call last)
Input In [41], in <cell line: 1>()
----> 1 indexref = indexer.index(df_docs["body"], df_docs["doc_id"])

File /opt/conda/lib/python3.8/site-packages/pyterrier/index.py:667, in DFIndexer.index(self, text, *args, **kwargs)
    665     javaDocCollection = TQDMSizeCollection(javaDocCollection, len(text)) 
    666 index = self.createIndexer()
--> 667 index.index(autoclass("org.terrier.python.PTUtils").makeCollection(javaDocCollection))
    668 global lastdoc
    669 lastdoc = None

File jnius/jnius_export_class.pxi:1177, in jnius.JavaMultipleMethod.__call__()

File jnius/jnius_export_class.pxi:885, in jnius.JavaMethod.__call__()

File jnius/jnius_export_class.pxi:982, in jnius.JavaMethod.call_method()

File jnius/jnius_utils.pxi:91, in jnius.check_exception()

JavaException: JVM exception occurred: Could not instantiate MetaIndexBuilder org.terrier.structures.indexing.ZstdMetaIndexBuilder java.lang.IllegalArgumentException

PyTerrier is updated to the newest version, and the documents I want to index are each at least 3 characters long.

When trying to index only the first 5 documents, the problem still persists.

Because I'm following the notebook, I would expect the code to work as stated there.

Could it be a problem with my java version or the pt.init()? Please let me know if additional information is needed.

cmacdonald commented 1 year ago

Hi Max

This smells like a mismatch between the PyTerrier version and the underlying Java jar files. Can you show us what PyTerrier says after pt.init()?

maxhenze commented 1 year ago

Hi Craig,

of course.

It reports:

PyTerrier 0.9.1 has loaded Terrier 5.7 (built by craigm on 2022-11-10 18:30) and terrier-helper 0.0.7

No etc/terrier.properties, using terrier.default.properties for bootstrap configuration.

cmacdonald commented 1 year ago

Ok, so not a version problem. We don't use DFIndexer much now, as its functionality is subsumed by IterDictIndexer (discussed below). Specifically, for your first example:

indexref = indexer.index(df_docs["body"]) 

Pretty sure this is wrong, as you need a docno.

Instead, the following works:

indexref = indexer.index(df_docs["body"], docno=df_docs['doc_id'])

I would encourage you to: (a) use pt.IterDictIndexer, as it can also index dataframes. DFIndexer promotes the use of corpora as dataframes, which assumes they can be held in memory. Instead, a dataframe can be converted to an "iter-dict" and indexed as that:

 pt.IterDictIndexer('./idi_index').index(df_docs.rename(columns={'doc_id':'docno', 'body' : 'text'}).to_dict(orient='records'))

(b) not to create a dataframe in the first place for a large collection like MSMARCO; you can index a generator that yields the documents directly, as in the sketch below.
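
For example (a minimal sketch rather than the exact code; it assumes the same msmarco-document/trec-dl-2019/judged dataset and a hypothetical doc_iter helper):

import ir_datasets
import pyterrier as pt

if not pt.started():
    pt.init()

dataset = ir_datasets.load("msmarco-document/trec-dl-2019/judged")

# Hypothetical helper: yield one {'docno', 'text'} dict per document, so the
# corpus never has to be materialised as a DataFrame in memory.
def doc_iter():
    for doc in dataset.docs_iter():
        yield {'docno': doc.doc_id, 'text': doc.body}

indexref = pt.IterDictIndexer('./gen_index').index(doc_iter())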

PS: I agree that for both your options DFIndexer could have had better error handling.

maxhenze commented 1 year ago

Works like a charm. Thank you for your fast replies.

The problem is resolved, but I have a follow-up question.

Let's say I'm doing an experiment as follows:

import pyterrier as pt

dataset = pt.get_dataset("trec-deep-learning-docs")

bm25 = pt.BatchRetrieve.from_dataset(dataset, "terrier_stemmed", wmodel="BM25")

pt.Experiment(
    [bm25],
    dataset.get_topics("test"),
    dataset.get_qrels("test"),
    eval_metrics=["map", "recip_rank", "ndcg_cut_10"],
    names=["BM25"],
)

This would rank 43 queries (with 16K qrels) against 3.2M documents, which results in the following scores: [results screenshot]

But of course the 16K qrels don't cover all 3.2M documents, so I could do the scoring only on the documents occurring in the qrels.

This is, by the way, the problem I'm trying to resolve: I want to manipulate the document text, and that would be too intensive (and wasteful) if I did it on all 3.2M documents.
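
A minimal sketch of where that manipulation could slot in (transform_text is a hypothetical placeholder, and only the judged documents are ever touched):

import ir_datasets
import pandas as pd

dataset = ir_datasets.load("msmarco-document/trec-dl-2019/judged")
docstore = dataset.docs_store()
judged_ids = pd.DataFrame(dataset.qrels_iter())["doc_id"].unique()

# Hypothetical placeholder for whatever change is applied to the text.
def transform_text(text):
    return text.lower()

# Fetch and transform only the ~16K judged documents, not all 3.2M.
records = [
    {"docno": doc.doc_id, "text": transform_text(doc.body)}
    for doc in docstore.get_many_iter(judged_ids)
]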

With your mentioned approach I would thus do the following:

import ir_datasets
import pandas as pd
import pyterrier as pt

if not pt.started():
    pt.init()

dataset_msmarco_document_trec_dl_2019_judged = ir_datasets.load("msmarco-document/trec-dl-2019/judged")
docstore = dataset_msmarco_document_trec_dl_2019_judged.docs_store()

df_queries = pt.get_dataset("trec-deep-learning-docs").get_topics("test")
df_qrels = pt.get_dataset("trec-deep-learning-docs").get_qrels("test")
df_docs = pd.DataFrame(docstore.get_many_iter(df_qrels.docno.unique()))

df_docs = df_docs.rename(columns={'body':'text', 'doc_id':'docno'})

index_path = "./pd_index"
indexer = pt.IterDictIndexer(index_path)

indexref = indexer.index(df_docs.to_dict(orient="records"))

bm25 = pt.BatchRetrieve(indexref, wmodel="BM25")

pt.Experiment(
    [bm25],
    df_queries,
    df_qrels,
    eval_metrics=["map", "recip_rank", "ndcg_cut_10"],
    names=["BM25"],
)

But this results in: [results screenshot]

What might be the part I'm missing? The stemmer and stopword settings should be the defaults, so I didn't set them manually in the indexer.
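
For reference, a quick way to sanity-check what actually got indexed (a minimal sketch using Terrier's collection statistics):

index = pt.IndexFactory.of(indexref)
# Prints the number of documents, tokens and unique terms in the judged-only index.
print(index.getCollectionStatistics().toString())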

cmacdonald commented 1 year ago

Your post doesn't make clear what is unexpected in the results.

To debug...

You may also want to change the number of results retrieved:

index = pt.IndexFactory.of(indexref)
# num_results raises the per-query cut-off from the default 1000 to one result
# per document in the (small) judged-only index
bm25 = pt.BatchRetrieve(index, wmodel='BM25', num_results=len(index))

maxhenze commented 1 year ago

Nevermind. It seems like I mixed a few things up. Nevertheless, thanks for your help 👍