Can not import the retrieval from anserini

Jia-py commented 1 year ago

Hi! Thanks for your great work! I want to run experiments on Terrier with the BM25 from Anserini, but I hit the following error: jnius.JavaException: JVM exception occurred: io/anserini/eval/Qrels java.lang.NoClassDefFoundError

Here is my code:

>>> indexref = pt.IndexRef.of('./index/data.properties')
>>> index = pt.IndexFactory.of(indexref)
>>> bm25 = pt.anserini.AnseriniBatchRetrieve(index,wmodel='BM25')

I have installed JDK 11 and found this code in the documentation:

trIndex = "/path/to/data.properties"
luceneIndex = "/path/to/lucene-index-dir"
BM25_ai = pt.anserini.AnseriniBatchRetrieve(luceneIndex, wmodel="BM25")

Could it be because I haven't imported the Lucene index? I'm currently using another open-source search library, beir, and have Elasticsearch running. Is there a convenient way to obtain the Lucene index that can be read in this context?

Thanks.

cmacdonald commented 1 year ago

Did you call pt.init() with Anserini in boot_packages? See, for example, https://github.com/terrier-org/pyterrier/blob/master/tests/anserini/test_anserini.py#L23

I think we have only tested with 0.9.2, which is probably old now. Which version of Anserini are you using?

cmacdonald commented 1 year ago

NoClassDefFoundError usually occurs because either the JVM process was forked or a dependency jar file is missing from the classpath. Because we ask for the fatjar, the dependencies should be included.

Can you show the Python stack trace? A Colab with a minimal working example would be helpful.

Jia-py commented 1 year ago

Thanks for your reply. Yes, I started pt.init() with Anserini 0.9.2, and here is the Python stack trace:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/opt/conda/envs/terrier/lib/python3.8/site-packages/pyterrier/anserini.py", line 68, in __init__
    from pyserini.search import SimpleSearcher
  File "/opt/conda/envs/terrier/lib/python3.8/site-packages/pyserini/search/__init__.py", line 17, in <module>
    from ._base import JQuery, JQueryGenerator, JDisjunctionMaxQueryGenerator, get_topics,\
  File "/opt/conda/envs/terrier/lib/python3.8/site-packages/pyserini/search/_base.py", line 35, in <module>
    JQrels = autoclass('io.anserini.eval.Qrels')
  File "/opt/conda/envs/terrier/lib/python3.8/site-packages/jnius/reflect.py", line 209, in autoclass
    c = find_javaclass(clsname)
  File "jnius/jnius_export_func.pxi", line 22, in jnius.find_javaclass
  File "jnius/jnius_utils.pxi", line 79, in jnius.check_exception
jnius.JavaException: JVM exception occurred: io/anserini/eval/Qrels java.lang.NoClassDefFoundError

cmacdonald commented 1 year ago

Ah, I forgot pyserini is involved. I think the pyserini version has to match the anserini version. Our unit tests use pyserini==0.9.4.

If your Anserini index is newer than that, then you can try upgrading. I'm happy to have a PR for more recent pyserini support, but it's not something we use ourselves.
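A minimal sketch of keeping the two versions paired, in case it helps. The coordinate string follows the boot_packages format used elsewhere in this thread. The helper names are hypothetical, and the "same major.minor" rule is only an assumption based on the 0.9.2 / 0.9.4 pairing mentioned above, not a documented compatibility guarantee.

```python
from importlib import metadata


def anserini_coordinate(version: str) -> str:
    """Build the Maven coordinate string for pt.init(boot_packages=[...])."""
    return f"io.anserini:anserini:{version}:fatjar"


def same_release_line(a: str, b: str) -> bool:
    """ASSUMED heuristic: treat versions as paired if major.minor agree."""
    return a.split(".")[:2] == b.split(".")[:2]


def installed_pyserini_version():
    """Return the installed pyserini version, or None if it is not installed."""
    try:
        return metadata.version("pyserini")
    except metadata.PackageNotFoundError:
        return None


def checked_coordinate(anserini_version: str) -> str:
    """Fail early if pyserini and the requested Anserini jar look mismatched."""
    pyserini_version = installed_pyserini_version()
    if pyserini_version is not None and not same_release_line(
        pyserini_version, anserini_version
    ):
        raise RuntimeError(
            f"pyserini {pyserini_version} may not match Anserini {anserini_version}"
        )
    return anserini_coordinate(anserini_version)


# Intended use, before anything else starts the JVM:
#   import pyterrier as pt
#   pt.init(boot_packages=[checked_coordinate("0.9.2")])
```

The check only catches obvious mismatches before the JVM starts; it does not prove that a given pair actually works together.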

Jia-py commented 1 year ago

Thank you, I'll have a try.

cmacdonald commented 1 year ago

I'm also thinking that Anserini support could move from PyTerrier itself into a smaller separate repo, just as we do for pyterrier_colbert etc. That would enable better unit testing.

cmacdonald commented 1 year ago

Let me know how you get on.

Jia-py commented 1 year ago

I downgraded pyterrier to 0.9.4 and downloaded the Lucene index from pyserini, but I hit this error:

>>> luceneIndex = "/root/.cache/pyserini/indexes/lucene-index.beir-v1.0.0-trec-covid.flat.20221116.505594.57b812594b11d064a23123137ae7dade"
>>> BM25_ai = pt.anserini.AnseriniBatchRetrieve(luceneIndex, wmodel="BM25")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/opt/conda/envs/terrier/lib/python3.8/site-packages/pyterrier/anserini.py", line 69, in __init__
    self.searcher = SimpleSearcher(index_location)
  File "/opt/conda/envs/terrier/lib/python3.8/site-packages/pyserini/search/_searcher.py", line 48, in __init__
    self.object = JSimpleSearcher(JString(index_dir))
  File "jnius/jnius_export_class.pxi", line 270, in jnius.JavaClass.__init__
  File "jnius/jnius_export_class.pxi", line 384, in jnius.JavaClass.call_constructor
  File "jnius/jnius_utils.pxi", line 79, in jnius.check_exception
jnius.JavaException: JVM exception occurred: Format version is not supported (resource BufferedChecksumIndexInput(MMapIndexInput(path="/root/.cache/pyserini/indexes/lucene-index.beir-v1.0.0-trec-covid.flat.20221116.505594.57b812594b11d064a23123137ae7dade/segments_1"))): 10 (needs to be between 7 and 9) org.apache.lucene.index.IndexFormatTooNewException

Then I upgraded Anserini via pt.init(boot_packages=["io.anserini:anserini:0.22.0:fatjar"]) and got a different error:

>>> luceneIndex = "/root/.cache/pyserini/indexes/lucene-index.beir-v1.0.0-trec-covid.flat.20221116.505594.57b812594b11d064a23123137ae7dade"
>>> BM25_ai = pt.anserini.AnseriniBatchRetrieve(luceneIndex, wmodel="BM25")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/opt/conda/envs/terrier/lib/python3.8/site-packages/pyterrier/anserini.py", line 69, in __init__
    self.searcher = SimpleSearcher(index_location)
  File "/opt/conda/envs/terrier/lib/python3.8/site-packages/pyserini/search/_searcher.py", line 49, in __init__
    self.num_docs = self.object.getTotalNumDocuments()
AttributeError: 'io.anserini.search.SimpleSearcher' object has no attribute 'getTotalNumDocuments'

cmacdonald commented 1 year ago

> I downgraded pyterrier to 0.9.4

You mean pyserini to 0.9.4?

> jnius.JavaException: JVM exception occurred: Format version is not supported (resource BufferedChecksumIndexInput(MMapIndexInput(path="/root/.cache/pyserini/indexes/lucene-index.beir-v1.0.0-trec-covid.flat.20221116.505594.57b812594b11d064a23123137ae7dade/segments_1"))): 10 (needs to be between 7 and 9) org.apache.lucene.index.IndexFormatTooNewException

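This error occurs because your Anserini and pyserini versions are too old for your index: the index uses format version 10, while they only read versions 7 to 9. So newer versions are needed.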

> AttributeError: 'io.anserini.search.SimpleSearcher' object has no attribute 'getTotalNumDocuments'

This error is because your pyserini version does not match your anserini fat jar. You have to keep them in sync somehow.
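For the index-format error, the message itself states which format version the index has and which range the loaded Lucene can read. A small sketch that pulls those numbers out of the message text is shown below. The function names are hypothetical, and the regular expression only assumes the "N (needs to be between A and B)" wording seen in this thread.

```python
import re

# Matches e.g. "10 (needs to be between 7 and 9)" as seen in the error above.
_FORMAT_ERROR = re.compile(r"(\d+) \(needs to be between (\d+) and (\d+)\)")


def parse_format_error(message: str):
    """Extract the index format version and the supported range, if present."""
    match = _FORMAT_ERROR.search(message)
    if match is None:
        return None
    found, lowest, highest = (int(group) for group in match.groups())
    return {"found": found, "min": lowest, "max": highest}


def advice(message: str) -> str:
    """Turn the parsed numbers into a short hint."""
    info = parse_format_error(message)
    if info is None:
        return "Not an index format version error."
    if info["found"] > info["max"]:
        return "Index is newer than the loaded Lucene: upgrade Anserini and pyserini together."
    if info["found"] < info["min"]:
        return "Index is older than the loaded Lucene: rebuild the index or use older versions."
    return "Index format is within the supported range."
```

Passing `str(exception)` from the caught jnius.JavaException to `advice` would give the hint; whichever direction the mismatch goes, the Anserini jar and pyserini still need to be changed as a pair.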

Jia-py commented 1 year ago

Thanks for your reply! I followed your advice and used anserini and pyserini both at 0.22.0. There is still something wrong with it.

>>> luceneIndex = "/root/.cache/pyserini/indexes/lucene-index.beir-v1.0.0-trec-covid.flat.20221116.505594.57b812594b11d064a23123137ae7dade"
>>> BM25_ai = pt.anserini.AnseriniBatchRetrieve(luceneIndex, wmodel="BM25")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/opt/conda/envs/terrier/lib/python3.8/site-packages/pyterrier/anserini.py", line 68, in __init__
    from pyserini.search import SimpleSearcher
  File "/opt/conda/envs/terrier/lib/python3.8/site-packages/pyserini/search/__init__.py", line 19, in <module>
    from .lucene import JLuceneSearcherResult, LuceneSimilarities, LuceneFusionSearcher, LuceneSearcher
  File "/opt/conda/envs/terrier/lib/python3.8/site-packages/pyserini/search/lucene/__init__.py", line 18, in <module>
    from ._impact_searcher import JImpactSearcherResult, LuceneImpactSearcher, SlimSearcher
  File "/opt/conda/envs/terrier/lib/python3.8/site-packages/pyserini/search/lucene/_impact_searcher.py", line 34, in <module>
    from pyserini.index import Document
  File "/opt/conda/envs/terrier/lib/python3.8/site-packages/pyserini/index/__init__.py", line 21, in <module>
    from .lucene._base import Document, Generator, IndexTerm, Posting, IndexReader
  File "/opt/conda/envs/terrier/lib/python3.8/site-packages/pyserini/index/lucene/__init__.py", line 17, in <module>
    from ._base import Document, Generator, IndexTerm, Posting, IndexReader
  File "/opt/conda/envs/terrier/lib/python3.8/site-packages/pyserini/index/lucene/_base.py", line 30, in <module>
    from pyserini.analysis import get_lucene_analyzer, JAnalyzer, JAnalyzerUtils
  File "/opt/conda/envs/terrier/lib/python3.8/site-packages/pyserini/analysis/__init__.py", line 17, in <module>
    from ._base import get_lucene_analyzer, Analyzer, JAnalyzer, JAnalyzerUtils, JDefaultEnglishAnalyzer, JWhiteSpaceAnalyzer
  File "/opt/conda/envs/terrier/lib/python3.8/site-packages/pyserini/analysis/_base.py", line 26, in <module>
    JDanishAnalyzer = autoclass('org.apache.lucene.analysis.da.DanishAnalyzer')
  File "/opt/conda/envs/terrier/lib/python3.8/site-packages/jnius/reflect.py", line 209, in autoclass
    c = find_javaclass(clsname)
  File "jnius/jnius_export_func.pxi", line 22, in jnius.find_javaclass
  File "jnius/jnius_utils.pxi", line 79, in jnius.check_exception
jnius.JavaException: JVM exception occurred: Bad type on operand stack
Exception Details:
  Location:
    org/apache/lucene/analysis/da/DanishAnalyzer.createComponents(Ljava/lang/String;)Lorg/apache/lucene/analysis/Analyzer$TokenStreamComponents; @65: invokespecial
  Reason:
    Type 'org/tartarus/snowball/ext/DanishStemmer' (current frame, stack[3]) is not assignable to 'org/tartarus/snowball/SnowballStemmer'
  Current Frame:
    bci: @65
    flags: { }
    locals: { 'org/apache/lucene/analysis/da/DanishAnalyzer', 'java/lang/String', 'org/apache/lucene/analysis/Tokenizer', 'org/apache/lucene/analysis/TokenStream' }
    stack: { uninitialized 53, uninitialized 53, 'org/apache/lucene/analysis/TokenStream', 'org/tartarus/snowball/ext/DanishStemmer' }
  Bytecode:
    0000000: bb00 0959 b700 0a4d bb00 0b59 2cb7 000c
    0000010: 4ebb 000d 592d 2ab4 000e b700 0f4e 2ab4
    0000020: 0008 b600 109a 0010 bb00 1159 2d2a b400
    0000030: 08b7 0012 4ebb 0013 592d bb00 1459 b700
    0000040: 15b7 0016 4ebb 0017 592c 2db7 0018 b0  
  Stackmap Table:
    append_frame(@53,Object[#57],Object[#58])
 java.lang.VerifyError

cmacdonald commented 1 year ago

Gosh, I have never seen this error before.

I think, maybe, that Terrier ships with one version of Snowball's Danish stemmer and Lucene ships with another. If this is the case, a bit of hacking will be needed to address this.

cmacdonald commented 1 year ago

Have you considered just using the results file output from Anserini, and using pt.Transformer.from_df(pt.io.read_results(file)) instead?
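A rough sketch of that route, assuming the Anserini output is a standard TREC-format run file (qid, Q0, docno, rank, score, run name on each line). `parse_trec_run` is a hypothetical stdlib-only helper for inspecting such a file; `run_file_as_transformer` just wraps the two PyTerrier calls named above and needs PyTerrier installed.

```python
def parse_trec_run(lines):
    """Parse TREC run lines of the form: qid Q0 docno rank score run_name."""
    rows = []
    for line in lines:
        parts = line.split()
        if len(parts) < 6:
            continue  # skip blank or malformed lines
        qid, _q0, docno, rank, score, name = parts[:6]
        rows.append(
            {
                "qid": qid,
                "docno": docno,
                "rank": int(rank),
                "score": float(score),
                "name": name,
            }
        )
    return rows


def run_file_as_transformer(path):
    """Wrap a saved run file as a PyTerrier transformer."""
    import pyterrier as pt  # imported lazily so the parser works without it

    return pt.Transformer.from_df(pt.io.read_results(path))
```

An open file can be passed straight to `parse_trec_run`, since it accepts any iterable of strings. The transformer returned by `run_file_as_transformer` can then be used in a pipeline or experiment in place of a live Anserini retriever.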

Jia-py commented 1 year ago

Thank you for your advice! But I didn't generate any result files with Pyserini.

I decided to just use PyTerrier to finish my work for now. By the way, after retrieval I get a dataframe with fields such as qid, docid, docno, rank, score, and query. How can I look up a document's text in the corpus directly by docid or docno? For some datasets (e.g., beir/trec-covid), I only found dataset.get_corpus_iter(), which iterates over the whole corpus but does not let me fetch a specific document's text directly.

Jia-py commented 1 year ago

Problem resolved. The document text can be accessed via index.getMetaIndex().
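For anyone landing here later, a minimal sketch of that lookup. It assumes the text was stored in the meta index when the index was built, under a key called "text" (the key name is an assumption; use whatever key your index has). The helper names are hypothetical; getKeys, getItem and getDocument are methods of Terrier's MetaIndex.

```python
def doc_text(index, docid, key="text"):
    """Fetch one stored metadata value (e.g. the text) for an internal docid."""
    meta = index.getMetaIndex()
    keys = list(meta.getKeys())
    if key not in keys:
        # The text is only available if it was saved as metadata at indexing time.
        raise KeyError(f"meta key {key!r} not in index; available keys: {keys}")
    return meta.getItem(key, int(docid))


def docid_for(index, docno):
    """Map an external docno to the internal docid via the meta index."""
    # Assumes reverse lookup on "docno" is enabled for this index.
    return index.getMetaIndex().getDocument("docno", str(docno))
```

With `index = pt.IndexFactory.of(indexref)` as in the first post, `doc_text(index, row.docid)` can be applied to each row of the results dataframe.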

terrier-org / pyterrier

Can not import the retrieval from anserini #396