terrier-org / pyterrier

A Python framework for performing information retrieval experiments, building on http://terrier.org/
https://pyterrier.readthedocs.io/
Mozilla Public License 2.0

Error when creating index for beir/dbpedia #412

Closed: Jia-py closed this issue 8 months ago

Jia-py commented 8 months ago

Describe the bug: Got the following error when creating an index for beir/dbpedia:

Traceback (most recent call last):
  File "run.py", line 178, in <module>
    main(args)
  File "run.py", line 30, in main
    indexref = indexer.index(dataset.get_corpus_iter())
  File "/home/work/.local/pyterrier/index.py", line 983, in index
    ParallelIndexer.buildParallel(j_collections, self.index_dir, Indexer, Merger)
  File "jnius/jnius_export_class.pxi", line 877, in jnius.JavaMethod.__call__
  File "jnius/jnius_export_class.pxi", line 1060, in jnius.JavaMethod.call_staticmethod
  File "jnius/jnius_utils.pxi", line 79, in jnius.check_exception
jnius.JavaException: JVM exception occurred: java.util.concurrent.ExecutionException: java.util.NoSuchElementException java.lang.RuntimeException

To Reproduce
Steps to reproduce the behavior:

  1. Which index - beir/dbpedia

The code I used:

dataset = pt.datasets.get_dataset('irds:beir/dbpedia-entity/test')
indexer = pt.IterDictIndexer('./index/{}'.format(args.dataset.replace('/','-')), meta={'docno':39, args.doc_field:4096}, meta_reverse=['docno','text'])
indexref = indexer.index(dataset.get_corpus_iter())
index = pt.IndexFactory.of(indexref)
seanmacavaney commented 8 months ago

Hi Jia,

I wasn't able to reproduce the issue using the PyTerrier sample code from this page: https://ir-datasets.com/beir#beir/dbpedia-entity

>>> import pyterrier as pt
>>> pt.init()
>>> dataset = pt.get_dataset('irds:beir/dbpedia-entity')
>>> indexer = pt.IterDictIndexer('./indices/beir_dbpedia-entity', meta={"docno": 200})
>>> index_ref = indexer.index(dataset.get_corpus_iter(), fields=['text', 'title', 'url'])
...
beir/dbpedia-entity documents: 100%|████| 4635922/4635922 [06:17<00:00, 12293.51it/s]
>>> dataset = pt.get_dataset('irds:beir/dbpedia-entity/dev')
>>> index_ref = pt.IndexRef.of('./indices/beir_dbpedia-entity') # assumes you have already built an index
>>> pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
>>> pipeline(dataset.get_topics())
...
                    qid    docid                                            docno  rank      score                                  query
0      INEX_LD-20120112  2007656         <dbpedia:Terminology_of_the_Vietnam_War>     0  25.399387                      vietnam war facts
1      INEX_LD-20120112  2723675             <dbpedia:Leaders_of_the_Vietnam_War>     1  25.087267                      vietnam war facts
2      INEX_LD-20120112  1225602    <dbpedia:List_of_songs_about_the_Vietnam_War>     2  25.045166                      vietnam war facts
3      INEX_LD-20120112  2325520  <dbpedia:The_Quicksand_War:_Prelude_to_Vietnam>     3  24.620978                      vietnam war facts
4      INEX_LD-20120112  1820977            <dbpedia:Legality_of_the_Vietnam_War>     4  24.605000                      vietnam war facts
...                 ...      ...                                              ...   ...        ...                                    ...
65114    TREC_Entity-17  1871347                   <dbpedia:The_Hazel_Scott_Show>   995  18.324102  chefs with a show on the food network
65115    TREC_Entity-17  2361894                        <dbpedia:RPM_(TV_series)>   996  18.324102  chefs with a show on the food network
65116    TREC_Entity-17  3073627               <dbpedia:Deutschlands_MeisterKoch>   997  18.320805  chefs with a show on the food network
65117    TREC_Entity-17   525996                   <dbpedia:Heinz_Winkler_(chef)>   998  18.308566  chefs with a show on the food network
65118    TREC_Entity-17  1961614                   <dbpedia:Matthew_Levin_(chef)>   999  18.308566  chefs with a show on the food network

[65119 rows x 6 columns]

Can you provide more details about the indexing setup that caused the error?

Thanks, sean

cmacdonald commented 8 months ago

We need to see the Java side of the exception. Could you try...

from jnius import JavaException

try:
    index_ref = indexer.index(dataset.get_corpus_iter(), fields=['text', 'title', 'url'])
except JavaException as ja:
    print('\n\t'.join(ja.stacktrace))
    raise ja

Jia-py commented 8 months ago

Hi @cmacdonald @seanmacavaney , thanks for your reply. I changed the length of docno from 39 to 200, and it worked.
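
For reference, a minimal sketch of the corrected indexer configuration (the directory name and the 4096-byte `text` length are illustrative, carried over from the original snippet; the 200-byte `docno` length is the value that resolved the error):

```python
import pyterrier as pt

pt.init()
dataset = pt.get_dataset('irds:beir/dbpedia-entity/test')
# Reserve 200 bytes for docno: dbpedia-entity docnos such as
# <dbpedia:Terminology_of_the_Vietnam_War> exceed the 39-byte
# limit used originally, which appears to have caused the error.
indexer = pt.IterDictIndexer(
    './index/beir-dbpedia-entity',
    meta={'docno': 200, 'text': 4096},
    meta_reverse=['docno', 'text'],
)
indexref = indexer.index(dataset.get_corpus_iter())
```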

seanmacavaney commented 8 months ago

@Jia-py -- it's usually a good idea to take the PyTerrier samples from https://ir-datasets.com/, since they already account for details like the maximum docno length.
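
When a sample isn't available, one way to pick a safe meta length up front is to scan the corpus for the longest UTF-8-encoded docno. `max_docno_length` below is a hypothetical helper, not part of the PyTerrier API:

```python
# Hypothetical helper: find the longest docno (in bytes) in a corpus
# iterator, so the meta={'docno': N} setting can be chosen safely.
def max_docno_length(corpus_iter):
    return max(len(doc['docno'].encode('utf-8')) for doc in corpus_iter)

# Stand-in documents; a real run would pass dataset.get_corpus_iter().
sample = [
    {'docno': '<dbpedia:Vietnam_War>', 'text': '...'},
    {'docno': '<dbpedia:Terminology_of_the_Vietnam_War>', 'text': '...'},
]
print(max_docno_length(sample))  # 40
```

Note that this consumes the iterator, so call `dataset.get_corpus_iter()` again when actually indexing.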