Closed by Jia-py 8 months ago
Hi Jia,
I wasn't able to reproduce the issue using the PyTerrier sample code from this page: https://ir-datasets.com/beir#beir/dbpedia-entity
>>> import pyterrier as pt
>>> pt.init()
>>> dataset = pt.get_dataset('irds:beir/dbpedia-entity')
>>> indexer = pt.IterDictIndexer('./indices/beir_dbpedia-entity', meta={"docno": 200})
>>> index_ref = indexer.index(dataset.get_corpus_iter(), fields=['text', 'title', 'url'])
...
beir/dbpedia-entity documents: 100%|████| 4635922/4635922 [06:17<00:00, 12293.51it/s]
>>> dataset = pt.get_dataset('irds:beir/dbpedia-entity/dev')
>>> index_ref = pt.IndexRef.of('./indices/beir_dbpedia-entity') # assumes you have already built an index
>>> pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
>>> pipeline(dataset.get_topics())
...
       qid                 docid  docno                                            rank      score  query
    0  INEX_LD-20120112  2007656  <dbpedia:Terminology_of_the_Vietnam_War>            0  25.399387  vietnam war facts
    1  INEX_LD-20120112  2723675  <dbpedia:Leaders_of_the_Vietnam_War>                1  25.087267  vietnam war facts
    2  INEX_LD-20120112  1225602  <dbpedia:List_of_songs_about_the_Vietnam_War>       2  25.045166  vietnam war facts
    3  INEX_LD-20120112  2325520  <dbpedia:The_Quicksand_War:_Prelude_to_Vietnam>     3  24.620978  vietnam war facts
    4  INEX_LD-20120112  1820977  <dbpedia:Legality_of_the_Vietnam_War>               4  24.605000  vietnam war facts
  ...  ...                   ...  ...                                               ...        ...  ...
65114  TREC_Entity-17    1871347  <dbpedia:The_Hazel_Scott_Show>                    995  18.324102  chefs with a show on the food network
65115  TREC_Entity-17    2361894  <dbpedia:RPM_(TV_series)>                         996  18.324102  chefs with a show on the food network
65116  TREC_Entity-17    3073627  <dbpedia:Deutschlands_MeisterKoch>                997  18.320805  chefs with a show on the food network
65117  TREC_Entity-17     525996  <dbpedia:Heinz_Winkler_(chef)>                    998  18.308566  chefs with a show on the food network
65118  TREC_Entity-17    1961614  <dbpedia:Matthew_Levin_(chef)>                    999  18.308566  chefs with a show on the food network
[65119 rows x 6 columns]
Can you provide more details about the indexing setup that caused the error?
Thanks, sean
We need to see the Java side of the exception. Could you try...
from jnius import JavaException

try:
    index_ref = indexer.index(dataset.get_corpus_iter(), fields=['text', 'title', 'url'])
except JavaException as ja:
    # Print the Java-side stack trace, which the Python traceback does not show
    print('\n\t'.join(ja.stacktrace))
    raise
Hi @cmacdonald @seanmacavaney, thanks for your replies. I changed the maximum docno length from 39 to 200, and it worked.
@Jia-py, it's usually a good idea to start from the PyTerrier samples at https://ir-datasets.com/, especially because they handle details like the maximum docno length.
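For corpora without a ready-made sample, one way to pick a safe value is to scan the document identifiers before indexing. The following is only a minimal sketch: the helper names `max_docno_length` and `docno_meta` are hypothetical (they are not part of PyTerrier), and it assumes that documents are dicts with a `docno` key and that the limit is counted in characters.

```python
from typing import Dict, Iterable


def max_docno_length(docs: Iterable[Dict[str, str]]) -> int:
    """Return the length of the longest 'docno' value (0 for an empty corpus)."""
    return max((len(doc["docno"]) for doc in docs), default=0)


def docno_meta(docs: Iterable[Dict[str, str]], headroom: int = 0) -> Dict[str, int]:
    """Build a meta dict of the form {'docno': N}, with optional spare room."""
    return {"docno": max_docno_length(docs) + headroom}


# Hypothetical sample documents in the same shape as the corpus iterator yields
sample = [
    {"docno": "<dbpedia:Animal_Farm>", "text": "..."},
    {"docno": "<dbpedia:Terminology_of_the_Vietnam_War>", "text": "..."},
]

meta = docno_meta(sample, headroom=10)
print(meta)
```

The resulting dict can be passed as the `meta` argument in the same way as `meta={"docno": 200}` in the sample above. Note that scanning consumes an iterator, so `dataset.get_corpus_iter()` would need to be called a second time for the actual indexing pass.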
Describe the bug: Got the following error when creating an index for dbpedia
To Reproduce: Steps to reproduce the behavior:
The code I used: