nnistelrooij / Information-Retrieval

Repository for research project of the 2019 Information Retrieval course at Radboud University in Nijmegen.

Pyserini #4

Closed by meszlili96 4 years ago

meszlili96 commented 4 years ago

Extract the data from the Lucene indexes and build the tables.

meszlili96 commented 4 years ago

Pyserini does not work well for our project, so I tried using Chris's project instead, but running the application fails with an error. The problem appears to be in the index, even though Pyserini and Anserini can both work with it. I indexed the collection exactly as I did for Pyserini, and the Anserini demo uses the same indexing. The error is the following:

Exception in thread "main" java.lang.RuntimeException: There should be only one leaf, index the collection using one writer
    at nl.ru.convert.Convert.<init>(Convert.java:45)
    at nl.ru.convert.Convert.main(Convert.java:192)

This originates from the following line, which is exactly the same as in the Anserini code:

reader = DirectoryReader.open(FSDirectory.open(indexPath));

Pyserini uses the SimpleSearcher class from the Anserini project: https://github.com/castorini/anserini/blob/master/src/main/java/io/anserini/search/SimpleSearcher.java

In the Anserini demo SearchCollection is used: https://github.com/castorini/anserini/blob/master/src/main/java/io/anserini/search/SearchCollection.java

Chris's code: https://github.com/Chriskamphuis/olddog/blob/master/src/main/java/nl/ru/convert/Convert.java

nnistelrooij commented 4 years ago

The dict, docs, terms, and qrels tables can now be downloaded from here. These are based upon the index with one leaf, so most problems should now be fixed.
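
For anyone hitting the same "only one leaf" error, here is a minimal sketch of one way to check how many leaves (segments) an existing Lucene index has and to merge them into one. It is not necessarily how the index above was rebuilt. The class name LeafCheck and its method names are made up for illustration, and the sketch assumes a Lucene version that has the no-argument IndexWriterConfig constructor and can read the index format in question.

```java
import java.io.IOException;
import java.nio.file.Path;
import java.nio.file.Paths;

import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

// Hypothetical helper, not part of Anserini, Pyserini, or olddog.
public class LeafCheck {

    /** Returns the number of leaves (segments) in the index at indexPath. */
    static int countLeaves(Path indexPath) throws IOException {
        try (Directory dir = FSDirectory.open(indexPath);
             DirectoryReader reader = DirectoryReader.open(dir)) {
            return reader.leaves().size();
        }
    }

    /** Merges all segments of an existing index into a single segment. */
    static void mergeToSingleLeaf(Path indexPath) throws IOException {
        // APPEND opens the existing index and fails if there is none,
        // so a wrong path does not silently create an empty index.
        IndexWriterConfig config = new IndexWriterConfig()
                .setOpenMode(IndexWriterConfig.OpenMode.APPEND);
        try (Directory dir = FSDirectory.open(indexPath);
             IndexWriter writer = new IndexWriter(dir, config)) {
            writer.forceMerge(1); // blocks until the merge has finished
            writer.commit();
        }
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("Usage: LeafCheck <path-to-index>");
            return;
        }
        Path indexPath = Paths.get(args[0]);
        System.out.println("Leaves before: " + countLeaves(indexPath));
        mergeToSingleLeaf(indexPath);
        System.out.println("Leaves after: " + countLeaves(indexPath));
    }
}
```

Force-merging rewrites the index in place and can be slow on a large collection, so it is safer to run it on a copy of the index first.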