quickwit-oss / search-benchmark-game

Search engine benchmark (Tantivy, Lucene, PISA, ...)
https://tantivy-search.github.io/bench/
MIT License

Can't parse wikipedia articles anymore #12

Closed by kgardas 4 years ago

kgardas commented 5 years ago

It looks like the benchmark was changed so that it now probably expects /home/paul/git/search-index-benchmark-game/corpus.json (whatever format that is) and no longer supports the Wikipedia articles file. For example, an attempt to build the index with the lucene-8.0.0 engine fails like this:

$ make idx
---- Indexing Lucene ----
java -server -cp build/libs/search-index-benchmark-game-lucene-1.0-SNAPSHOT-all.jar BuildIndex idx < /ssd/karel/vcs/search-benchmark-game/wiki-articles.json
Exception in thread "main" java.lang.NullPointerException
        at BuildIndex.main(BuildIndex.java:39)
Makefile:17: recipe for target 'idx' failed
make: *** [idx] Error 1

which suggests a parse error or, more precisely, that the code cannot get an id from the JSON line. The problem is that the Wikipedia articles have no id field, only url, title and body.

A very similar result is obtained when testing the tantivy-0.9 engine:

$ make index

---- Indexing tantivy ----
export RUST_LOG=info && target/release/build_index "idx" < /ssd/karel/vcs/search-benchmark-game/wiki-articles.json
thread 'main' panicked at 'called `Result::unwrap()` on an `Err` value: InvalidArgument("Failed to parse document NoSuchFieldInSchema(\"body\")")', src/libcore/result.rs:997:5
note: Run with `RUST_BACKTRACE=1` environment variable to display a backtrace.
Makefile:19: recipe for target 'idx' failed
make: *** [idx] Error 101

Again, the code expects only the id and text JSON fields...
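As a possible workaround, the Wikipedia dump could be converted to the shape the indexers seem to want. The following is only a minimal sketch, not part of the benchmark: it assumes one JSON object per line with url, title and body fields (as described above), and it assumes the target format is one JSON object per line with id and text fields. Using the url as the id and joining title and body into text are both my own assumptions, and the function names are hypothetical.

```python
import json
import sys


def convert_line(line):
    """Map one Wikipedia-style JSON line to an {"id", "text"} record."""
    doc = json.loads(line)
    return {
        # Assumption: the url is unique per article, so it can serve as the id.
        "id": doc["url"],
        # Assumption: title and body are simply joined into one text field.
        "text": doc["title"] + "\n" + doc["body"],
    }


def convert(lines):
    """Yield converted JSON lines, skipping blank input lines."""
    for line in lines:
        if line.strip():
            yield json.dumps(convert_line(line))


def main():
    """Read JSON lines from stdin and write converted lines to stdout."""
    for out in convert(sys.stdin):
        print(out)
```

If `main()` is called from a script, it can be used as a filter, reading wiki-articles.json on stdin and writing the converted lines on stdout, and the result can then be fed to the indexers in place of the original file.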

fulmicoton commented 4 years ago

I think this was fixed