Can't parse wikipedia articles anymore

It looks like the benchmark was changed so that it now expects /home/paul/git/search-index-benchmark-game/corpus.json (whatever format that is) and no longer supports the Wikipedia articles dump. For example, attempting to build the index with the lucene-8.0.0 engine gives:

$ make idx
---- Indexing Lucene ----
java -server -cp build/libs/search-index-benchmark-game-lucene-1.0-SNAPSHOT-all.jar BuildIndex idx < /ssd/karel/vcs/search-benchmark-game/wiki-articles.json
Exception in thread "main" java.lang.NullPointerException
        at BuildIndex.main(BuildIndex.java:39)
Makefile:17: recipe for target 'idx' failed
make: *** [idx] Error 1

This means a parse error or, more precisely, that the id could not be read from the JSON line. The problem is that the Wikipedia articles have no id field, only url, title and body.

A very similar result is obtained when testing the tantivy-0.9 engine:

$ make index

---- Indexing tantivy ----
export RUST_LOG=info && target/release/build_index "idx" < /ssd/karel/vcs/search-benchmark-game/wiki-articles.json
thread 'main' panicked at 'called `Result::unwrap()` on an `Err` value: InvalidArgument("Failed to parse document NoSuchFieldInSchema(\"body\")")', src/libcore/result.rs:997:5
note: Run with `RUST_BACKTRACE=1` environment variable to display a backtrace.
Makefile:19: recipe for target 'idx' failed
make: *** [idx] Error 101

Again, the code expects just the id and text JSON fields...
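
As a possible workaround, the Wikipedia dump could be rewritten into the shape the engines seem to expect. The sketch below is only an illustration under assumptions: it takes the url as the id and the body as the text, and drops the title. Whether that mapping matches what the benchmark authors intend is not confirmed, and the function names are made up for this example.

```python
import json
import sys


def convert_line(line):
    """Map one Wikipedia-dump JSON line (url, title, body) to {id, text}."""
    doc = json.loads(line)
    # Assumption: the url is unique per article, so it can stand in for the id.
    # Assumption: only the body is indexed; the title is dropped.
    return json.dumps({"id": doc["url"], "text": doc["body"]})


def convert(lines):
    """Yield converted lines, skipping blank input lines."""
    for line in lines:
        if line.strip():
            yield convert_line(line)


if __name__ == "__main__":
    # Read JSON lines from stdin, write converted JSON lines to stdout.
    for out in convert(sys.stdin):
        print(out)
```

Saved under a hypothetical name such as convert.py, it would be run as `python convert.py < wiki-articles.json > corpus.json`, and the resulting file fed to `make idx` or `make index` instead of the original dump.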

quickwit-oss / search-benchmark-game

Can't parse wikipedia articles anymore #12