Thanks for trying.
It should be obvious that the benchmark code depended on the parent project's source code, since we often test our latest changes with these benchmarks. For the benchmark to work, you needed to compile Datalevin at least once first to build all the classes, e.g.
cd ..
lein test
Apparently, I should not have assumed that people know this, so I have changed the dependency to use the released Datalevin library instead. Please pull and try again.
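For reference, the dependency change in the bench's project.clj looks roughly like this (the coordinates and version here are illustrative, not the exact file contents):
(defproject search-bench "0.1.0"
  :dependencies [[org.clojure/clojure "1.11.1"]
                 [datalevin "0.8.16"]]) ;; released artifact instead of the parent source tree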
Thank you for your thoughtful and prompt response!
As I walk through this document on my first foray into the project, I'm seeing the documentation with fresh eyes, and I'm enjoying how the writing leaves me with a greater sense of clarity.
I'm also noticing a slight typo in https://github.com/juji-io/datalevin/tree/master/search-bench#test-data, where the output path should read data/wiki.json, as follows:
wget https://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles.xml.bz2
wikiextractor -o - --json --no-templates enwiki-latest-pages-articles.xml.bz2 |
jq -s '.[] | select((.text | length) > 500) | {url, text}' > data/wiki.json
Glad to bring greater clarity and precision.
Also, maybe this is another obvious answer to those more experienced: is there a way to saturate all or more of the cores on my machine for the search? I'm currently seeing only one core utilized.
Thanks. Datalevin is single-writer, so it uses only one core when writing/indexing. However, you can saturate all cores when reading/searching.
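For example, a rough sketch of fanning searches out with pmap; here engine and queries stand for whatever the bench builds, and d/search is the search-engine query function:
(require '[datalevin.core :as d])

;; each search is independent, so pmap can keep all cores busy on the read side
(defn search-all [engine queries]
  (doall (pmap #(d/search engine %) queries)))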
The output path also needs to read data/queries40k.txt in two places here: https://github.com/juji-io/datalevin/tree/master/search-bench#test-queries.
wget https://trec.nist.gov/data/million.query/09/09.mq.topics.20001-60000.gz
gzip -d 09.mq.topics.20001-60000.gz
mv 09.mq.topics.20001-60000 data/queries40k.txt
sed -i -e 's/\([0-9]\+\)\:[0-9]\://g' data/queries40k.txt
Got another error 57m24s into the process 🤦‍♂️.
If you have already built the index, you don't have to redo it; just comment out the line that builds the index.
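In Clojure, the #_ reader macro is the easiest way to do that; the call below is only a stand-in for the actual indexing line in the bench:
;; #_ makes the reader skip the next form entirely, so indexing is not rerun
#_(index-wiki-json engine "data/wiki.json")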
I was also getting errors running sed on 09.mq.topics.20001-60000 due to encodings it didn't know how to read. I ended up using iconv to convert the file to UTF-8 and then piped it through awk instead: I found awk's syntax less cumbersome than all the escape characters sed needs, awk produced an intermediate result I could use to troubleshoot the error, and sed kept complaining about the file not existing, whereas awk would happily spit out results. 🤷
With these changes, my version of https://github.com/juji-io/datalevin/tree/master/search-bench#test-queries reads:
wget https://trec.nist.gov/data/million.query/09/09.mq.topics.20001-60000.gz
gzip -d 09.mq.topics.20001-60000.gz
iconv -f ISO-8859-1 -t UTF-8 09.mq.topics.20001-60000 |
awk '{gsub(/[0-9]+:[0-9]:/,"")}1' > data/queries40k.txt
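For anyone else hitting this, a quick REPL line count is a cheap sanity check on the cleaned file (illustrative only; the ~40k expectation just comes from the file name):
(require '[clojure.java.io :as io])

;; count lines in the cleaned query file; expect on the order of 40k
(with-open [r (io/reader "data/queries40k.txt" :encoding "UTF-8")]
  (count (line-seq r)))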
Also, I noticed that https://github.com/juji-io/datalevin/tree/master/search-bench#test-data needs a mkdir data step to be complete:
wget https://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles.xml.bz2
mkdir -p data
wikiextractor -o - --json --no-templates enwiki-latest-pages-articles.xml.bz2 |
jq -s '.[] | select((.text | length) > 500) | {url, text}' > data/wiki.json
Anyways, I have the query results working now. Lookin' good! Thanks for all the feedback. 🙏
Not sure if you'd want a PR wrapping datalevin.bench/index-wiki-json to check whether the relevant file already exists and skip the step if so, or would you rather handle that yourself?
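Something like this generic guard is what I had in mind (a sketch only; the output path and the exact call to datalevin.bench/index-wiki-json are guesses, not the real signature):
(require '[clojure.java.io :as io])

;; run build-fn only when its output doesn't exist yet
(defn build-when-missing [out-path build-fn]
  (if (.exists (io/file out-path))
    (println out-path "already exists - skipping")
    (build-fn)))

;; hypothetical usage: (build-when-missing "data/wiki-datalevin" #(index-wiki-json opts))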
Sure thing. PR is welcome. Thanks.
Following the search-bench readme, I successfully downloaded and processed all the test data, but running the script produces the following error. How would I address this in general?