memgraph / memgraph

Open-source graph database, tuned for dynamic analytics environments. Easy to adopt, scale and own.
https://memgraph.com

[BUG] Documentation: mgbench setup prerequisites are unclear #714

Closed jhb closed 1 year ago

jhb commented 1 year ago

Describe the bug

I want to reproduce the benchmarks on https://memgraph.com/benchgraph, but I am not sure how either memgraph or neo4j is supposed to be installed. The only information available is at https://github.com/memgraph/memgraph/tree/master/tests/mgbench#prerequisites, which only says to set them up in their binary form.

Expected behavior

I would love to know whether setting up the databases in their binary form means that I have to compile memgraph myself (and if so, with what options), or whether I should use the docker image and run the tests inside it. Or is there another way? Which one is correct?

It would be great if the setup were described for memgraph (and ideally also for neo4j) in a way that makes it possible to be certain that the setup is correct, and that any deviations from the published results on https://memgraph.com/benchgraph are not caused by the setup.

antejavor commented 1 year ago

Hi @jhb, thanks for reporting the issue. I am aware that the prerequisites are not precise and detailed at the moment. After I post more details on how to run the benchmarks in this issue, I will also update the methodology part. Also, these things can be done in different ways: inside docker, by downloading memgraph, etc. As you suggested, I will post the steps for the setup we used to get the published results.

The steps are a bit tiresome, mainly because we used mgBench as an in-house benchmark tool, and it is tightly coupled with Memgraph. We plan to make these steps a bit easier in the future.

Memgraph part:

  1. You will need a Linux machine (CentOS, Debian or Ubuntu) to compile Memgraph and the client.
  2. Compile memgraph (for help, look at this short guide [here](https://memgraph.notion.site/Quick-Start-82a99a85e62a4e3d89f6a9fb6d35626d)); take a look at steps 3 and 4 below before diving in.
  3. In the build configuration step, replace cmake .. with cmake -DCMAKE_BUILD_TYPE=Release .., since you want to test performance on a Release build.
  4. Compile the whole project, because we are using a custom C++ client that lives in this repository. You can also compile just memgraph and the client with make -j$(nproc) memgraph and make -j$(nproc) memgraph__mgbench__client, but compiling everything is easier.
  5. At this point, locate your binary in the build folder; it carries a _Release suffix. (A condensed command sketch follows below.)
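
Putting the steps together, a minimal sketch might look like this (paths and the dependency setup are simplified; the quick-start guide linked above covers the toolchain details):

git clone https://github.com/memgraph/memgraph
cd memgraph
./init                                              # prepare dependencies (see the quick-start guide)
mkdir -p build && cd build
cmake -DCMAKE_BUILD_TYPE=Release ..                 # step 3: configure a Release build
make -j$(nproc) memgraph memgraph__mgbench__client  # step 4: server plus benchmark client
# step 5: the resulting binary sits in the build folder with a _Release suffix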

Neo4j part:

  1. Download the binary version of the 5.1 community edition from their website for the appropriate OS.
  2. Make sure you have at least JDK 17 on your machine.
  3. The benchmark process will disable auth in the Neo4j config file. (A short sketch follows below.)
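
A hypothetical sketch of those steps (the exact tarball name comes from the Neo4j download page):

java -version                                   # step 2: confirm JDK 17 or newer is available
tar xzvf neo4j-community-5.1.0-unix.tar.gz      # step 1: unpack the community edition tarball
ls neo4j-community-5.1.0/bin/neo4j              # the benchmark will later point at this root folder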

After that, it is time to run the benchmark. Position yourself in the mgbench directory and call:

graph_bench.py \
    --vendor memgraph /home/memgraph/build/binary \
    --dataset-group basic \
    --dataset-size small \
    --realistic 100 30 70 0 0 \
    --realistic 100 50 50 0 0 \
    --realistic 100 70 30 0 0 \
    --realistic 100 30 40 10 20 \
    --mixed 100 30 0 0 0 70

Of course, replace the path with the path to your standalone Memgraph binary. When you decide to run Neo4j, replace the vendor name with neo4j and pass the path to the root neo4j directory, not the path to the binary. Like this:

graph_bench.py \
    --vendor neo4j /home/neo4j-5.2 \
    --dataset-group basic \
    --dataset-size small \
    --realistic 100 30 70 0 0 \
    --realistic 100 50 50 0 0 \
    --realistic 100 70 30 0 0 \
    --realistic 100 30 40 10 20 \
    --mixed 100 30 0 0 0 70

If you have any additional questions regarding the benchmark or need help, feel free to ask, either by email or in another issue :D

jhb commented 1 year ago

I am behind a firewall. I did the following:

git clone https://github.com/memgraph/memgraph
cd memgraph
./environment/os/ubuntu-22.04.sh install TOOLCHAIN_RUN_DEPS
./environment/os/ubuntu-22.04.sh install MEMGRAPH_BUILD_DEPS
wget https://s3-eu-west-1.amazonaws.com/deps.memgraph.io/toolchain-v4/toolchain-v4-binaries-ubuntu-22.04-amd64.tar.gz
tar xzvfm toolchain-v4-binaries-ubuntu-22.04-amd64.tar.gz -C /opt
source /opt/toolchain-v4/activate
./init

which leads to

root@a761b0d59c33:/memgraph# ./init
ALL BUILD PACKAGES: The right operating system.
The right architecture!
git make pkg-config curl wget uuid-dev default-jre-headless libreadline-dev libpython3-dev python3-dev libssl-dev libseccomp-dev netcat python3 python3-virtualenv python3-pip python3-yaml libcurl4-openssl-dev sbcl doxygen graphviz mono-runtime mono-mcs zip unzip default-jdk-headless dotnet-sdk-6.0 golang nodejs npm autoconf libtool
The right operating system.
The right architecture!
All packages are in-place...
2022-12-20 14:29:03 URL:https://beta.quicklisp.org/quicklisp.lisp [57144/57144] -> "quicklisp.lisp" [1]

  ==== quicklisp quickstart 2015-01-28 loaded ====

    To continue with installation, evaluate: (quicklisp-quickstart:install)

    For installation options, evaluate: (quicklisp-quickstart:help)

WARNING: Making quicklisp part of the install pathname directory
Unhandled SB-BSD-SOCKETS:TRY-AGAIN-ERROR in thread #<SB-THREAD:THREAD "main thread" RUNNING
                                                      {1004BDC173}>:
  Name service error in "getaddrinfo": -3 (Temporary failure in name resolution)

Backtrace for: #<SB-THREAD:THREAD "main thread" RUNNING {1004BDC173}>
0: (SB-DEBUG::DEBUGGER-DISABLED-HOOK #<SB-BSD-SOCKETS:TRY-AGAIN-ERROR {1002FB1643}> #<unused argument> :QUIT T)
1: (SB-DEBUG::RUN-HOOK *INVOKE-DEBUGGER-HOOK* #<SB-BSD-SOCKETS:TRY-AGAIN-ERROR {1002FB1643}>)
2: (INVOKE-DEBUGGER #<SB-BSD-SOCKETS:TRY-AGAIN-ERROR {1002FB1643}>)
3: (ERROR SB-BSD-SOCKETS:TRY-AGAIN-ERROR :ERRNO -3 :SYSCALL "getaddrinfo")
4: (SB-BSD-SOCKETS:NAME-SERVICE-ERROR "getaddrinfo" -3)
5: (SB-BSD-SOCKETS:GET-HOST-BY-NAME #<unavailable argument>)
6: ((:METHOD QLQS-NETWORK::%OPEN-CONNECTION (QLQS-IMPL:SBCL T T)) #<QLQS-IMPL:SBCL {1004D78213}> "beta.quicklisp.org" 80) [fast-method]
7: ((:METHOD QLQS-NETWORK::%CALL-WITH-CONNECTION (T T T T)) #<QLQS-IMPL:SBCL {1004D78213}> "beta.quicklisp.org" 80 #<FUNCTION (LAMBDA (QLQS-HTTP::CONNECTION) :IN QLQS-HTTP:FETCH) {1002E058EB}>) [fast-method]
8: (QLQS-HTTP:FETCH #<QLQS-HTTP::URL "http://beta.quicklisp.org/client/quicklisp.sexp"> #P"/root/quicklisp/tmp/fetch.dat" :FOLLOW-REDIRECTS T :QUIETLY NIL :MAXIMUM-REDIRECTS NIL)
9: (QUICKLISP-QUICKSTART::RENAMING-FETCH "http://beta.quicklisp.org/client/quicklisp.sexp" #P"/root/quicklisp/tmp/client-info.sexp")
10: (QUICKLISP-QUICKSTART::FETCH-CLIENT-INFO-PLIST "http://beta.quicklisp.org/client/quicklisp.sexp")
11: (QUICKLISP-QUICKSTART::FETCH-CLIENT-INFO "http://beta.quicklisp.org/client/quicklisp.sexp")
12: (QUICKLISP-QUICKSTART::INITIAL-INSTALL :CLIENT-URL "http://beta.quicklisp.org/client/quicklisp.sexp" :DIST-URL NIL)
13: ((:METHOD QLQS-IMPL-UTIL::%CALL-WITH-QUIET-COMPILATION (T T)) #<QLQS-IMPL:SBCL {1004D78213}> #<FUNCTION (LAMBDA NIL :IN QUICKLISP-QUICKSTART:INSTALL) {1002D736DB}>) [fast-method]
14: ((:METHOD QLQS-IMPL-UTIL::%CALL-WITH-QUIET-COMPILATION :AROUND (QLQS-IMPL:SBCL T)) #<QLQS-IMPL:SBCL {1004D78213}> #<FUNCTION (LAMBDA NIL :IN QUICKLISP-QUICKSTART:INSTALL) {1002D736DB}>) [fast-method]
15: (QUICKLISP-QUICKSTART:INSTALL :PATH "/root/quicklisp" :PROXY NIL :CLIENT-URL NIL :CLIENT-VERSION NIL :DIST-URL NIL :DIST-VERSION NIL)
16: (SB-INT:SIMPLE-EVAL-IN-LEXENV (QUICKLISP-QUICKSTART:INSTALL :PATH "/root/quicklisp") #<NULL-LEXENV>)
17: (EVAL-TLF (QUICKLISP-QUICKSTART:INSTALL :PATH "/root/quicklisp") NIL NIL)
18: ((LABELS SB-FASL::EVAL-FORM :IN SB-INT:LOAD-AS-SOURCE) (QUICKLISP-QUICKSTART:INSTALL :PATH "/root/quicklisp") NIL)
19: (SB-INT:LOAD-AS-SOURCE #<SB-SYS:FD-STREAM for "standard input" {100183A143}> :VERBOSE NIL :PRINT NIL :CONTEXT "loading")
20: ((LABELS SB-FASL::LOAD-STREAM-1 :IN LOAD) #<SB-SYS:FD-STREAM for "standard input" {100183A143}> NIL)
21: (SB-FASL::CALL-WITH-LOAD-BINDINGS #<FUNCTION (LABELS SB-FASL::LOAD-STREAM-1 :IN LOAD) {7FB4B991F80B}> #<SB-SYS:FD-STREAM for "standard input" {100183A143}> NIL #<SB-SYS:FD-STREAM for "standard input" {100183A143}>)
22: (LOAD #<SB-SYS:FD-STREAM for "standard input" {100183A143}> :VERBOSE NIL :PRINT NIL :IF-DOES-NOT-EXIST T :EXTERNAL-FORMAT :DEFAULT)
23: ((FLET SB-IMPL::LOAD-SCRIPT :IN SB-IMPL::PROCESS-SCRIPT) #<SB-SYS:FD-STREAM for "standard input" {100183A143}>)
24: ((FLET SB-UNIX::BODY :IN SB-IMPL::PROCESS-SCRIPT))
25: ((FLET "WITHOUT-INTERRUPTS-BODY-11" :IN SB-IMPL::PROCESS-SCRIPT))
26: (SB-IMPL::PROCESS-SCRIPT T)
27: (SB-IMPL::TOPLEVEL-INIT)
28: ((FLET SB-UNIX::BODY :IN SB-IMPL::START-LISP))
29: ((FLET "WITHOUT-INTERRUPTS-BODY-3" :IN SB-IMPL::START-LISP))
30: (SB-IMPL::START-LISP)

http_proxy and https_proxy (lower- and uppercase) are correctly set. Any ideas?

antejavor commented 1 year ago

Hi @jhb,

At the point of the socket error, quicklisp is being fetched via HTTP URLs. Here is my run for reference:

2022-12-22 13:19:32 URL:https://beta.quicklisp.org/quicklisp.lisp [57144/57144] -> "quicklisp.lisp" [1]

  ==== quicklisp quickstart 2015-01-28 loaded ====

    To continue with installation, evaluate: (quicklisp-quickstart:install)

    For installation options, evaluate: (quicklisp-quickstart:help)

WARNING: Making quicklisp part of the install pathname directory
; Fetching #<URL "http://beta.quicklisp.org/client/quicklisp.sexp">
; 0.82KB
==================================================
839 bytes in 0.00 seconds (204.83KB/sec)
; Fetching #<URL "http://beta.quicklisp.org/client/2021-02-13/quicklisp.tar">
; 260.00KB
==================================================
266,240 bytes in 0.04 seconds (6499.84KB/sec)
; Fetching #<URL "http://beta.quicklisp.org/client/2021-02-11/setup.lisp">
; 4.94KB
==================================================
5,057 bytes in 0.00 seconds (1234.62KB/sec)
; Fetching #<URL "http://beta.quicklisp.org/asdf/3.2.1/asdf.lisp">
; 628.18KB
==================================================
643,253 bytes in 0.06 seconds (10469.26KB/sec)
; Fetching #<URL "http://beta.quicklisp.org/dist/quicklisp.txt">
; 0.40KB
==================================================
408 bytes in 0.00 seconds (0.00KB/sec)
Installing dist "quicklisp" version "2022-11-07".
; Fetching #<URL "http://beta.quicklisp.org/dist/quicklisp/2022-11-07/releases.txt">
; 527.32KB
==================================================
539,973 bytes in 0.06 seconds (9416.21KB/sec)
; Fetching #<URL "http://beta.quicklisp.org/dist/quicklisp/2022-11-07/systems.txt">
; 401.22KB
==================================================
410,847 bytes in 0.02 seconds (25074.54KB/sec)

I am not 100% sure what is causing the issue here; we didn't hit that exact error ourselves, but the firewall is the obvious suspect and the first thing I would consider. Maybe trying to fetch quicklisp outside of the ./init process would be a good way to debug?
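
For example, a minimal manual run (URL and install path copied from your log; the usual proxy environment variables apply):

wget https://beta.quicklisp.org/quicklisp.lisp
sbcl --non-interactive --load quicklisp.lisp \
     --eval '(quicklisp-quickstart:install :path "/root/quicklisp")'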

We had some issues regarding the quicklisp installation on WSL2. Are you running this in a WSL2 container?
If you are running it in WSL2 (an assumption), you probably need to set up a manual proxy inside WSL2; take a look at this similar issue. The ./init script has a --wsl-quicklisp-proxy flag, so once the proxy is running (as explained in the SO answer), you can inject the proxy URL by executing: ./init --wsl-quicklisp-proxy "localhost:8080". This is only required in the init stage, if quicklisp doesn't want to cooperate.

Do you have access to a Linux machine without a firewall to test things? This seems like the easiest way to test these things.

Also, tagging @gitbuda (our CTO) on this issue: we had a short discussion about it, and he offered to help with quicklisp. He is also interested in the use case, so if you have time for a call and a chat, he would gladly do it.

jhb commented 1 year ago

Hi @antejavor,

I wish you a happy new year!

Thanks a lot for your advice, I can confirm that the benchmark is running now (literally as I write this). I had to add --wsl-quicklisp-proxy "1.2.3.4:3128", just as you said. On top of that, I had to set the neo4j vendor like this: --vendor neo4j /neo4j. It seems that, contrary to the documentation, you need to specify the folder where bin/neo4j is located, not the binary itself. Last, the compare_results.py script needed a --difference-threshold=0.02 parameter in order not to raise an exception.

Now I am curious to see what the results are!

antejavor commented 1 year ago

Hi @jhb,

Thank you for your wishes. Wishing you all the best as well; hope you have a great year!

Regarding the binary file path --vendor neo4j /neo4j, I am aware of that issue. It is mentioned in the instructions comment above, but there was a lot of information, so it probably slipped 😄. You are correct that it is not consistent with the methodology. I just started working on mgBench again and will update everything in a few days.

Regarding --difference-threshold, I hadn't noticed that issue so far. I quickly skimmed through the code and probably forgot to add the default argument value; that is why the flag does not appear in the methodology. Thanks for letting me know 👍.

Just a hint regarding the benchmarks: I am not sure what dataset size you are running, but on the large dataset it will take about 30 hours to load everything into Neo4j, since it is pretty slow when there are a lot of transactions and requires some optimizations to spread the queries into chunks. For the same reason, we experienced some crashes with Neo4j on the large dataset, and we haven't had time to adapt mgbench to support the scenario where the DB crashes (Memgraph does not have that issue). So my suggestion is to start with the small and medium datasets.

I will probably fix Neo4j support for the large dataset in a few days and keep you posted if anything changes.

jhb commented 1 year ago

Thanks a lot for your advice, @antejavor.

Another discrepancy I found: the documentation states "Hot run - before executing any benchmark query and taking measurements, a set of defined queries is executed to pre-warm the database." Looking at https://github.com/memgraph/memgraph/blob/eda5213d950e4cf4f8c81d6eee8963fa0596ffa6/tests/mgbench/benchmark.py#L648-L650, however, I get the impression that the warmup time is measured as well?

While asking that, there are two issues I see with the warmup:

jhb commented 1 year ago

Sorry, I noticed my error and see that the vendor doesn't track time there, only memory usage.

antejavor commented 1 year ago

@jhb Regarding the warm-up, that is true. If you want to warm up the vendor better, it is recommended to execute more queries, for example queries touching all the data. Feel free to add different queries for the warm-up. We have opened an issue collecting things people reported and wanted to see in future iterations: https://github.com/memgraph/memgraph/issues/689. It also covers things you mentioned, one of them being stronger warm-ups.
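
For illustration, a stronger warm-up on the Neo4j side could touch every node and relationship with a single query, roughly like this (connection details assumed; auth is disabled by the benchmark):

cypher-shell -a bolt://localhost:7687 \
  "MATCH (n) OPTIONAL MATCH (n)-[r]->() RETURN count(n) + count(r);"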

On the other hand, we are not making decisions based on a few iterations; we execute hundreds to thousands of iterations per query in some cases. You can increase the running time of the benchmark, and with it the number of iterations, via --single-threaded-runtime-sec (default: 10 seconds). In the .cache folder, you can see how many iterations were actually executed per query:

                "expansion_1": {
                    "count": 35005,
                    "duration": 10
                },

For expansion_1 and Memgraph, we ran a total of 35005 queries and made measurements on that sample. So for the Neo4j run, increase the flag's default value of 10 seconds to something bigger. This will execute more queries and hence warm up the system better.
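
As a sketch, assuming the flag is accepted by the same graph_bench.py entry point used above (if not, pass it to benchmark.py directly; other workload flags as in the earlier call):

graph_bench.py \
    --vendor neo4j /home/neo4j-5.2 \
    --dataset-group basic \
    --dataset-size small \
    --single-threaded-runtime-sec 30 \
    --realistic 100 30 70 0 0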

Now that you have mentioned the warm-up topic: this one is a bit controversial from multiple points of view. Neo4j, being JVM-based and hybrid disk/memory, requires time to warm up, and that in turn requires developer time and an understanding of the DB. There are automatic warm-up procedures, but they are reserved for Enterprise users. So we decided to slightly warm up Neo4j and the JVM with just a few queries; it can definitely be warmed up further. The focus of this benchmark is more "out-of-the-box" performance than fine-tuning, and we are definitely not Neo4j experts. There were recommendations (from ex-Neo4j folks) to execute the whole benchmark once as a warm-up first, so we will probably move to that solution in the future, bringing all systems to their optimal state.

I could talk about this at enormous length, with pros and cons for each direction: strong warm-ups vs. weak warm-ups vs. no warm-ups at all. If you look at the SQL world, ClickHouse recently published their benchmark, and they consider a hot run to be the second or third execution of the same query: https://github.com/ClickHouse/ClickBench#results-usage-and-scoreboards. So the first execution of the query is cold and the second is hot, while we run quite a number of iterations. I mention this because the SQL world is more mature and competitive, and these things will get stricter and stricter as the graph database scene matures.

Regarding the indexes, I was not aware of the mentioned await-indexes procedure. But you may also have noticed that we create the index before loading the data (which slows down the import process). This means the index is updated after each transaction, so in theory (I didn't test it) the index should be ready. Thanks for the hint; I added a note to the issue and will try it out, since I will move index creation to the end of the process to improve import speed.
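
For anyone following along, the procedure in question can apparently be invoked like this (untested on our side; timeout and connection details assumed):

cypher-shell -a bolt://localhost:7687 "CALL db.awaitIndexes(300);"   # wait up to 300 s for all indexes to come online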

jhb commented 1 year ago

@antejavor Thanks for the explanations. I agree that this issue is not the right place to discuss all the pros and cons of cache warmup, so I will just thank you for your elaboration on this.

You mention 'The focus of this benchmark is more "out-of-the-box" performance than fine-tuning'. I also noticed code that specifically turns off Memgraph's edge property store during the benchmark: https://github.com/memgraph/memgraph/blob/eda5213d950e4cf4f8c81d6eee8963fa0596ffa6/tests/mgbench/benchmark.py#L543-L549. Doesn't this look a bit like fine-tuning?

antejavor commented 1 year ago

@jhb You are right, it does look a bit like fine-tuning, but it is not. It is just legacy code that needs a bit of love and refactoring 😄. Take a look at this piece of code where the runner flags are prepared: https://github.com/memgraph/memgraph/blob/eda5213d950e4cf4f8c81d6eee8963fa0596ffa6/tests/mgbench/runners.py#L88-L96

We had to handle different flags for properties on edges in some earlier versions of Memgraph. We introduced that flag way back, and it should be removed from the code you are referring to. In the methodology, I mentioned that this code is tightly coupled with Memgraph; this is one of the reasons I will refactor it to be more transparent and better designed. The other reason is to make it easier to execute.

What is important here in the context of the benchmarks: when you run the benchmarks and do not pass the flag, it is false by default, but in current versions of Memgraph the default value of --storage-properties-on-edges is true. That is why I added the negation there. You can check the default Memgraph configuration via the SHOW CONFIG command or by opening the Memgraph config file at /etc/memgraph/memgraph.conf (in docker); see https://memgraph.com/docs/memgraph/reference-guide/configuration. So it looks like fine-tuning, but it is the out-of-the-box configuration.
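
Two quick ways to verify the effective value (assuming the mgconsole client is available; the file path is the one mentioned above):

echo "SHOW CONFIG;" | mgconsole
grep storage-properties-on-edges /etc/memgraph/memgraph.conf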

By the way, did you manage to run the benchmarks? What are the results? I am excited to find out 😄

jhb commented 1 year ago

Hi @antejavor,

I appreciate you agreeing on the look of things, and thank you all the more for your clarification on this issue.

The benchmarks were already running last week, see above. My impression is that the throughput in the given benchmarks is indeed better for memgraph. The comparison of memory consumption is less relevant for me, as it is fixed in neo4j anyhow. What I find surprising is that in my own benchmarks (using the neo4j python driver) with my own dataset the result is quite the opposite. So I am spending a bit of time finding out why the results are so different...

antejavor commented 1 year ago

Hmm, that is interesting. I would love to find out what is happening, so if it is not ultra-private, please share all the information you have about this 😅. How different are the results? Did you measure the base latency for your queries? Or did you also measure throughput? Sorry, a lot of questions 😅.

I am not sure exactly what you are measuring, but just to give an idea about clients: executing the queries in the proprietary CLI tools may give a sense of the latency. We did this to validate the latency results of mgbench, since mgclient supports both Neo4j and Memgraph.

Also, returning just counts of results instead of actual properties and nodes (a high I/O load on the client) can yield different results, since high I/O bandwidth can end up stress-testing the client and the protocol rather than actual DB performance. This depends on how you measure latency.
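
To illustrate, the same traversal can be measured with two result shapes; the second pushes far more data through the driver and the protocol (query text is illustrative only):

cypher-shell "MATCH (n)-[]->(m) RETURN count(m);"   # tiny result set: mostly measures the DB
cypher-shell "MATCH (n)-[]->(m) RETURN m;"          # full nodes: also stresses client and protocol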

Of course, in the end it could all come down to the specific queries and dataset. This is what makes benchmarking a hard and interesting topic.

jhb commented 1 year ago

Thanks a lot for your advice. I don't think I can share the details of the test setup here, I am afraid, but I contacted you on Discord. Having said that, I use the python bolt driver with multiprocessing.Pool and starmap to run simultaneous queries. As I am interested in relative performance, I compare the same set of queries (or similar ones, if they use built-in algorithms) against multiple graph databases (and versions thereof). Some queries can take tens of seconds to run, so I am more interested in the overall time taken.

antejavor commented 1 year ago

@jhb Thanks for all the feedback on the benchmarks in general. This issue was resolved a few months ago, so I will close it. In the meantime, we have also built a tool for this: https://memgraph.com/blog/benchmark-memgraph-or-neo4j-with-benchgraph.