hoytech / strfry

a nostr relay
GNU General Public License v3.0
501 stars 98 forks source link

Large database performance #57

Open kroese opened 1 year ago

kroese commented 1 year ago

I noticed that as my database grew, the time between:

CONFIG: successfully installed

and

Filter matched XXX records

seemed to increase exponentially. When I had about 5 million events, it took a few minutes and now that I have 15 million events it takes hours (even though it doesnt max out the CPU). This makes it impossible to sync anymore.

By looking at the code my guess is that its stuck at this loop:

while (1) {
            bool complete = query.process(txn, [&](const auto &sub, uint64_t levId, std::string_view eventPayload){
                auto ev = lookupEventByLevId(txn, levId);
                ne.addItem(ev.flat_nested()->created_at(), sv(ev.flat_nested()->id()).substr(0, ne.idSize));

                numEvents++;
            });

            if (complete) break;
        }

At first sight its just adding events to a list, so it should not take exponentially more time? Is there anything that can be done to optimize this part?

hoytech commented 1 year ago

Hi! I think most likely the reason for this is disk I/O. It's probably not exponential, but after it exceeds what can be held in the filesystem cache it causes more disk activity.

I wonder if resizing the vector is also causing some overhead -- I will try to benchmark this.

In the long-term, I'm planning a cache for the XOR fingerprints which will make this much better.

kroese commented 1 year ago

The size of the DB (32GB) is roughly twice the size of the RAM available as a file cache (16GB). The disk is a SSD so even cache-misses should be handled pretty quick, but you may be right that this problem started around the same time the DB did not fit fully in RAM anymore.

hoytech commented 1 year ago

Thanks for the info. FWIW I made a simple benchmark to see if the Negentropy object creation is scaling non-linearly. It appears to be linear, so I think it's likely IO related. I'll think on this.

For reference, here was my benchmark:

#include <unistd.h>

#include <iostream>
#include <sstream>

#include <hoytech/error.h>
#include <hoytech/hex.h>

#include "Negentropy.h"

int main() {
    const uint64_t idSize = 16;
    Negentropy x1(idSize);

    uint64_t num = std::stoull(::getenv("NUM"));
    uint64_t now = ::time(nullptr);

    char id[idSize] = {0};

    for (uint64_t i = 0; i < num; i++) {
        x1.addItem(now--, std::string_view(id, idSize));
    }

    x1.seal();

    return 0;
}
kroese commented 1 year ago

What I still dont fully understand, is that even to read that full 32GB from disk, it should not take any longer than 10 minutes at most. Maybe the LMDB engine reads the records in small chunks, so lets make it 60 minutes.

Thats still way below the actual time the sync-command stalls. I started it 20 hours ago, and its still stuck there. So either it does disk IO extremely inefficiently, or its related to something else.

kroese commented 1 year ago

I gathered some graphs:

diskstats_iops-day diskstats_throughput-day diskstats_utilization-day

It seems to confirm your suspicion that is IO related. It reads 250 MB/s for 15 hours, so somehow its reading MUCH more data then just the whole database.

hoytech commented 1 year ago

Thanks for these! Something definitely seems off here -- I'm investigating.

kroese commented 1 year ago

I saw that you made some optimisations recently, but I didnt notice any large difference after using them.

And today I discovered my problem is not exclusive to syncing at all. As soon as my strfry gets any real traffic (more than a couple of simultanous users) the IO usage immediately goes to 100 procent and slows down the whole server.

I have the most expensive SSD option for Oracle Cloud (Ultra-High Performance) rated for over 50,000 IOPS, so it's very unlikely the reason is my disk is too slow. I have 24 GB of RAM available, so that should be enough, and CPU usage is below 20 percent. This server should be completely overkill to handle just a handful of nostr clients.

As a last resort, I started with an empty database, and the problem disappeared completely. So this confirms the only reason for my problems is the size of the database (60GB).

Do you have any explanation why other strfry dont run into this problem? Do instances like nos.lol or spore.ws have a much smaller database? Or is it because they are running on X86 and my server is ARM? I really would like to understand why I have these performance problems.

hoytech commented 1 year ago

Hi! Yes, I made some optimisations. They will be most effective if the DB is compacted. I didn't want to ask you to try it because there are still a few more things I want to do.

I'm not sure if x86 versus ARM has anything to do with it. It could! I have not done any significant testing on ARM before. AFAIK most major relays are x86. One possible difference is that on ARM the page sizes can be 64k instead of 4k. What does grep -ir pagesize /proc/self/smaps | head -2 return for you BTW?

For the next release I'm planning on adding a flag that lets you disable read-ahead, which should reduce IO usage, but will increase the latency of various other operations.

kroese commented 1 year ago

grep -ir pagesize /proc/self/smaps | head -2 returns:

KernelPageSize:        4 kB
MMUPageSize:           4 kB

So I guess that is normal?

hoytech commented 1 year ago

Yeah, that looks like what I would expect.

I just pushed a commit that adds a new config parameter: dbParams.noReadAhead. Can you please recompile and set that to true and see if it makes any difference? Thanks!

kroese commented 1 year ago

I appreciate the effort you took, but unfortunelately Im not seeing any significant difference in I/O usage when noReadAhead is enabled.

Im really trying to wrap my head around what strfry is doing under the hood. For example, I also run a Bitcoin node on the same server. Its database (LevelDB) is much larger, and it has many more clients connected who are constantly fetching data, and still it is only using a tiny fraction of the I/O strfry is using.

Im not sure if that is a fair comparison, but from a distance their purpose is almost the same (sync data from peers and relay it to all connected clients).

I never compacted the database, since I currently do not have enough free space left to do that (as it would require two copies of the DB). But I assume it should not matter much as with ReadAhead disabled it would not try to read the empty records?

Also Im seeing the sending error to [x]: too many concurrent REQs message a lot in my logs. In the config-file this limit is set to 20 (the default). Is this normal that clients reach this limit? Because maybe its the mis-baviour of some clients that cause the heavy load?

hoytech commented 1 year ago

I think you're right in that bitcoin nodes and nostr relays aren't directly comparable. In nostr you've got clients constantly making fairly relatively random queries that can retrieve data from any part of your DB, whereas with bitcoin most of the data transferred are gossiping new transactions/blocks, and most of the IO would be over the UTXO set.

I wasn't aware that you hadn't done a compaction at all. This has the potential to help a lot. If you don't have the diskspace, one thing you can do is compress in a pipe, so: strfry compact - | zstd -c > strfry-backup.zst . Then after that is done, delete the data.mdb and uncompress the backup and move it to where the data.mdb file was.

If you don't have the diskspace to store a compressed version, you could stream the compaction to another host.

Several other people have reported the "too many concurrent REQs" thing recently. I think some new client is opening too many. We're trying to track it down now.

markwaters commented 8 months ago

I run a couple of private relays , and see many "too many concurrent REQs" in the logs when using my local web-based Primal site.

https://github.com/PrimalHQ/primal-web-app