PostgreSQL: how to maintain nostr DB?

jatm80 commented 1 year ago

Hi there,
I am using PostgreSQL for my public relay, and I have noticed that after a couple of weeks in operation the DB is already over 700MB. Is there any recommendation about which tables can be cleaned up, and how to maintain this DB over the long term?

Thanks

mikeziri commented 1 year ago

I guess it is an inevitability due to the amount of notes your relay receives.

More importantly, I would love to have a spam filter in front of it. I guess one could also build one behind it, meaning it would clean up spam after insertion.

Mine is at 1.5GB (350MB gzipped) after 2 months of running 24/7, with many repeated Chinese notes arriving every minute. Clearly bots spamming.

scsibug commented 1 year ago

If you are running a public relay, it is highly recommended to implement some sort of spam filtering. Your choices with nostr-rs-relay are essentially the following:

* Use the `verified_users` configuration to enforce NIP-05 validation and domain whitelists. Very effective, but limits the audience somewhat.
* Implement a gRPC server to filter out spam. Today, this is mostly custom work, but more people are starting to publish their spam detection models.
* Sit around and block IPs or networks that connect and send spam, and then clean stuff up afterwards (not recommended!)

Using the gRPC filter will take care of the spam before it hits the database. That is how I run nostr-pub.wellorder.net, and it blocks hundreds of thousands of events from hitting the DB.

There is an MR currently open for LN pay-to-relay functionality, so that will become a powerful spam-preventing mechanism as well.

mikeziri commented 1 year ago

I see. Could you share your setup with gRPC filter?

scsibug commented 1 year ago

> I see. Could you share your setup with gRPC filter?

Sorry, right now it is pretty crude, and relies on an out-of-band spam model that I built, so I don't think it's in a reusable state. If I get some time I'll clean it up and post it, but that'll take a few weeks before I can get around to it. Will try to do it sooner, but no promises.

It's all based on the example code; all I've done is add a bunch of detection functions that operate serially to find indicators of spam/ham, and then sum them together and see if they pass a threshold. Incorporating the Bayesian spam model is the most complex bit, and the most fragile. I'm not sure there are really any high-quality Bayes crates for Rust, so the one I'm using is pretty slow. Also, naive Bayes sucks for Asian languages, so that whole approach needs to be rethought...
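The "sum the indicators and compare against a threshold" idea can be sketched roughly as below. This is not the actual filter: the three detector functions, their weights, and the `THRESHOLD` value are all hypothetical, chosen only to show the shape of the approach.

```python
import re
from typing import Callable, List

# A detector maps note content to a score:
# positive values suggest spam, negative values suggest ham.
Detector = Callable[[str], float]


def many_links(content: str) -> float:
    # Hypothetical indicator: more than two URLs adds to the spam score.
    return 1.0 if len(re.findall(r"https?://", content)) > 2 else 0.0


def long_repeats(content: str) -> float:
    # Hypothetical indicator: the same character 10 or more times in a row.
    return 1.0 if re.search(r"(.)\1{9,}", content) else 0.0


def short_plain_note(content: str) -> float:
    # Hypothetical ham indicator: a short note without links lowers the score.
    return -0.5 if len(content) < 280 and "http" not in content else 0.0


# Detectors run serially, in order; the list is an assumption, not the real set.
DETECTORS: List[Detector] = [many_links, long_repeats, short_plain_note]
THRESHOLD = 1.0  # assumed cut-off; would need tuning against real traffic


def spam_score(content: str) -> float:
    """Sum the scores of all detectors for one note."""
    return sum(detector(content) for detector in DETECTORS)


def is_spam(content: str) -> bool:
    """A note is rejected once the summed score reaches the threshold."""
    return spam_score(content) >= THRESHOLD
```

A Bayesian model would presumably plug in as one more detector in the list, which keeps the slow or fragile part isolated from the cheap checks.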

mikeziri commented 1 year ago

OK. Thanks @scsibug. I hope you find some time to work on that.

I started playing with bogofilter, since you can interact with it as an extension or as a simple CLI. It was built for email but works well with plain text on stdin. My idea was to filter events after they are stored: I'd like a spam_score column (0-1) filled in for kind 1 (note) events, and then filter on it. Losing false positives wouldn't be cool, so it's better to keep everything, dry-run the cleanups, and only then commit them.
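To make the dry-run idea concrete, here is a minimal sketch that works on in-memory records rather than the real database. The `ScoredEvent` type, the `cleanup` function, and the 0.9 threshold are assumptions for illustration; nothing here reflects the actual nostr-rs-relay schema or bogofilter's output format.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class ScoredEvent:
    event_id: str
    kind: int
    spam_score: float  # 0.0 (ham) to 1.0 (spam), filled in by some classifier


def select_for_cleanup(events: List[ScoredEvent], threshold: float) -> List[str]:
    """Return ids of kind 1 notes whose score is at or above the threshold."""
    return sorted(
        e.event_id for e in events if e.kind == 1 and e.spam_score >= threshold
    )


def cleanup(
    events: List[ScoredEvent], threshold: float = 0.9, dry_run: bool = True
) -> Tuple[List[ScoredEvent], List[str]]:
    """Report what would be deleted; only drop events when dry_run is False."""
    doomed = select_for_cleanup(events, threshold)
    if dry_run:
        # Nothing is removed: the caller can inspect `doomed` for false positives.
        return events, doomed
    doomed_ids = set(doomed)
    kept = [e for e in events if e.event_id not in doomed_ids]
    return kept, doomed
```

Running it first with `dry_run=True` gives a list to review by hand, and the same call with `dry_run=False` commits exactly that selection.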

> Also, naive bayes sucks for asian languages, so that whole approach needs to be rethought...

Never thought of it. I'm not familiar enough with the linguistic structure of Asian languages to know how classic Bayesian models, built around (Western-language) "words", apply to them. Most of the spam I'm receiving is clearly Asian.

I guess it must be dealt with at the IP level, as you also suggested.
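A small illustration of the "words" problem, as I understand it: splitting on whitespace gives a word-based model almost nothing to count in text written without spaces. Character bigrams are shown only as one commonly used alternative tokenization, not as something anyone in this thread is running; both function names are made up for the example.

```python
from typing import List


def whitespace_tokens(text: str) -> List[str]:
    """Word-style tokenization: split on runs of whitespace."""
    return text.split()


def char_bigrams(text: str) -> List[str]:
    """Overlapping two-character tokens, ignoring whitespace."""
    chars = [c for c in text if not c.isspace()]
    return ["".join(chars[i:i + 2]) for i in range(len(chars) - 1)]


english = "buy cheap coins now"
chinese = "今天天气很好"

print(whitespace_tokens(english))  # four separate word tokens
print(whitespace_tokens(chinese))  # the whole sentence as a single token
print(char_bigrams(chinese))       # five overlapping two-character tokens
```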

Anyway, if you want to take a look at bogofilter:

* Repo: https://gitlab.com/bogofilter/bogofilter/-/blob/main/bogofilter/README?plain=1
* Website: https://bogofilter.sourceforge.io/
* Man page: https://bogofilter.sourceforge.io/man_page.shtml

jatm80 commented 1 year ago

> If you are running a public relay, it is highly recommended to implement some sort of spam filtering. Your choices with nostr-rs-relay are essentially the following:
>
> * Use the `verified_users` configuration to enforce NIP-05 validation and domain whitelists.  Very effective, but limits the audience somewhat.
> * Implement a gRPC server to filter out spam.  Today, this is mostly custom work, but more people are starting to publish their spam detection models.
> * Sit around and block IPs or networks that connect and send spam, and then clean stuff up afterwards (not recommended!)
>
> Using the gRPC filter will take care of the spam before it hits the database. That is how I run the nostr-pub.wellorder.net, and it blocks hundreds of thousands of events from hitting the DB.
>
> There is a MR currently open for LN pay-to-relay functionality, so that will become a powerful spam-preventing mechanism as well.

Awesome, thanks for your suggestions. I jumped into investigating the gRPC server using Go (my favorite language) and came across the project pluja/nerostr. It was built for Monero, but I think it could easily be adapted to filter on some heuristic instead, for example the poster's number of followers and follows: if both are 0, treat it as a bot and block it. My intention is not to monetize this but to filter spam while keeping my node public and free.
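Roughly what I have in mind for the 0/0 check, sketched in Python for brevity. It assumes the relay has already collected contact lists (kind 3 events) into a map from author pubkey to the set of pubkeys they follow; that map and both function names are hypothetical.

```python
from typing import Dict, Set, Tuple


def follow_counts(
    contact_lists: Dict[str, Set[str]], pubkey: str
) -> Tuple[int, int]:
    """Return (followers, follows) for a pubkey from known contact lists."""
    follows = len(contact_lists.get(pubkey, set()))
    followers = sum(
        1
        for author, followed in contact_lists.items()
        if author != pubkey and pubkey in followed
    )
    return followers, follows


def should_block(contact_lists: Dict[str, Set[str]], pubkey: str) -> bool:
    """Block posters with no followers and no follows among known lists."""
    followers, follows = follow_counts(contact_lists, pubkey)
    return followers == 0 and follows == 0
```

One caveat with this rule: the counts only reflect contact lists this relay has seen, so a genuine new user with no contact list yet would be blocked as well.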

scsibug / nostr-rs-relay

PostgreSQL: how to maintain nostr DB? #90