scsibug / nostr-rs-relay

Mirror of https://sr.ht/~gheartsfield/nostr-rs-relay/
MIT License
588 stars 161 forks source link

How to stop the service? #38

Open Giszmo opened 1 year ago

Giszmo commented 1 year ago

I'm running the relay in a tmux and noticed it was using up all my 8 vCPU and the nostr.db-wal was 18GB compared to 4GB of the nostr.db ... I wanted to see what happens if I stop the relay and hit CTRL-C. It closed all connections and supposedly the DB but the process never finished and all the vCPU are still busy running ./nostr-rs-relay --db /usr/src/app/db.

Now ~6~ ~11~ 30min later it even complains about some events. Does it run on a many minutes backlog? How long is the backlog??

Dec 26 23:45:42.785  INFO nostr_rs_relay::server: Close connection down due to shutdown, client: e1e16dbd, ip: "172.17.0.1", connected: 1360.588062127s
Dec 26 23:45:42.786  INFO nostr_rs_relay::server: stopping client connection (cid: e1e16dbd, ip: "172.17.0.1", sent: 0 events, recv: 0 events, connected: 1360.588289512s)
Dec 26 23:45:43.745  INFO nostr_rs_relay::db: database connection closed

^C

Dec 26 23:51:02.721  INFO nostr_rs_relay::db: Could not parse hex range from id "note1kka4yztq697w6pj97we7hudynp6xzkumgdr87hga2s2fkv903nxsxp7g6r"
Giszmo commented 1 year ago

So if there is a backlog, it should probably show estimates on how long that backlog is getting. What are the consequences of hard killing the process now?

hugomd commented 1 year ago

Ctrl - C will send a SIGINT. I'm not sure if nostr-rs-relay respects that.

You could try getting the process ID and sending a SIGTERM instead.

hugomd commented 1 year ago

Although.. The code does seem to handle SIGINT as well: https://github.com/scsibug/nostr-rs-relay/blob/5f9fe1ce59a20c1818c3d9afbc60313ac9cb51cb/src/server.rs#L196-L199

Giszmo commented 1 year ago

It does something. The log above shows that the database connection was closed. It also shut down all websockets and I'm pretty sure I did close it this way before.

Giszmo commented 1 year ago

I see the ^C right where the SIGINT is logged out:

^CDec 26 23:45:42.765  INFO nostr_rs_relay::server: shutting down due to SIGINT (main)                                
Dec 26 23:45:42.765  INFO nostr_rs_relay::server: Shutting down webserver due to SIGINT                               
Dec 26 23:45:42.765  INFO nostr_rs_relay::server: Close connection down due to shutdown, client: f931c73e, ip: "172.17
.0.1", connected: 67565.52392292s                                                                                     
Dec 26 23:45:42.765  INFO nostr_rs_relay::server: Close connection down due to shutdown, client: ee4c940e, ip: "172.17
.0.1", connected: 2559.958913493s                                                                                     
Dec 26 23:45:42.765  INFO nostr_rs_relay::server: stopping client connection (cid: ee4c940e, ip: "172.17.0.1", sent: 0
 events, recv: 0 events, connected: 2559.958949945s)                                                                  
Dec 26 23:45:42.765  INFO nostr_rs_relay::server: stopping client connection (cid: f931c73e, ip: "172.17.0.1", sent: 2
1 events, recv: 355119 events, connected: 67565.523955464s)                                                           
Dec 26 23:45:42.766  INFO nostr_rs_relay::server: Close connection down due to shutdown, client: 41970558, ip: "172.17
.0.1", connected: 61581.323059555s                                                                                    
Dec 26 23:45:42.766  INFO nostr_rs_relay::server: stopping client connection (cid: 41970558, ip: "172.17.0.1", sent: 0
 events, recv: 450 events, connected: 61581.323095729s)                                                               
Dec 26 23:45:42.766  INFO nostr_rs_relay::server: Close connection down due to shutdown, client: ce259bc1, ip: "172.17
.0.1", connected: 4705.007252513s                                                                                     
hugomd commented 1 year ago

I wonder if some threads aren't respecting shutdown events properly..

I'm running nostr-rs-relay in docker and haven't had any issues force stopping the container.

Giszmo commented 1 year ago

So what's concerning me is that the Load went up after SIGINT.

Screenshot from 2022-12-26 21-46-12 Screenshot from 2022-12-26 21-45-42

Giszmo commented 1 year ago

SIGTERM doesn't impress the process. docker kill nostr-relay did.

Giszmo commented 1 year ago

yeah, now the DB is locked :(

docker run -it --rm -p 7000:8080   --user=100:100   -v $(pwd)/data:/usr/src/app/db:Z   -v $(pwd)/config.toml:/usr/src/app/config.toml:ro,Z   --name nostr-relay nostr-rs-relay:latest
Dec 27 00:51:33.028  INFO nostr_rs_relay: Starting up from main
Dec 27 00:51:33.046  INFO nostr_rs_relay::server: listening on: 0.0.0.0:8080
Dec 27 00:51:33.049  INFO nostr_rs_relay::server: db writer created
Dec 27 00:51:33.050  INFO nostr_rs_relay::server: control message listener started
Dec 27 00:51:38.060 ERROR r2d2: database is locked    
Dec 27 00:51:38.060 ERROR r2d2: database is locked    
Dec 27 00:51:38.060 ERROR r2d2: database is locked    
Dec 27 00:51:43.067 ERROR r2d2: database is locked    
Dec 27 00:51:43.467 ERROR r2d2: database is locked    
Giszmo commented 1 year ago

:( how to recover?

Giszmo commented 1 year ago
ls -lh data/
total 22G
-rw-r--r-- 1 systemd-timesync users 3.9G Dec 26 23:45 nostr.db
-rw-r--r-- 1 systemd-timesync users  23M Dec 27 00:52 nostr.db-shm
-rw-r--r-- 1 systemd-timesync users  18G Dec 26 23:45 nostr.db-wal
0xtrr commented 1 year ago

Recover: I locked my DB as well when I ran out of diskspace and had to dump the data from the locked DB into a new sqlite db file to get it to work again.

WAL file: You can run "PRAGMA wal_checkpoint" (or "PRAGMA wal_checkpoint(truncate)") to reduce the size of the wal file. Nostr-rs-relay should do the truncate command automatically once every 60 seconds I believe though.

Giszmo commented 1 year ago

So DB lock recovery was to sqlite3 data/nostr.db ".backup backup.db"; mv backup.db data/nostr.db.

With version 0.7.15 the -wal file doesn't grow big anymore but I still get the weird behaviour that I can't stop the service without corrupting the DB. The DB doesn't end up locked always but it does happen.

scsibug commented 1 year ago

This is strange. Can you provide details about OS version? I want to spin up a VM and try to duplicate.

I've Ctrl-c'd the relay on my OSX and Fedora machine hundreds of times and I've never seen a database lock error even once.

Giszmo commented 1 year ago
# uname -a
Linux lw10 5.10.0-9-amd64 #1 SMP Debian 5.10.70-1 (2021-09-30) x86_64 GNU/Linux

digitalocean

Screenshot from 2022-12-28 01-16-02

jblachly commented 1 year ago

SIGTERM doesn't impress the process. docker kill nostr-relay did.

I believe this is a Docker-specific issue. Namely, PID 1 (which the process is, in the container namespace) does not respect SIGKILL or SIGTERM

guggero commented 1 year ago

Yes, this probably is Docker related. It depends on how you invoke a command within a Dockerfile. Seems like Umbrel added a workaround for this: https://github.com/getumbrel/docker-nostr-rs-relay/blob/master/Dockerfile#L27 (ENTRYPOINT is handled differently than CMD and using an array instead of a string command also has an impact: https://testdriven.io/tips/9c233ead-5f34-4b95-b416-84f80f3ef5e7/)