ssbc / ssb-server

The gossip and replication server for Secure Scuttlebutt - a distributed social network

Stuck indexing and not responding to incoming connections #734

Open clehner opened 3 years ago

clehner commented 3 years ago

ssb-server as of v16 can get into an unresponsive state. In this state, it rejects incoming connections and does not index the flumelog.

When a secret-handshake client tries to connect to the server - even a local one - the server rejects the connection a few seconds after sending the first response packet. Example transcript from trying to connect to the server:

$ strace sbotc -4 whoami
[...]
socket(AF_INET, SOCK_STREAM, IPPROTO_TCP) = 3
connect(3, {sa_family=AF_INET, sin_port=htons(8008), sin_addr=inet_addr("127.0.0.1")}, 16) = 0
getrandom("\316\f%\243\333\220\214\203e\357\312;P\270\211\30", 16, 0) = 16
getrandom("\320\366\27\177:\234]{:\222\245\363Ow\0232\302\315\276\240C\235\344\220\10\365\tj\373\raT", 32, 0) = 32
write(3, "$\213\202\nVI\222wb\354Z\307\\\3*~\257R\376F\23\3710\377\0.J\25\366\322\1\216"..., 64) = 64
read(3, "\205\364n\264\322\17;\342\22\v^\271\310\271@\215\253=\345@\334o\233\320\345}\307#\341OaH"..., 64) = 64
write(3, "\212N\261/\v\301\2061_\36\367\364P\265\314\313\36\201\25\21\341\301\250=;\2p\214m\303\201\20"..., 112) = 112
read(3, [A few seconds pass here...] "", 80)                         = 0
write(2, "sbotc: ", 7sbotc: )                  = 7
write(2, "hello not accepted", 18hello not accepted)      = 18
write(2, ": ", 2: )                       = 2
write(2, "Broken pipe\n", 12Broken pipe
)           = 12

If the server is configured to use the unix noauth socket, sbotc can connect through that instead. (The -4 option used above forces it to use TCP, and thus SHS, instead of the noauth socket.) Connecting with sbotc over the noauth socket reveals that ssb-server is not indexing its flumelog. The gap between the index target and the current position, printed once per second by this command, never changes:

$ while sleep 1; do sbotc progress | jq -c .indexes.target-.indexes.current; done
366939
366939
366939
366939
366939
^C
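A minimal sketch of how the check above could be automated instead of watched by eye. The function name detect_stall and its threshold argument are hypothetical, not part of sbotc or ssb-server; it only assumes one remaining count per line on stdin, as printed by the jq expression above.

```shell
#!/bin/sh
# detect_stall N (hypothetical helper): reads one "remaining" count per line.
# Prints "stalled at <count>" and returns 1 once N consecutive samples are
# identical and non-zero; prints "caught up" and returns 0 if a sample is 0;
# prints "inconclusive" and returns 2 if the input ends before either happens.
detect_stall() {
  threshold=$1
  prev=
  repeats=0
  while read -r remaining; do
    # Skip empty or non-numeric lines (for example, error output).
    case $remaining in ''|*[!0-9]*) continue ;; esac
    if [ "$remaining" -eq 0 ]; then
      echo "caught up"
      return 0
    fi
    if [ "$remaining" = "$prev" ]; then
      repeats=$((repeats + 1))
    else
      repeats=1
      prev=$remaining
    fi
    if [ "$repeats" -ge "$threshold" ]; then
      echo "stalled at $remaining"
      return 1
    fi
  done
  echo "inconclusive"
  return 2
}

# Usage, taking five samples one second apart over the noauth socket:
# for _ in 1 2 3 4 5; do
#   sbotc progress | jq -c '.indexes.target-.indexes.current'
#   sleep 1
# done | detect_stall 5
```

The sampling loop is bounded so the pipeline ends on its own. A threshold of a few seconds may be too short on a slow disk, so treat a "stalled" result as a hint rather than proof.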

It looks to me like ssb-server v16 gets into this state when it starts while the flumelog is not caught up. In previous versions, this was not a problem. ssb-server should allow data to be appended to the flumelog while it is offline, and should then index and catch up when it starts. As a workaround, running ssb-server v15 lets indexing catch up, after which switching back to ssb-server v16 works again.

SSB thread reporting this issue: %xPim4b5fwQ+3YDWkJ34WS2dN61gteHuu9UDqCY3Ipxg=.sha256

timjrobinson commented 3 years ago

I encountered this again today in local dev after pulling in a lot of new data. Indexing stalled at 97% and the server never fully booted, so it did not respond to whoami.

After updating ssb-db to the latest version, 20.4.0, with npm i -s ssb-db@latest, it seems to be indexing correctly and responding to commands again.
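To confirm which ssb-db version a checkout actually resolves, something like the following sketch may help. The helper version_ge is hypothetical and relies on sort -V, which is an extension found in GNU and recent BSD sort. The commented usage assumes it is run from the ssb-server project root and that ssb-db/package.json can be loaded with require. Whether 20.4.0 is really the version that fixes the stall is not established here.

```shell
#!/bin/sh
# version_ge A B (hypothetical helper): succeeds when dotted version A >= B.
# Sorts the two versions and checks that B comes first (or they are equal).
version_ge() {
  [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# Usage, from the project root:
# installed=$(node -p "require('ssb-db/package.json').version")
# if version_ge "$installed" 20.4.0; then
#   echo "ssb-db $installed is at least 20.4.0"
# else
#   echo "ssb-db $installed is older than 20.4.0"
# fi
```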

I did get some new errors in server logs that I hadn't seen before, so maybe they were causing the stall before:

could not retrive msg: Error [NotFoundError]: Key not found in database [@/3q5HcRfJu6KYA+zRmpwPEB5EN5QF8Ia4xhCrCTHtxw=.ed25519,1]
    at /home/tim/projects/ssb-server/node_modules/levelup/lib/levelup.js:188:15
    at /home/tim/projects/ssb-server/node_modules/encoding-down/index.js:75:21
could not retrive msg: Error [NotFoundError]: Key not found in database [@+sPe2KX1gMaYAiQJKTtBNlFT2bhZOHW07G7n1h4As+w=.ed25519,1]
    at /home/tim/projects/ssb-server/node_modules/levelup/lib/levelup.js:188:15
    at /home/tim/projects/ssb-server/node_modules/encoding-down/index.js:75:21