mimblewimble / grin

Minimal implementation of the Mimblewimble protocol.
https://grin.mw/
Apache License 2.0

Node out of sync frequently #2786

Closed · garyyu closed this issue 5 years ago

garyyu commented 5 years ago

With the master branch as of 23 Apr (commit 97e96e4f), on a seed node with more than 200 peer connections, I found the following this morning:

$ grep "enabling sync"  main/grin-server.log

20190429 17:00:32.520 INFO grin_servers::grin::sync::syncer - sync: total_difficulty 452077136850862, peer_difficulty 452091465393996, threshold 14216566283 (last 5 blocks), enabling sync
20190429 18:55:28.516 INFO grin_servers::grin::sync::syncer - sync: total_difficulty 452425978492995, peer_difficulty 452440916923808, threshold 14816774959 (last 5 blocks), enabling sync
20190429 19:30:17.893 INFO grin_servers::grin::sync::syncer - sync: total_difficulty 452539956776890, peer_difficulty 452554856641795, threshold 14698209344 (last 5 blocks), enabling sync
20190429 21:32:44.192 INFO grin_servers::grin::sync::syncer - sync: total_difficulty 452944243970363, peer_difficulty 452963462027288, threshold 16439366430 (last 5 blocks), enabling sync
20190429 21:40:22.533 INFO grin_servers::grin::sync::syncer - sync: total_difficulty 452973073359731, peer_difficulty 452989148302503, threshold 16018022976 (last 5 blocks), enabling sync
20190429 21:50:51.121 INFO grin_servers::grin::sync::syncer - sync: total_difficulty 453002213646712, peer_difficulty 453018648620875, threshold 16329116030 (last 5 blocks), enabling sync
20190429 22:17:30.115 INFO grin_servers::grin::sync::syncer - sync: total_difficulty 453083245253858, peer_difficulty 453102420009133, threshold 16043282047 (last 5 blocks), enabling sync
20190429 22:26:28.371 INFO grin_servers::grin::sync::syncer - sync: total_difficulty 453111749457897, peer_difficulty 453130491604398, threshold 15683467841 (last 5 blocks), enabling sync
20190429 23:08:29.699 INFO grin_servers::grin::sync::syncer - sync: total_difficulty 453219974970666, peer_difficulty 453238408112952, threshold 15972036303 (last 5 blocks), enabling sync
20190429 23:22:28.324 INFO grin_servers::grin::sync::syncer - sync: total_difficulty 453271370110331, peer_difficulty 453289078297348, threshold 14869717408 (last 5 blocks), enabling sync
20190430 00:11:39.285 INFO grin_servers::grin::sync::syncer - sync: total_difficulty 453392935125941, peer_difficulty 453406783695913, threshold 13833229614 (last 5 blocks), enabling sync
20190430 00:16:37.162 INFO grin_servers::grin::sync::syncer - sync: total_difficulty 453415232888559, peer_difficulty 453429385574964, threshold 14007584944 (last 5 blocks), enabling sync
20190430 00:58:57.325 INFO grin_servers::grin::sync::syncer - sync: total_difficulty 453514279684477, peer_difficulty 453530622030682, threshold 13208193770 (last 5 blocks), enabling sync
20190430 01:03:44.502 INFO grin_servers::grin::sync::syncer - sync: total_difficulty 453530622030682, peer_difficulty 453546921335334, threshold 13641733361 (last 5 blocks), enabling sync

And the peer count dropped to 12! Then I found that port 3414 was no longer listening, and found the related panic:

20190429 12:37:49.002 DEBUG grin_p2p::peer - accept: handshaking from Ok(V4(185.186.56.76:43262))
20190429 12:37:49.003 ERROR grin_util::logger -
thread 'p2p-server' panicked at 'clone conn for writer failed: Os { code: 24, kind: Other, message: "Too many open files" }': src/libcore/result.rs:997
stack backtrace:
   0: <no info> (0x55ef833eafc0)
   1: <no info> (0x55ef8350eb49)
   ...

And checking the server configuration:

$ sysctl net.core.somaxconn
net.core.somaxconn = 4096

$ ulimit -n
1024

So it could be caused by the number of open connections exceeding the file descriptor limit (>1024).

But we definitely need an improvement to avoid this thread crash!
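For reference, a minimal sketch of the kind of handling that would keep the p2p-server thread alive (this is not the actual grin code; start_writer and handle_new_peer are hypothetical names): treat a failed try_clone() as a recoverable error and drop that one connection instead of panicking.

use std::io;
use std::net::TcpStream;

// try_clone() fails with EMFILE ("Too many open files") once the process
// hits its descriptor limit; return the error instead of panicking so the
// p2p-server thread survives. Hypothetical helper, not grin's API.
fn start_writer(conn: &TcpStream) -> io::Result<TcpStream> {
    conn.try_clone()
}

fn handle_new_peer(conn: TcpStream) {
    match start_writer(&conn) {
        Ok(writer_conn) => {
            // hand writer_conn off to the writer thread as usual
            let _ = writer_conn;
        }
        Err(e) => {
            // log and drop this one connection; the accept loop keeps running
            eprintln!("clone conn for writer failed: {}, dropping peer", e);
            drop(conn);
        }
    }
}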

mcdallas commented 5 years ago

This happened to me many times: after having 200+ peers for a while, I was getting the "Too many open files" error and grin stopped listening on 3414. I fixed it by increasing ulimit -n.
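Note that ulimit -n only raises the soft limit for the current shell session. For a node running as a systemd service, a drop-in like the following is one common way to raise the limit persistently on Linux (the grin.service unit name and the 65536 value are examples, not project recommendations):

# /etc/systemd/system/grin.service.d/limits.conf (example)
[Service]
LimitNOFILE=65536

For a node started from a login shell, the equivalent is a nofile entry in /etc/security/limits.conf for the user running grin.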

jeromemorignot commented 5 years ago

I'm using the stratum server to mine directly to it with NH, so the number of workers can become really big... and I'm experiencing the same issue. Is there a way to clean up all these dead connections? (I increased the ulimit, but that is a band-aid solution.)
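As a rough sketch of what a non-band-aid fix could look like (this is not grin's actual connection-tracking code; the names and the 10-minute timeout are illustrative), the server could track a last-seen timestamp per connection and periodically drop sockets that have been idle too long, which releases their file descriptors:

use std::collections::HashMap;
use std::net::TcpStream;
use std::time::{Duration, Instant};

// Illustrative value; real code would make this configurable.
const IDLE_TIMEOUT: Duration = Duration::from_secs(600);

struct TrackedConn {
    stream: TcpStream,
    last_seen: Instant,
}

// Called periodically from a housekeeping thread. Removing an entry drops
// its TcpStream, which closes the socket and frees the file descriptor.
fn reap_idle(conns: &mut HashMap<u64, TrackedConn>) {
    let now = Instant::now();
    conns.retain(|_, c| now.duration_since(c.last_seen) < IDLE_TIMEOUT);
}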

bladedoyle commented 5 years ago

FYI: This continues to be an issue even in v1.0.3.

quentinlesceller commented 5 years ago

V1 issue. Feel free to open another issue if it is still the case for v2.