libbitcoin / libbitcoin-server

Bitcoin Full Node and Query Server
Other
151 stars 65 forks source link

Delays during initial syncing #73

Closed atweiden closed 9 years ago

atweiden commented 9 years ago

During initial sync with bitcoin-core and btcd, temporary delays are common, but with libbitcoin-server delays seem more frequent and longer lasting.

During initial sync, libbitcoin-server will occasionally idle (while still claiming 10% CPU), but without syncing any blocks for many minutes. After 30 or so mins have passed with no syncing, I usually just restart bs. Restarting bs has always resulted in immediate syncing of blocks in these cases.

I have an strace of this delay occuring where I waited it out with no restart.

evoskuil commented 9 years ago

These symptoms duplicate issue: https://github.com/libbitcoin/libbitcoin-server/issues/67

The stack trace may help with that issue.

atweiden commented 9 years ago

strace: http://ix.io/ixt (ctrl-f 16:53:17.795822)

can get more data, but logrotate will rm the logfile at this rate within a day

evoskuil commented 9 years ago

Let's move this over to the existing issue.

evoskuil commented 9 years ago

On second thought, let's keep this here and keep it open.

atweiden commented 9 years ago

Here is a bigger strace of this issue happening. Caused delays in sync starting around block 167198 at 16:53:17.795822.

evoskuil commented 9 years ago

The working theory (from @swansontec) on the outstanding issue is that (since 0.10.0) we are having a hard time getting responses from bitcoind nodes, possibly due to activation of DoS protection.

Though this is not code I'm very familiar with, I don't see any particular problem in the trace. It just appears to show periodic lack of data while we are reading sockets. But that's to be expected if we aren't getting responses.

It's possible that the sync flooding issue (https://github.com/libbitcoin/libbitcoin-server/issues/67) may be triggering bitcoind DoS protection, though that's just speculation. Nobody has taken a close look at this on the sire yet. It is clear however that the issue really ramped up with the release of Bitcoin 0.10.0.

evoskuil commented 9 years ago

This is the only bug we have in the entire stack, so it needs to get some attention soon.

evoskuil commented 9 years ago

See: https://lists.dyne.org/lurker/message/20150429.225334.c39629fd.en.html

atweiden commented 9 years ago

In the interim, would hosting a bootstrap of libbitcoin-server's built db on a seedbox be of service? I assume libbitcoin-server cannot reuse bitcoin-core's torrent bootstrap.

evoskuil commented 9 years ago

Correct assumption. Yes, this might be helpful, as long as the gap from upload to sync update wasn't too great to overcome.

It would certainly be nice once this issue is resolved in any case.

swansontec commented 9 years ago

I am doing a full sync right now. Based on the current progress and experience with past syncs, I generally expect the process to take 1-2 weeks. The major problems are:

  1. The server runs out of memory and gets itself killed. I generally solve this by running the server inside a shell while loop. I don't know if this is a bug, or if the box just doesn't have enough RAM.
  2. The server locks up at 100% CPU usage. I've seen this a number of times during the current sync, and it almost always accompanies some messages about network errors. I solve this by running a while loop in another terminal which does a killall -9 bs every 20 minutes.
  3. Blocks randomly stop arriving. CPU and hard-disk usage are at low levels, and the server has plenty of peer connections, but nobody is sending blocks.

This bug report seems to address the last issue. Even if we could solve all three issues, getting the server to sync at full speed with out babysitting, still I don't think it would be fast. When everything is running smoothly, progress is only about 1 block per second with 100% hard-disk IO usage. At 1 block per second, a full sync would take about 4 days. An SSD would obviously be helpful here.

evoskuil commented 9 years ago

I've seen the high resource utilization, but in each case I'm able to terminate the process normally, even if it takes a significant amount of time for all threads to end and for the memory maps to close. This applies to Windows, OSX and Linux.

@genjix and I were both able to achieve 48 hour sync about 5 months back. I use machines that are over 5 years old with mechanical hard drives. The sync time started increasing to the point where I could no longer complete (trying up to 2 weeks), over a period of a couple of months.

I don't think he was seeing these issues during development and test. I've reviewed code changes over that period and I haven't seen anything that would indicate a material change occurred. I think the theory that the network failure rate is driving the other issues, possibly related to the 0.10.0 release, makes the most sense.

I've looked very closely at the blockchain code, along with @pmienk. I've stepped through its operation and I eventually found an off-by-one indexing error in one of the indexes. But apart from that I've seen no design or other problems with it. It's a pretty simple design. However I have not had a chance to review the P2P networking layer at all.

atweiden commented 9 years ago

I have another possibly relevant error report. No writes to the blockchain directory have taken place in over 9 hours. Despite no writes, I'm seeing a consistent 500k-2.0MB/s down but I can't see what is being downloaded or why it isn't being written anywhere. This began at around block 337,929.

bs stdout is essentially limited to these messages (repeating):

11:05:12.880293 ERROR [poller]: Send problem: Service stopped
11:05:16.791245 ERROR [protocol]: Channel stopped (outbound): Service stopped
11:34:01.254067 ERROR [session]: Requesting data: Service stopped

strace

[pid 12163] 11:09:21.614277 recvmsg(43, {msg_name(0)=NULL, msg_iov(1)=[{"\377\377\377\2\n\t5\0\0\0\0\0\31v\251\24\366\300nn&8\v\274\221\346\35\20\337\33Ws"..., 65536}], msg_controllen=0, msg_flags=0}, 0) = 1388
[pid 12163] 11:09:21.614308 recvmsg(43, 0x7f8704a54a30, 0) = -1 EAGAIN (Resource temporarily unavailable)
[pid 12163] 11:09:21.614335 epoll_wait(19, {{EPOLLIN, {u32=1409335504, u64=140124717367504}}}, 128, -1) = 1
[pid 12163] 11:09:21.614589 recvmsg(43, {msg_name(0)=NULL, msg_iov(1)=[{"\0\2\f\r\354\v\254\210S0$n@\363\306\377m\16\273\21\267\355\366\247\225\273m\273\250\200,g"..., 65536}], msg_controllen=0, msg_flags=0}, 0) = 1388
[pid 12163] 11:09:21.614620 recvmsg(43, 0x7f8704a54a30, 0) = -1 EAGAIN (Resource temporarily unavailable)
[pid 12163] 11:09:21.614649 epoll_wait(19, {{EPOLLIN, {u32=1409335504, u64=140124717367504}}}, 128, -1) = 1
[pid 12163] 11:09:21.614909 recvmsg(43, {msg_name(0)=NULL, msg_iov(1)=[{" \32\20\352/%\347\302~\356\241\346\314\367\343\371\277\177\340L\233\233\7\332\301\266\21\352\314?\262W"..., 65536}], msg_controllen=0, msg_flags=0}, 0) = 1388
[pid 12163] 11:09:21.614940 recvmsg(43, 0x7f8704a54a30, 0) = -1 EAGAIN (Resource temporarily unavailable)
[pid 12163] 11:09:21.614967 epoll_wait(19, {{EPOLLIN, {u32=1409335504, u64=140124717367504}}}, 128, -1) = 1
[pid 12163] 11:09:21.615221 recvmsg(43, {msg_name(0)=NULL, msg_iov(1)=[{"H0E\2!\0\367\313\315\34\322\25\25\6\316_<nV\301\206Y\261\357.\317\330\3\314\311\35\2"..., 65536}], msg_controllen=0, msg_flags=0}, 0) = 1388
[pid 12163] 11:09:21.615252 recvmsg(43, 0x7f8704a54a30, 0) = -1 EAGAIN (Resource temporarily unavailable)
[pid 12163] 11:09:21.615278 epoll_wait(19, {{EPOLLIN, {u32=1409335504, u64=140124717367504}}}, 128, -1) = 1
[pid 12163] 11:09:21.615530 recvmsg(43, {msg_name(0)=NULL, msg_iov(1)=[{"9\16\213`V\265^\340\322\342S2|XU\200ph\310\373\341`y[\330n\273\377\310=\315\363"..., 65536}], msg_controllen=0, msg_flags=0}, 0) = 1388
[pid 12163] 11:09:21.615562 recvmsg(43, 0x7f8704a54a30, 0) = -1 EAGAIN (Resource temporarily unavailable)
[pid 12163] 11:09:21.615589 epoll_wait(19, {{EPOLLIN, {u32=1409335504, u64=140124717367504}}}, 128, -1) = 1
[pid 12163] 11:09:21.615841 recvmsg(43, {msg_name(0)=NULL, msg_iov(1)=[{"\251\24\231N\226:\v\30)>.\264\33@\177uWS\354\333\347*\210\254\266;\1\0\0\0\0\0"..., 65536}], msg_controllen=0, msg_flags=0}, 0) = 1388
[pid 12163] 11:09:21.615872 recvmsg(43, 0x7f8704a54a30, 0) = -1 EAGAIN (Resource temporarily unavailable)
[pid 12163] 11:09:21.615900 epoll_wait(19, {{EPOLLIN, {u32=1409335504, u64=140124717367504}}}, 128, -1) = 1
[pid 12163] 11:09:21.616154 recvmsg(43, {msg_name(0)=NULL, msg_iov(1)=[{"*6\16\177?\365g\357]!\367\2\6\306\245\260\241\331yg\254?\1\0\0\0\212G0D\2 "..., 65536}], msg_controllen=0, msg_flags=0}, 0) = 1388
[pid 12163] 11:09:21.616185 recvmsg(43, 0x7f8704a54a30, 0) = -1 EAGAIN (Resource temporarily unavailable)

I've restarted the server a few times but it does not make a difference.

On the filesystem, the only other peculiarity is the ~/.libbitcoin/blockchain/db-lock file hasn't been removed or updated in four days, though I've shut down the server and restarted it cleanly many times since then.

evoskuil commented 9 years ago

This is the case of OpenSSL breaking consensus. If you want to use libbitcoin-consensus you will need to downgrade OpenSSL.

This is an issue with the Satoshi client and OpenSSL on which the optional consensus library is based.

We will be updating docs and tests cases to cover this. If we can find a good way to test OpenSSL in configure we'll also do that.

atweiden commented 9 years ago

Would it make a difference if I compiled libbitcoin-consensus with options --without-openssl --with-secp256k? This is what I used:

./configure \
  --prefix=/usr \
  --sbindir=/usr/bin \
  --libexecdir=/usr/lib/libbitcoin-consensus \
  --sysconfdir=/etc \
  --sharedstatedir=/usr/share/libbitcoin-consensus \
  --localstatedir=/var/lib/libbitcoin-consensus \
  --with-gnu-ld \
  --with-secp256k1 \
  --without-openssl \
  --without-tests
evoskuil commented 9 years ago

Yes, that should preclude the problem. As long as you are certain it's not linked to OpenSSL then it's a different issue. But strangely coincidental that it's happening in the same place.

evoskuil commented 9 years ago

See tracking issue I've added to libbitcoin-consensus: https://github.com/libbitcoin/libbitcoin-consensus/issues/14

evoskuil commented 9 years ago

I think we have tracked this to mis-parameterization (flags) of libbitcoin-consensus in libbitcoin-blockchain.

evoskuil commented 9 years ago

I've checked in a correction to the flags used when calling libbitcoin-consensus from libbitcoin-blockchain. This was the cause of sync halting and triggering of DoS protection by bitcoind peers.