Peer thread running wild

GoogleCodeExporter commented 9 years ago

I just witnessed two Peer threads downloading at the same time.

It started with one thread downloading quite slow (5 blocks/second). I decided 
to stop the PeerGroup, because I did not want to wait this long. But the thread 
did not stop, it continued downloading. I started the app again (instantiating 
a new PeerGroup I guess), and now two Peers were downloading at the same time, 
but at different block heights. The app continuously switched between "2 weeks 
behind" and "10 weeks behind". At some point the running wild Peer stopped, but 
it really took a while. I hope my blockchain was not damaged by this incident.

Original issue reported on code.google.com by andreas....@gmail.com on 21 Mar 2012 at 7:59

GoogleCodeExporter commented 9 years ago

Looks like it is damaged. I just got this exception which I never got:

E/AndroidRuntime(20011): FATAL EXCEPTION: PeerGroup-4-thread-1
E/AndroidRuntime(20011): java.lang.NullPointerException
E/AndroidRuntime(20011):    at 
com.google.bitcoin.core.BlockChain.findSplit(BlockChain.java:322)
E/AndroidRuntime(20011):    at 
com.google.bitcoin.core.BlockChain.connectBlock(BlockChain.java:238)
E/AndroidRuntime(20011):    at 
com.google.bitcoin.core.BlockChain.add(BlockChain.java:210)
E/AndroidRuntime(20011):    at 
com.google.bitcoin.core.BlockChain.add(BlockChain.java:135)
E/AndroidRuntime(20011):    at 
com.google.bitcoin.core.Peer.processBlock(Peer.java:336)
E/AndroidRuntime(20011):    at com.google.bitcoin.core.Peer.run(Peer.java:201)
E/AndroidRuntime(20011):    at 
com.google.bitcoin.core.PeerGroup$4.run(PeerGroup.java:547)
E/AndroidRuntime(20011):    at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1076)
E/AndroidRuntime(20011):    at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:569)
E/AndroidRuntime(20011):    at java.lang.Thread.run(Thread.java:856)

Original comment by andreas....@gmail.com on 21 Mar 2012 at 8:01

GoogleCodeExporter commented 9 years ago

As to why the download peer did not die - are you sure PeerGroup.stop() was 
called?

As to the corruption - could it be that two BlockChain objects were created by 
the app, both referring to the same underlying BlockStore?

BlockChain.add() is synchronized, so having two download peers should not cause 
corruption.

Original comment by mi...@google.com on 21 Mar 2012 at 9:59

GoogleCodeExporter commented 9 years ago

I'm pretty sure stop() was called.

The two BlockChain objects are very well possible, and I cannot prevent it - 
since PeerGroup.stop() is not synchronous.

Original comment by andreas....@gmail.com on 21 Mar 2012 at 10:06

GoogleCodeExporter commented 9 years ago

BlockChain objects are created by you, the app owner. :)

So you just have to make sure that you only create one per BlockStore.

Original comment by mi...@google.com on 21 Mar 2012 at 10:10

GoogleCodeExporter commented 9 years ago

Each instance of the Apps Service creates exactly one BlockStore, one 
BlockChain and one PeerGroup. But I cannot know how long PeerGroup is still 
using its BlockChain/BlockStore because PeerGroup.stop() is not synchronized. 
See what I mean?

So if stop() takes 30 minutes (which it did in the above case), chances are 
very high that the user restarts the app in this timeframe and thus second 
instances of PeerGroup/BlockChain/BlockStore might be created (I'm not sure). 
If this is the case, AFAIK I cannot do anything about it.

Making PeerGroup.stop() synchronized is not a solution, because that would 
block the main thread (for up to 30 minutes...)

IMHO we really should make sure Peer threads die very soon after stop() is 
called - but without leaving BlockStores or Wallets in an inconsistent state.

Perhaps my SQLiteBlockStore would help because databases are typically designed 
for concurrent access from different processes.

Original comment by andreas....@gmail.com on 21 Mar 2012 at 11:08

GoogleCodeExporter commented 9 years ago

If the app is really restarted (the old app instance is killed), then the VM is 
destroyed.

PeerGroup.stop() should be fast, because it closes all the peer sockets.  
Unless of course there is a bug.

After you call PeerGroup.stop(), do you create a new PeerGroup?  Maybe I could 
look at that part of your code.

Original comment by mi...@google.com on 21 Mar 2012 at 11:49

GoogleCodeExporter commented 9 years ago

If one Thread is busy backscanning the block chain (which can take about 20 
seconds), PeerGroup.stop() isn't fast but wait for the thread to complete.

Original comment by andreas....@gmail.com on 22 Mar 2012 at 8:06

GoogleCodeExporter commented 9 years ago

This issue was closed by revision 92398d2c4770.

Original comment by mi...@google.com on 26 Mar 2012 at 7:18

Changed state: Fixed

GoogleCodeExporter commented 9 years ago

I'm not confident that this resolves the issue, but lets see.

Original comment by mi...@google.com on 26 Mar 2012 at 7:19

GoogleCodeExporter commented 9 years ago

This issue was closed by revision 9474eaa0d4dc.

Original comment by mi...@google.com on 26 Mar 2012 at 10:24

GoogleCodeExporter commented 9 years ago

[deleted comment]

GoogleCodeExporter commented 9 years ago

This issue was closed by revision 1e52a6eccce0.

Original comment by mi...@google.com on 26 Mar 2012 at 11:16

Changed state: Fixed

GoogleCodeExporter commented 9 years ago

Another try - this should work better on sockets with a connect in progress.

Original comment by mi...@google.com on 26 Mar 2012 at 11:18

GoogleCodeExporter commented 9 years ago

I did some println testing. Usually I start the app, wait until some 
"Connecting to peer" messages appear and then stop the app (which calls 
PeerGroup.close())

Even with the latest version, Timeouts talking to Peer do not time out until 
60s after.

Some more things:

Peer.disconnect and TCPNetworkConnection.shutdown() seem to called only for 
Peers that are already connected - not for Peers that are still waiting to be 
connected.

On one Peer, I am observing a disconnect and shutdown but the shutdown seems to 
hang indifinitely at socket.shutdownOutput().

And even more: After shutting down, I still see the occasional Connecting to 
Peer message.

I think I mentioned this some a year ago: Did you already look at Apache Mina? 
They should have solved the same problems, and when I used it 3 years ago for a 
low-level-socket protocol, it worked quite flawlessly.

Original comment by andreas....@gmail.com on 27 Mar 2012 at 9:19

GoogleCodeExporter commented 9 years ago

Yeah, this is getting more and more complicated. Definitely worth checking out 
a library. Or just switching fully to NIO.

Original comment by hearn@google.com on 27 Mar 2012 at 9:45

GoogleCodeExporter commented 9 years ago

The latest patch appears to resolve the disconnect/shutdown methods not called 
for all peers.

However, I still get occasional connects after the PeerGroup is well into its 
shutdown process. And I still get Timeout talking to Peer after 60s.

Original comment by andreas....@gmail.com on 27 Mar 2012 at 4:01

GoogleCodeExporter commented 9 years ago

Another try: revision 3bc999a032b7d7cae5b285ddd36868d2ad453459.

Original comment by mi...@google.com on 27 Mar 2012 at 5:30

GoogleCodeExporter commented 9 years ago

The timeout after 60s is still there. For each timeout there is now an 
additional disconnect/shutdown after the timeout occured.

Original comment by andreas....@gmail.com on 27 Mar 2012 at 11:35

GoogleCodeExporter commented 9 years ago

Out of ideas - would need to use a debugger to figure it out.

I started work on an NIO solution, but this will take awhile.

Original comment by mi...@google.com on 28 Mar 2012 at 3:24

Changed state: Accepted

GoogleCodeExporter commented 9 years ago

I think this is fixed now, we did a lot of improvements in this area and I 
didn't see it happen for a long time.

Original comment by hearn@google.com on 14 Dec 2012 at 4:07

Changed state: Fixed

novitski / bitcoinj

Peer thread running wild #161