nazar-pc / webtorrent-dht

This is an example implementation of something that might become WebTorrent DHT
29 stars 3 forks source link

Unpredictable behaviour on startup #5

Closed derpsteb closed 6 years ago

derpsteb commented 6 years ago

I wrote a automated test to verify the behaviour of the library when starting up and when writing and reading data from the dht. This test could be added to this repo if wanted without problems.

But I am experiencing weird behaviour when running the test with multiple nodes. The test works as follows:

  1. Create bs node
  2. Wait for listening event of bs node
  3. Create n nodes
  4. Wait for every ready event on each node
  5. Save a string to the dht from a random node
  6. Retrieve the string by its hash from another random node

The test code is posted here: https://gitlab.com/derpsteb/p2p-signaling/blob/develop/e2e/testWebtorrent.js

The printed statements should explain what is happening very easily.

I thought it may be more readable this way. One can use this small script to run the test multiple times to see the behaviour very quickly: https://gitlab.com/derpsteb/p2p-signaling/snippets/1699083

The behaviour I am talking about is that all nodes get created without any problems. Most of the nodes also start up without problems. But most of the times one or two nodes get stuck in a state where they do not connect with the bs node but also do not fire the ready event. I tried to comprehend what is happening but did not get further than seeing the debugger run in a loop where it executes some code of the simple-peer library where a timer gets reset every time. This is the complete opposite of what was happening with my last issue :D. Maybe you find the time to take a look.

nazar-pc commented 6 years ago

Test cases are great, but I'd like to adopt bittorrent-dht's tests at some point, which should work either without changes or with minor tweaks. Also Detox project is testing a large portion of this project implicitly.

Will definitely test your observations. BTW, what loop are you talking about?

P.S. node-webrtc is used here as dependency and I have found a few segfaults and other minor issues, some of which have been fixed already, hopefully this is not one of them as i complicates debugging a lot.

derpsteb commented 6 years ago

I can not really describe the loop. But the debugger is always hung at the same line within this class: https://nodejs.org/api/async_hooks.html I can not look it up right now and will tell you tomorrow. But you would see right away when running the test with the debugger attached.

nazar-pc commented 6 years ago

That is the beauty of asynchronous code, most of the things originate there. Will work on it today.

nazar-pc commented 6 years ago

I've encountered this issue a few times, but was not yet able to identify where and why it happens, marking as a bug for now.

derpsteb commented 6 years ago

Okay. Do you already know when you will work on this? I'll also try to find any other information. Maybe any other things I can help with?

nazar-pc commented 6 years ago

I need a more reliable way to reproduce it. Running the same script 25 times to get reproduction is not very productive. I also suspect it might be caused by node-webrtc, so I'll try to tackle it again soon, but don't promise anything specific.

derpsteb commented 6 years ago

One thing I observed is that when setting the NODES variable to 12 and higher it fails pretty much always. At least it did not run a single time after executing the test around 25 times. But I will try to find something like a minimal failing demo.

nazar-pc commented 6 years ago

I found that sometimes it so happens that node receives 2 transactions with the same ID (which normally should never happen): first time as response and second time as an error, which doesn't make any sense. Will investigate further, had to increase number of nodes to 20 in order to hit the issue ~1/15 runs.

nazar-pc commented 6 years ago

I think I've fixed it in master branch, can you confirm?

derpsteb commented 6 years ago

Sorry. I just realised that I ran the test without a bs node. If doing so there is no problem. But when running the test again, with a bs node running the error comes up after running the first test with 30 nodes.

nazar-pc commented 6 years ago

With 30 nodes I more often either get a complete lockup of Node process (can inspect anything, Enter in console doesn't respond so I can't print anything) or it simply crashes altogether with segmentation fault (https://github.com/js-platform/node-webrtc/issues/362).

With 20 it is quite difficult to catch the issue, hopefully I'll figure it out today. Currently suspecting k-rpc and k-rpc-socket and I don't see anymore anything in webtorrent-dht that might cause this issue.

nazar-pc commented 6 years ago

Narrowed down to following Promise that occasionally doesn't resolve at all: https://github.com/nazar-pc/webtorrent-dht/blob/518ea895c843eb4944f0d91b3adad28855495cf9/k-rpc-socket-webrtc.ls#L197-L216

nazar-pc commented 6 years ago

Occasionally something happens with ICE and connection is closed without error (which code relied on).

Could you test master branch to confirm that the issue was indeed fixed this time?

derpsteb commented 6 years ago

It looks like it is a fix this time :D. And I did not forget to run a bs node this time.

I would like to come back tonight and check again since the test is not running smoothly yet. But it looks like it has to do something with tape or the dht.destroy() function this time. After around 15-20 runs the test does not exit anymore and is stuck after printing "Destroyed a node" NODE-times. Or at least takes quite a while to continue with execution.

nazar-pc commented 6 years ago

I've seen something like that in the past too. Try to use wrtc from this branch: https://github.com/js-platform/node-webrtc/tree/fix-segfault-on-datachannel-close This is the second segfault fix on top of latest version available on NPM and it does work better for me (this is what I've tested with already). In any case, it should exit eventually.

Also this is what I've used for tests (a bit simpler that your demo):

```javascript "use strict"; const TAPE = require("tape"); const DHT = require("webtorrent-dht"); const SERVER_OPTIONS = { nodes: [], simple_peer_opts: { config: { iceServers: [] } }, isServer: true }; const CLIENT_OPTIONS = { nodes: [{ host: "127.0.0.1", port: 16882 }], simple_peer_opts: { config: { iceServers: [] } }, isServer: false }; //This array is used to delete all created dhts so the test exits properly. global.running = []; global.ready = []; TAPE("bootstrapping test", (t) => { const NODES = 20; let p1 = new Promise((resolve) => { let bsDht = new DHT(SERVER_OPTIONS); running.push(bsDht); bsDht.listen(16882, "127.0.0.1", () => {}); bsDht.once("listening", () => { t.pass("BS node running"); resolve(); }); }); let tmpDht = null; let promises = []; p1.then(() => { for(let i = 0; i < NODES; i++){ let tmpPromise = new Promise((resolve) => { tmpDht = new DHT(CLIENT_OPTIONS); running.push(tmpDht); t.comment("Started Node " + i); ready[i] = tmpDht; tmpDht.once("ready", () => { ready[i] = null; t.pass("DHT ready event received " + i); resolve(); }); }); promises.push(tmpPromise); } //Only continue if all dhts started up properly return Promise.all(promises); }) .then(function () { console.log('done'); t.end(); }) }); TAPE.onFinish(() => { running.forEach((entry) => { entry.destroy(); }); }); ```

And BTW thanks for reporting issues and testing, this greatly helps to mature such experimental piece of software 👍

derpsteb commented 6 years ago

Yeah I have used pretty much the same thing as you just posted while testing your fixes :D. Thanks for the tip with the wrtc branch!

No problem. Thank you for responding and fixing stuff so quickly and reliably :D