tsujio / webrtc-chord

An implementation of Chord, a protocol of Distributed Hash Table, using WebRTC.
MIT License

Simultaneous sending and closing connection by sender and receiver #8

Open tsujio opened 10 years ago

tsujio commented 10 years ago

If a send and a close happen simultaneously on the same connection, the send (probably) fails silently, without any errors.

This seems to be the cause of many timeouts. (When config.connectionPoolSize is 9999 and Node.disconnect is replaced with a no-op, no timeouts occur in my environment.)
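One common way to avoid this kind of race is to track in-flight sends per connection and defer the close until they have drained. The sketch below is illustrative only; `createGuardedConnection` and the `channel` shape are hypothetical and not webrtc-chord's actual API.

```javascript
// Sketch (hypothetical API): wrap a channel so that close() waits
// for all in-flight sends to complete before actually closing.
function createGuardedConnection(channel) {
  var pending = 0;
  var closeRequested = false;

  function maybeClose() {
    // Close only once a close was requested AND no sends are in flight
    if (closeRequested && pending === 0) {
      channel.close();
    }
  }

  return {
    send: function (msg, callback) {
      pending++;
      channel.send(msg, function (error) {
        pending--;
        if (callback) callback(error);
        maybeClose(); // safe to close after the last send completes
      });
    },
    close: function () {
      closeRequested = true;
      maybeClose();
    }
  };
}
```

With this wrapper, a close requested while a send is pending is simply deferred until the send's completion callback fires.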

jure commented 10 years ago

Interesting. I'll try to repeat your experiment.

jure commented 10 years ago

Could you share the code that you're using to test this?

I've used a script: https://gist.github.com/jure/26d6e002c305a2f4f464, which creates a network of CLIENTS_NUM nodes and, every second, inserts an entry into the network from each node and retrieves an entry from each node.

I haven't actually noticed any difference between connectionPoolSize 9999/10 and Node.disconnect default/noop, but I'm seeing these in both cases:

Error {stack: (...), message: "Unknown situation."}

and:

Failed to notify and copy entries: Error {stack: (...), message: "Reached maximum number of attempts of NOTIFY_AND_COPY."}
tsujio commented 10 years ago

I use my chord-monitor (open chord-monitor.html to run it) and can create about 30 nodes without timeouts. I'll also try your script.

jure commented 10 years ago

Any idea why there would be timeouts with just 30 nodes? Is there any inherent slowness in the process, i.e. the browser-side WebRTC implementation?

tsujio commented 10 years ago

I tried your script and saw many timeouts. But I think your script generates too many requests, and that is what causes the timeouts.

setInterval(function () {
  _.each(chords, function (chord) {
    ...
    chord.retrieve(key, function(entries, error) {
      ...
    });
  });
}, 1000);

This code generates CLIENTS_NUM (40) * 2 (insert & retrieve) == 80 requests per second.
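A quick back-of-the-envelope check makes the problem concrete. The 25 req/sec figure below is an assumed midpoint of the 20-30 req/sec throughput measured later in this thread, used only for illustration.

```javascript
// Offered load vs. measured capacity (numbers from this thread).
var CLIENTS_NUM = 40;
var offeredLoad = CLIENTS_NUM * 2;   // one insert + one retrieve per node, per second
var measuredThroughput = 25;          // assumed midpoint of the ~20-30 req/sec measured below
var backlogGrowthPerSec = offeredLoad - measuredThroughput;
// The queue of waiting requests grows by ~55 every second,
// so requests eventually wait long enough to hit their timeouts.
```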

I measured request throughput with https://gist.github.com/tsujio/c49497d4dff96870f5ba and got the following output. (Note that there are 20 nodes in the network; I used 20 because of my PC's specs.)

elapsed 1588 milliseconds to process 10 requests. gistfile1.js:72
elapsed 2828 milliseconds to process 60 requests. gistfile1.js:72
elapsed 3480 milliseconds to process 110 requests. gistfile1.js:72
elapsed 5902 milliseconds to process 160 requests. gistfile1.js:72
...
elapsed 30763 milliseconds to process 660 requests. gistfile1.js:72
elapsed 40212 milliseconds to process 710 requests. gistfile1.js:72
elapsed 34025 milliseconds to process 760 requests. gistfile1.js:72

Throughput is roughly 20-30 req/sec; it depends on the number of nodes in the network. (The first few results take a long time because the network has not yet converged.)

Other messages are also exchanged to maintain the network, so the number of waiting requests grows and eventually many timeouts occur, I think.

jure commented 10 years ago

Thank you for the experimentation! It would be useful to log the current number of connections in this script as well, to see whether the waiting requests are the issue. I'll do that in the afternoon.

Do you think this is a throughput limitation in a global/online setting as well? For example, if my network had 40 nodes, each on a separate machine, and each fired 1 insert/retrieve per second, would I see timeouts as well? Or is this occurring because all nodes are running on a single thread on a single computer?

Anyway, I'll keep digging.

tsujio commented 10 years ago

"I haven't actually noticed any difference between connectionPoolSize 9999/10 and Node.disconnect default/noop, but I'm seeing these in both cases:"

Those two modifications suppress the closing of connections, in order to avoid conflicts between sending and closing.

"Reached maximum number of attempts of NOTIFY_AND_COPY."

This message indicates that the node failed to join at its appropriate position. It often occurs when many nodes join simultaneously. The node's position will be improved by the stabilize task.
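The reason a badly-placed node recovers over time is the periodic stabilize step from the Chord paper: each node asks its successor for that successor's predecessor, and adopts it as a closer successor if it lies between them on the identifier ring. The sketch below uses plain numeric IDs and hypothetical `node` objects for illustration; webrtc-chord's internal representation will differ.

```javascript
// Circular interval test: is id strictly inside (from, to) on the ring?
function inOpenInterval(id, from, to) {
  if (from < to) return from < id && id < to;
  return from < id || id < to; // interval wraps around zero
}

// One stabilize round for `node` (per the Chord protocol):
// if a closer node has joined between us and our successor, adopt it,
// then notify our (possibly new) successor that we may be its predecessor.
function stabilize(node) {
  var x = node.successor.predecessor;
  if (x && inOpenInterval(x.id, node.id, node.successor.id)) {
    node.successor = x;
  }
  node.successor.notify(node);
}
```

Run periodically on every node, this gradually corrects successor pointers even when many nodes joined at once and some initially landed in the wrong place.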

"Unknown situation."

This is a bug in my code. I will investigate it.

tsujio commented 10 years ago

"Do you think this is a throughput limitation in a global/online setting as well?"

An insert/retrieve request generates log(CLIENTS_NUM) FIND_SUCCESSOR requests plus one INSERT/RETRIEVE request, and both sender and receiver are on the same machine. My measurement script processes all of them in a single thread, so in the real world the throughput would be better.
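That message count can be made concrete. Treating log(CLIENTS_NUM) as a base-2 logarithm rounded up (an assumption; the exact hop count depends on the finger tables), each request in the 40-node test looks roughly like this:

```javascript
// Rough per-request message count under O(log N) Chord routing.
// Illustrative only: the true hop count varies with finger-table state.
var CLIENTS_NUM = 40;
var findSuccessorHops = Math.ceil(Math.log2(CLIENTS_NUM)); // ceil(log2(40)) = 6
var messagesPerRequest = findSuccessorHops + 1;             // plus the INSERT/RETRIEVE itself
```

So 80 insert/retrieve requests per second translate to several hundred messages per second, all handled by one thread when every node runs in the same page.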

tsujio commented 10 years ago

@jure,

I've pushed #11. Please check that timeouts no longer occur under a reasonable request rate.

If ok, I'll merge it to master.