Stall - Githubissues

michaelquigley commented 4 years ago

Determine the root cause of the tx/rx stall.

Now that the rxPortalSz signalling is in place and the flow control appears to be in a stable place, tuning-wise... We need to figure out the root cause of these stalls:

[  70.001]    INFO dilithium/protocol/loop.(*transferReporter).run: [rx] 20.9 MB/sec [:4934, 1 pending]
[  71.001]    INFO dilithium/protocol/loop.(*transferReporter).run: [rx] 20.3 MB/sec [:4801, 0 pending]
[  72.001]    INFO dilithium/protocol/loop.(*transferReporter).run: [rx] 20.5 MB/sec [:4852, 1 pending]
[  73.001]    INFO dilithium/protocol/loop.(*transferReporter).run: [rx] 20.4 MB/sec [:4832, 0 pending]
[  74.001]    INFO dilithium/protocol/loop.(*transferReporter).run: [rx] 17.0 MB/sec [:4027, 0 pending]
[  75.001]    INFO dilithium/protocol/loop.(*transferReporter).run: [rx] 20.2 MB/sec [:4775, 0 pending]
[  76.001]    INFO dilithium/protocol/loop.(*transferReporter).run: [rx] 20.8 MB/sec [:4907, 1 pending]
[  77.001]    INFO dilithium/protocol/loop.(*transferReporter).run: [rx] 19.7 MB/sec [:4659, 0 pending]
[  78.001]    INFO dilithium/protocol/loop.(*transferReporter).run: [rx] 20.6 MB/sec [:4874, 0 pending]
[  79.001]    INFO dilithium/protocol/loop.(*transferReporter).run: [rx] 20.7 MB/sec [:4899, 0 pending]
[  80.001]    INFO dilithium/protocol/loop.(*transferReporter).run: [rx] 20.2 MB/sec [:4771, 0 pending]
[  81.001]    INFO dilithium/protocol/loop.(*transferReporter).run: [rx] 20.6 MB/sec [:4883, 0 pending]
[  82.001]    INFO dilithium/protocol/loop.(*transferReporter).run: [rx] 20.5 MB/sec [:4856, 0 pending]
[  83.002]    INFO dilithium/protocol/loop.(*transferReporter).run: [rx] 20.5 MB/sec [:4849, 1 pending]
[  84.001]    INFO dilithium/protocol/loop.(*transferReporter).run: [rx] 20.2 MB/sec [:4771, 0 pending]
[  85.001]    INFO dilithium/protocol/loop.(*transferReporter).run: [rx] 20.2 MB/sec [:4784, 1 pending]
[  86.001]    INFO dilithium/protocol/loop.(*transferReporter).run: [rx] 20.5 MB/sec [:4855, 0 pending]
[  87.001]    INFO dilithium/protocol/loop.(*transferReporter).run: [rx] 19.9 MB/sec [:4707, 0 pending]
[  88.001]    INFO dilithium/protocol/loop.(*transferReporter).run: [rx] 20.3 MB/sec [:4800, 0 pending]
[  89.001]    INFO dilithium/protocol/loop.(*transferReporter).run: [rx] 20.3 MB/sec [:4805, 0 pending]
[  90.001]    INFO dilithium/protocol/loop.(*transferReporter).run: [rx] 20.1 MB/sec [:4761, 0 pending]
[  91.001]    INFO dilithium/protocol/loop.(*transferReporter).run: [rx] 21.0 MB/sec [:4960, 0 pending]
[  92.001]    INFO dilithium/protocol/loop.(*transferReporter).run: [rx] 12.3 MB/sec [:2914, 0 pending]
[  93.001]    INFO dilithium/protocol/loop.(*transferReporter).run: [rx] 0 B/sec [:-1, 0 pending]
[  94.001]    INFO dilithium/protocol/loop.(*transferReporter).run: [rx] 0 B/sec [:-1, 0 pending]
[  95.001]    INFO dilithium/protocol/loop.(*transferReporter).run: [rx] 0 B/sec [:-1, 0 pending]
[  96.001]    INFO dilithium/protocol/loop.(*transferReporter).run: [rx] 0 B/sec [:-1, 0 pending]
[  97.001]    INFO dilithium/protocol/loop.(*transferReporter).run: [rx] 0 B/sec [:-1, 0 pending]
[  98.001]    INFO dilithium/protocol/loop.(*transferReporter).run: [rx] 0 B/sec [:-1, 0 pending]
[  99.001]    INFO dilithium/protocol/loop.(*transferReporter).run: [rx] 0 B/sec [:-1, 0 pending]

michaelquigley commented 4 years ago

This may actually be a transmitter-side issue:

[  90.865]    INFO dilithium/protocol/loop.(*transferReporter).run: [tx] 21.3 MB/sec [:5028, 0 pending]
[  91.865]    INFO dilithium/protocol/loop.(*transferReporter).run: [tx] 21.1 MB/sec [:4982, 0 pending]
[  92.865]    INFO dilithium/protocol/loop.(*transferReporter).run: [tx] 18.9 MB/sec [:4472, 0 pending]
[  93.865]    INFO dilithium/protocol/loop.(*transferReporter).run: [tx] 11.7 MB/sec [:2761, 0 pending]
[  94.865]    INFO dilithium/protocol/loop.(*transferReporter).run: [tx] 0 B/sec [:-1, 0 pending]
[  95.865]    INFO dilithium/protocol/loop.(*transferReporter).run: [tx] 0 B/sec [:-1, 0 pending]
[  96.865]    INFO dilithium/protocol/loop.(*transferReporter).run: [tx] 0 B/sec [:-1, 0 pending]
[  97.865]    INFO dilithium/protocol/loop.(*transferReporter).run: [tx] 0 B/sec [:-1, 0 pending]
[  98.865]    INFO dilithium/protocol/loop.(*transferReporter).run: [tx] 0 B/sec [:-1, 0 pending]
[  99.865]    INFO dilithium/protocol/loop.(*transferReporter).run: [tx] 0 B/sec [:-1, 0 pending]

michaelquigley commented 4 years ago

Appears to be related to a large, out-of-date rxPortalSz value in the txPortal. And, when there is nothing to be monitored for retx, we will not provoke any ACK traffic, which leads to the stalled condition.

openziti / dilithium

Stall #67