Open eriktaubeneck opened 2 weeks ago
I suspect this may be the same "data in flight exceeds available buffering, leading to a deadlock" behavior as we saw in the past (#1073, #1085, #1104).
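For anyone not familiar with that failure mode, here is a minimal, standalone Rust sketch (not IPA code; `active_work` and `buffer_capacity` are illustrative names) of how in-flight work can outrun a bounded buffer. The comments mark where the stall would occur if the consumer could not drain until the whole batch arrived.

```rust
use std::sync::mpsc::sync_channel;
use std::thread;

fn main() {
    // "active_work": how many records the protocol tries to keep in flight.
    // "buffer_capacity": how much of that the transport can actually hold.
    let active_work = 8u32;
    let buffer_capacity = 4usize;

    let (tx, rx) = sync_channel::<u32>(buffer_capacity);

    let producer = thread::spawn(move || {
        for record in 0..active_work {
            // Blocks once the buffer is full. If the consumer were gated on
            // receiving the whole batch before draining anything, this send
            // would never complete and the pipeline would stall: the
            // "data in flight exceeds available buffering" deadlock.
            tx.send(record).expect("receiver dropped");
        }
    });

    // Draining eagerly keeps the producer unblocked, so this run completes.
    for record in rx {
        println!("received record {record}");
    }

    producer.join().unwrap();
}
```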
In #1245 I added a helper command-line option to override the active work. It would be an interesting data point to investigate whether reducing active_work resolves the problem, either using that branch or using a helper built with a manually adjusted value.
The default active_work is currently 32,768. I would try reducing it all the way to something like 32 or 128. If that does resolve the issue, then we may want to investigate exactly where it becomes problematic.
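Purely as an illustrative sketch (the flag name, type, and wiring below are assumptions for illustration, not the actual #1245 change), an override like this could be exposed with clap and passed down into the query configuration:

```rust
use clap::Parser;

/// Hypothetical helper arguments; only the active_work override is shown.
#[derive(Parser, Debug)]
struct HelperArgs {
    /// Number of records allowed in flight at once. Lowering this (e.g. to 32
    /// or 128) trades throughput for a smaller buffering requirement.
    #[arg(long, default_value_t = 32_768)]
    active_work: usize,
}

fn main() {
    let args = HelperArgs::parse();
    println!("running with active_work = {}", args.active_work);
    // ... pass args.active_work into the query/session configuration ...
}
```

With something along those lines in place, launching the helper with the option set to 32 (flag name hypothetical) would be a quick way to check whether a smaller window avoids the stall.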
As we've begun testing with external helper parties, we've been able to run tests across AWS, GCloud, and Azure. In certain circumstances, these tests stall.
To further complicate matters, I set up a free-tier Azure account to test with, and even with those limited resources I get successful runs all the way up to 1M input rows. It seems the issue has something to do with the Helper Party's network settings, and I'm working with them to troubleshoot which setting that might be.
However, it still seems odd to me that with whatever settings they have, we successfully run small queries (1k, 10k, 100k) but stall reliably at 1M. Here are the logs from two such stalls:
Test 1
Helper 1 (AWS)
Helper 2 (Azure)
Helper 3 (GCloud)
Test 2
Helper 1 (AWS)
Helper 2 (Azure)
(This last message repeats until we stop it; the log is truncated here to avoid redundancy.)
Helper 3 (GCloud)