Closed by tbrowder 10 months ago
What stage does it hang in? Is it testing? If so, could you clone the repo and run prove6 -I. -j1 to see which test is the last to succeed?
Overall, Test::Async does a lot of testing, and some tests are really slow. I'm not talking about an Apple Silicon M2 here: my Linux box runs on an Intel Xeon E5-2690 v4, 2.6GHz, and a full test run takes ~2min.
Yes, I will do that.
I ran "prove6 -I. -j1". It hung at test 150. I killed it after 3 minutes.
I ran it again and noted it took about 35 seconds to get to test 150.
Want me to wait longer? My OS is Debian 11.
Is 150 the last visible? Then it means that 160 is the culprit. Either way, can you try running both manually with raku -I. t/<test-name>? Perhaps it would help to pinpoint the exact location where it happens.
Testing can be really slow at some moments. First, because Test::Async is generally slower than the standard Test (concurrency support is costly). Second, because 190, for example, is a probability test which starts thousands of threads.
Another point: if you have the STRESS_TESTING environment variable set, 170-heavily-concurrent.rakutest would be run too, which might be slow as well.
But with prove6 -j1, if either 170 or 190 were somehow involved, the last passed tests would be 160 and 180 respectively. So, currently I'm totally in limbo as to what's going on.
Oh, and another crazy thought has just crossed my mind. What version does it try to install?
As to the repo clone, try git pull. It shouldn't be the case, but the main branch could have lagged behind v0.1, where I was making the latest changes.
Okey dokey, wilco.
First I resynched my fork of Test::Async on Github. Then pulled my local clone of the main branch.
I tried all tests individually starting with 150, which tested okay. 160 hung as you said, printed the 1..5. I waited 2+ minutes and killed it. Set "export STRESS_TESTING=1" and ran 170, it took just a few seconds. Rest of the tests all ran quickly.
I saw no errors but now I'm running "zef test --debug ."
Still hung after 150 with no message.
Any more ideas? Any system package I might be missing?
There is nothing in 160 to make it that slow. So, there is either a deadlock or deep recursion. What version of Rakudo do you have? Otherwise I currently don't have the slightest idea of what it might be.
Things I'd try are all about running the test manually with various debugging measures. First of all, it'd make sense to disable all :parallel and :random adverbs in the test itself. This should make it possible to pinpoint the exact subtest that fails. Once that is known, I can only think of the normal debugging procedures, with debug prints and all sorts of techniques we have.
Raku v2023.10.
Testing 160 alone:
I first tried commenting out all :random and :parallel adverbs, all okay.
Then turned on all :random, all okay.
Then I turned on all individual test/subtest :parallel adverbs, all okay.
Then I turned on the universal :parallel at the top (all other :parallel off), halt again.
Weird. My host has 16 cores. When RAKUDO_MAX_THREADS=16, test 160 halts. When I set it to ANY other value, no halts! That is a holdover env var from a long time ago. I'm going to check my other host. That env var has the same setting, but I don't remember the number of cores.
That host has 8 cores. Max threads are 16. Halts at test 160. Set max threads to 64, test passes!
What if you set the max to 8? My guess is that it would freeze again.
At least it gives me a clue. A deadlock is possible if there are threads awaiting an event (say, a kept promise) in a blocked state (which happens, for example, inside a lock), but the thread that would actually keep the promise can't be started because the thread pool is already exhausted.
The last release should've solved this issue.
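For readers following along, the thread-pool exhaustion scenario described above can be sketched in a few lines. This is only an analogy in Python, not Test::Async's code and not Rakudo's scheduler; the function name keeper_can_run and the pool sizes are made up for illustration. Every worker thread blocks on an event, and the one task that would set the event stays queued because no worker is free.

```python
import threading
from concurrent.futures import ThreadPoolExecutor, TimeoutError as FutureTimeout

def keeper_can_run(pool_size, waiters, patience=0.25):
    """Return True if the task that releases the waiters gets a worker thread.

    Hypothetical helper for illustration only.
    """
    released = threading.Event()
    with ThreadPoolExecutor(max_workers=pool_size) as pool:
        # Each waiter holds a worker thread while it blocks on the event.
        for _ in range(waiters):
            pool.submit(released.wait)
        # The "keeper" is queued last and needs a free worker to run at all.
        keeper = pool.submit(released.set)
        try:
            keeper.result(timeout=patience)
            return True
        except FutureTimeout:
            return False
        finally:
            # Release the waiters from outside the pool so the demo terminates.
            released.set()

print(keeper_can_run(pool_size=4, waiters=4))  # False: pool exhausted, keeper never starts
print(keeper_can_run(pool_size=5, waiters=4))  # True: one spare worker runs the keeper
```

Without the release in the finally clause, the first call would hang forever, which is the same shape as the freeze reported here. A larger pool only makes the freeze go away when it is big enough to leave a worker free for the keeper.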
I then gave up waiting and aborted the attempt
I tried "zef install --force-install Test::Async" and got the same behavior, and gave up and aborted the process.
Both my hosts are reasonably capable, lots of memory and SSD drives.