rakudo / rakudo

🦋 Rakudo – Raku on MoarVM, JVM, and JS
https://rakudo.org/
Artistic License 2.0

zef no longer passes CI tests #3157

Open ugexe opened 5 years ago

ugexe commented 5 years ago

The Problem

zef no longer finishes CI testing on AppVeyor. It seems to be related to runtime-loaded modules in some way. See https://ci.appveyor.com/project/ugexe/zef/builds/27095833#L5190 (which has been rerun multiple times, always stopping in the same spot).

Expected Behavior

rakudo doesn't deadlock when running zef's CI testing on AppVeyor, as in https://ci.appveyor.com/project/ugexe/zef/builds/26433562

Actual Behavior

zef deadlocks while trying to use runtime-loaded code.

Steps to Reproduce

Copy zef's .appveyor.yml to a repository and have AppVeyor process it.

Environment

ugexe commented 5 years ago

On a local Windows VM it gets stuck during precompilation when installing zef:

C:\Users\ugexe\zef>set RAKUDO_LOG_PRECOMP=1

C:\Users\ugexe\zef>perl6 -I. bin\zef install C:\Users\ugexe\zef --debug
===> Fetching: C:\Users\ugexe\zef
Fetching with plugin: Zef::Service::FetchPath+{<anon|1>}
===> Fetching [OK]: C:\Users\ugexe\zef to C:\Users\ugexe/.zef/tmp\1567304861.7128.1365\zef\zef_1567304861
===> Extracting: C:\Users\ugexe\zef
Extracting with plugin: Zef::Service::FetchPath+{<anon|1>}
===> Extraction [OK]: C:\Users\ugexe\zef to C:\Users\ugexe/.zef/store\zef_1567304861
===> Dependencies: NativeCall, Test
===> Filtering: zef:ver<0.7.4>:auth<github:ugexe>:api<0>
===> Filtering [OK] for zef:ver<0.7.4>:auth<github:ugexe>:api<0>
===> # SKIP: No need to build zef:ver<0.7.4>:auth<github:ugexe>:api<0>
===> Testing: zef:ver<0.7.4>:auth<github:ugexe>:api<0>
Testing with plugin: Zef::Service::Shell::prove+{<anon|1>}
t/00-load.t ....................... ok
t/distribution-depends-parsing.t .. ok
t/identity.t ...................... ok
t/utils-filesystem.t .............. ok
All tests successful.
Files=4, Tests=16, 46 wallclock secs ( 0.05 usr +  0.00 sys =  0.05 CPU)
Result: PASS
===> Testing [OK] for zef:ver<0.7.4>:auth<github:ugexe>:api<0>
===> Installing: zef:ver<0.7.4>:auth<github:ugexe>:api<0>
Installing Zef::Fetch for zef
Installing Zef::Service::Shell::tar for zef
Installing Zef::Build for zef
Installing Zef::Service::Shell::PowerShell::download for zef
Installing Zef::Service::Shell::prove for zef
Installing Zef::Extract for zef
Installing Zef::Utils::FileSystem for zef
Installing Zef::Service::Shell::p5tar for zef
Installing Zef::Repository::Ecosystems for zef
Installing Zef::Test for zef
Installing Zef::Service::Shell::PowerShell::unzip for zef
Installing Zef::Service::Shell::DistributionBuilder for zef
Installing Zef::Service::Shell::Test for zef
Installing Zef::Service::Shell::wget for zef
Installing Zef::Utils::SystemInfo for zef
Installing Zef::Config for zef
Installing Zef::Repository::MetaCPAN for zef
Installing Zef::Repository for zef
Installing Zef::Distribution for zef
Installing Zef for zef
Installing Zef::Service::Shell::LegacyBuild for zef
Installing Zef::Distribution::DependencySpecification for zef
Installing Zef::Client for zef
Installing Zef::Service::Shell::git for zef
Installing Zef::CLI for zef
Installing Zef::Utils::SystemQuery for zef
Installing Zef::Distribution::Local for zef
Installing Zef::Utils::URI for zef
Installing Zef::Service::Shell::PowerShell for zef
Installing Zef::Repository::LocalCache for zef
Installing Zef::Service::FetchPath for zef
Installing Zef::Service::Shell::curl for zef
Installing Zef::Service::TAP for zef
Installing Zef::Service::InstallPM6 for zef
Installing Zef::Identity for zef
Installing Zef::Report for zef
Installing Zef::Install for zef
Installing Zef::Service::P6CReporter for zef
Installing Zef::Service::Shell::unzip for zef
Precompiling AB55A923EA5219C0C4EEC86259A425788E26F74D (Zef)

# never progresses beyond this point

ugexe commented 5 years ago

commit 3ed7e46 seems to be where the deadlock issue starts.

commit 19e075f has no deadlock, but has a new error elsewhere in the CI testing process -- https://ci.appveyor.com/project/ugexe/zef/builds/27150039

@timo ?

ugexe commented 5 years ago

With MVM_SPESH_DISABLE=1 the deadlock goes away
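
A minimal sketch of how this kind of observation can be checked repeatably: run the same command once as a baseline and once per environment toggle, each under a deadline, and record whether it finished, failed, or hung. `MVM_SPESH_DISABLE` is the variable named above; the helper names, the deadline, and the example command in the comment are illustrative assumptions, not part of zef or MoarVM.

```python
import os
import subprocess


def run_with_env(cmd, extra_env, timeout_s):
    """Run cmd with extra environment variables and classify the outcome."""
    env = dict(os.environ)
    env.update(extra_env)
    try:
        proc = subprocess.run(
            cmd,
            env=env,
            timeout=timeout_s,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
        )
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child before re-raising, so nothing lingers.
        return "hang"
    return "ok" if proc.returncode == 0 else "fail"


def compare_toggles(cmd, toggles, timeout_s):
    """Run cmd unmodified, then once with each toggle set to "1"."""
    results = {"(baseline)": run_with_env(cmd, {}, timeout_s)}
    for name in toggles:
        results[name] = run_with_env(cmd, {name: "1"}, timeout_s)
    return results


# Hypothetical usage (command and deadline are assumptions):
# compare_toggles(["perl6", "-I.", "bin/zef", "install", "."],
#                 ["MVM_SPESH_DISABLE"], timeout_s=600)
```

A result of `"hang"` for the baseline and `"ok"` for the toggle would match what is reported here. Note that only the direct child is killed on timeout; processes it spawned may need separate cleanup.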

ugexe commented 5 years ago

@timo this appears to be related to your work (linked earlier)

ugexe commented 4 years ago

So without feedback I’m becoming inclined to revert the commits in question. If someone has a good reason why I shouldn’t and also has a hunch on how to fix this then speak up.

timo commented 4 years ago

sorry, i didn't notice that i was pinged on github, i'll have a closer look now

timo commented 4 years ago

i don't see a way for the commit in question to cause deadlock problems; can you hook up a debugger to the deadlocked process and print every thread's backtrace? ideally turning off the JIT in the environment so that backtraces don't get corrupted. that should give a lot of helpful information

ugexe commented 4 years ago

I could probably figure out how to do all of that for a precompiling process on Windows. However, I suspect it will be easier for you to give me suggestions of MoarVM changes to make and test, e.g. switching these two lines https://github.com/MoarVM/MoarVM/blob/3faf1985d277fc53c111464b73aac87f27158e7d/src/moar.c#L394-L397 (I have no reason to believe that is wrong... it's just that mutexes can cause deadlocks, so maybe it should be scrutinized).

timo commented 4 years ago

https://github.com/MoarVM/MoarVM/commit/c5a600799bdd5797b22c3e555f920e6f4c52adee - please try this, among other things it moves initializing that mutex up, but mostly disables anything related to the whole subsystem

ugexe commented 4 years ago

Can you rebase this so it works with perl Configure.pl --gen-moar=c5a600799bdd5797b22c3e555f920e6f4c52adee --gen-nqp --backends=moar ?

https://ci.appveyor.com/project/ugexe/zef/builds/27603566#L74

ugexe commented 4 years ago

@timo ping -- I don't have write access to MoarVM so I can't push up a rebase that I can pass to the Configure.pl command in the appveyor config.

ugexe commented 4 years ago

If I cherry-pick https://github.com/MoarVM/MoarVM/commit/c5a600799bdd5797b22c3e555f920e6f4c52adee and resolve the conflicts, the deadlock goes away.

ugexe commented 4 years ago

diff --git a/src/gc/orchestrate.c b/src/gc/orchestrate.c
index 42a52f45b..96ee115ce 100644
--- a/src/gc/orchestrate.c
+++ b/src/gc/orchestrate.c
@@ -469,7 +469,7 @@ static void run_gc(MVMThreadContext *tc, MVMuint8 what_to_do) {
         data[0] = MVM_load(&tc->instance->gc_seq_number);
         data[1] = start_time / 1000;
         data[2] = (start_time - tc->instance->subscriptions.vm_startup_time) / 1000;
-        data[3] = (end_time - start_time) / 1000;
+        //data[3] = (end_time - start_time) / 1000;
         data[4] = gen == MVMGCGenerations_Both;
         data[5] = tc->gc_promoted_bytes;
         data[6] = MVM_load(&tc->instance->gc_promoted_bytes_since_last_full);

This seems to fix it?

tbrowder commented 4 years ago

Maybe weird line endings?

ugexe commented 4 years ago

CI tests are good on moar HEAD now

AlexDaniel commented 4 years ago

Thank you, @ugexe!

AlexDaniel commented 4 years ago

OK, I'll reopen it in case someone has an idea how to write a test for this.

If you stumble upon this ticket a year later, please just close it.
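
One possible shape for such a test, as a rough sketch rather than a proposal for the Rakudo test suite: run the step in a child process with a deadline and fail with the last line of output if it never exits, so a hang shows up as a normal test failure that names the step it stopped at. The function name, the deadline, and the command are assumptions for illustration.

```python
import subprocess


def assert_completes(cmd, timeout_s):
    """Fail if cmd does not exit successfully within timeout_s seconds."""
    proc = subprocess.Popen(
        cmd,
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,
        text=True,
    )
    try:
        output, _ = proc.communicate(timeout=timeout_s)
    except subprocess.TimeoutExpired:
        # Kill the stuck child, then collect whatever it printed so far.
        proc.kill()
        output, _ = proc.communicate()
        lines = output.strip().splitlines()
        last = lines[-1] if lines else "<no output>"
        raise AssertionError(
            f"no exit within {timeout_s}s; last output line: {last}"
        )
    assert proc.returncode == 0, f"exited with status {proc.returncode}"
    return output
```

Against the log earlier in this thread, a failure message from such a check would end with the `Precompiling ... (Zef)` line. A deadline-based test can only catch the hang when it actually occurs, so it would likely need to run on the affected platform.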