rnpgp / rnp

RNP: high performance C++ OpenPGP library used by Mozilla Thunderbird
https://www.rnpgp.org
Other
194 stars 55 forks source link

Failure in Windows build in master #1104

Closed ronaldtse closed 3 years ago

ronaldtse commented 4 years ago

Description

In the master build the Windows build seems to fail by timeout; i.e.

[736]
+ cp /d/a/rnp/rnp/cmake/CTestCostData.txt /home/runneradmin/local-builds/rnp-build/Testing/Temporary
+ ctest -j2 -R '.*' --output-on-failure
Test project C:/hostedtoolcache/windows/Ruby/2.5.8/x64/msys64/home/runneradmin/local-builds/rnp-build
        Start 156: setupTestData
  1/166 Test #156: setupTestData .................................................................   Passed    0.24 sec
        Start 160: cli_tests-Encryption
        Start 159: cli_tests-Compression

Steps to Reproduce

  1. Build: https://github.com/rnpgp/rnp/runs/605970058?check_suite_focus=true
ni4 commented 4 years ago

@ronaldtse Looks like it's something with GHA (see comment https://github.com/rnpgp/ruby-rnp/pull/67#issuecomment-616207654). At least recently merged PR #1097 didn't make any windows-related changes.

ronaldtse commented 4 years ago

Got it, thanks! Will look out for the fix to come...

ronaldtse commented 4 years ago

This is the latest failure: https://github.com/rnpgp/rnp/actions/runs/91859235

D:/a/rnp/rnp/src/lib/rnp.cpp: In function 'rnp_result_t key_to_json(json_object*, rnp_key_handle_t, uint32_t)':
437
D:/a/rnp/rnp/src/lib/rnp.cpp:6324:70: error: 'TRUE' was not declared in this scope
438
 6324 |     json_object *jsorevoked = json_object_new_boolean(key->revoked ? TRUE : FALSE);
439
      |                                                                      ^~~~
440
D:/a/rnp/rnp/src/lib/rnp.cpp:6324:77: error: 'FALSE' was not declared in this scope
441
 6324 |     json_object *jsorevoked = json_object_new_boolean(key->revoked ? TRUE : FALSE);
442
      |                                                                             ^~~~~
443
make[2]: *** [src/lib/CMakeFiles/librnp.dir/build.make:551: src/lib/CMakeFiles/librnp.dir/rnp.cpp.obj] Error 1
444
make[2]: Leaving directory '/home/runneradmin/local-builds/rnp-build'
445
make[1]: *** [CMakeFiles/Makefile2:1215: src/lib/CMakeFiles/librnp.dir/all] Error 2
446
make[1]: *** Waiting for unfinished jobs....
447
ni4 commented 4 years ago

Okay, it seems I reproduced this hang locally. Will take a look...

dewyatt commented 4 years ago

Okay, it seems I reproduced this hang locally. Will take a look...

That's good news, maybe we don't have to wait for it to resolve itself after all. I wonder if it's hanging reading from a pipe or something. It's odd we haven't really changed anything though.

ni4 commented 4 years ago

@dewyatt I think it could be something updated in ctest or google test. MinGW uses latest ones (and it's a pain to use older versions, since those are not always available). Will investigate further.

ni4 commented 4 years ago

Update: I tried to downgrade MinGW packages without luck. Ended up attaching lldb to the hanged process, giving no callstack and hang in WaitForMultipleObjects(). So I think it is something in MinGW/MSYS core (search give some comparable problem in summer 2019). I will give up it for a while and probably get back to it afterwards.

dewyatt commented 4 years ago

So I tried having a go at this too. I can see rnp.exe and rnpkeys.exe both hang after main(). Some observations:

I'm not sure if it's a bug in winpthread's C++11 impl, or in botan's thread pool, or with our configuration.

I think some workarounds would be:

I also found at least one unrelated windows-specific bug, I'll do a PR for that next week.

@ni4 Are your findings looking similar?

ni4 commented 4 years ago

@dewyatt Thanks for your investigations. Actually I didn't suspect Botan update as a possible source of problems. Actually I was unable to get back to the case where things work well, so will try it tomorrow.

ni4 commented 4 years ago

@dewyatt I tracked down Botan to the minimalistic test which hangs when run alone:

TEST_F(rnp_tests, test_botan_hang_rsa)
{
    botan_rng_t rng = NULL;
    assert_int_equal(botan_rng_init(&rng, NULL), 0);

    botan_privkey_t rsa_key = NULL;
    assert_int_equal(botan_privkey_create(&rsa_key, "RSA", "1024", rng), 0);
    /* call below hangs the test if run with '1' as third parameter */
    assert_int_equal(botan_privkey_check_key(rsa_key, rng, 1), 0);
    botan_privkey_destroy(rsa_key);
    botan_rng_destroy(rng);
}

Internally it uses PK_Signer/PK_Verifier to test whether key is able to sign, and most likely this is the source of hangs (signing with RSA hangs as well).

Since now I need to focus on other things, what about to downgrade Botan in CI to the 2.13, and later get back to this (probably, creating issue in Botan's GitHub)?

dewyatt commented 4 years ago

@ni4 Confirmed, that makes sense since the thread pool is utilized in rsa_private_op (I suspect decryption would hang too).

Since now I need to focus on other things, what about to downgrade Botan in CI to the 2.13

Sure we should be able to do that. I'll do a PR today.

and later get back to this (probably, creating issue in Botan's GitHub)?

I did some brief testing and wasn't able to reproduce the hang with your minimal test when building with MSVC (with threads enabled), only with mingw-w64+winpthreads. So I lean towards it not being a bug in botan itself, but still maybe botan's configure.py can disable threads for this platform.

ni4 commented 4 years ago

To save it for later - I think this line from Botan's release notes introduced the hang: Use the library thread pool instead of a new thread for RSA computations, improving signature performance by up to 20%. (GH #2257).

dewyatt commented 4 years ago

To save it for later - I think this line from Botan's release notes introduced the hang: Use the library thread pool instead of a new thread for RSA computations, improving signature performance by up to 20%. (GH #2257).

Right that's the commit I linked to earlier - https://github.com/randombit/botan/commit/d1318e6f3e86cbf0456f0cd7c003f17b1c4200ad

ni4 commented 4 years ago

@dewyatt wow, somehow mislooked that, sorry, my fault.

ni4 commented 3 years ago

Closing this since it was fixed via #1371 and https://github.com/msys2/MINGW-packages/pull/7640