neutrinolabs / xrdp

xrdp: an open source RDP server
http://www.xrdp.org/
Apache License 2.0
5.75k stars 1.73k forks source link

Timeouts in unit tests - test_gen_key_xrdp1, test_bitmap_load__max_width_{zoom,scale}__ok #2213

Closed elboulangero closed 10 months ago

elboulangero commented 2 years ago

Hi, recently xrdp 0.19 was uploaded to Debian. Some tests timed out on some architectures: https://buildd.debian.org/status/package.php?p=xrdp&suite=sid

armhf architecture - test_gen_key_xrdp1 timed out

test_ssl_calls.c:340:E:ssl_calls:test_gen_key_xrdp1:0: (after this point) Test timeout expired

These tests were introduced in v0.9.19.

From what I see, this test generates a RSA key, so it's very possible (and legitimate) that it times out, I suppose. What about using tcase_set_timeout for this test, and set it to something real long, like 1 minute? It seems that by default, the timeout is 4 seconds.

Complete build logs at https://buildd.debian.org/status/fetch.php?pkg=xrdp&arch=armhf&ver=0.9.19-1&stamp=1648811785&raw=0

alpha architecture - test_bitmap_loadmaxwidth{zoom,scale}ok

PASS: test_xrdp 1 - test_bitmap_load.c:xrdp_bitmap_load:test_bitmap_load__with_invalid_image__fail: Passed
PASS: test_xrdp 2 - test_bitmap_load.c:xrdp_bitmap_load:test_bitmap_load__4_bit__ok: Passed
PASS: test_xrdp 3 - test_bitmap_load.c:xrdp_bitmap_load:test_bitmap_load__8_bit__ok: Passed
PASS: test_xrdp 4 - test_bitmap_load.c:xrdp_bitmap_load:test_bitmap_load__24_bit__ok: Passed
FAIL: test_xrdp 5 - test_bitmap_load.c:xrdp_bitmap_load:test_bitmap_load__max_width_zoom__ok: Test timeout expired
PASS: test_xrdp 6 - test_bitmap_load.c:xrdp_bitmap_load:test_bitmap_load__max_height_zoom__ok: Passed
PASS: test_xrdp 7 - test_bitmap_load.c:xrdp_bitmap_load:test_bitmap_load__min_zoom__ok: Passed
FAIL: test_xrdp 8 - test_bitmap_load.c:xrdp_bitmap_load:test_bitmap_load__max_width_scale__ok: Test timeout expired
PASS: test_xrdp 9 - test_bitmap_load.c:xrdp_bitmap_load:test_bitmap_load__max_height_scale__ok: Passed
PASS: test_xrdp 10 - test_bitmap_load.c:xrdp_bitmap_load:test_bitmap_load__min_scale__ok: Passed
PASS: test_xrdp 11 - test_bitmap_load.c:xrdp_bitmap_load:test_bitmap_load__not_4_pixels_wide_4_bit__ok: Passed
PASS: test_xrdp 12 - test_bitmap_load.c:xrdp_bitmap_load:test_bitmap_load__not_4_pixels_wide_8_bit__ok: Passed
PASS: test_xrdp 13 - test_bitmap_load.c:xrdp_bitmap_load:test_bitmap_load__not_4_pixels_wide_24_bit__ok: Passed

These tests were introduced in v0.9.18, and Debian jumped straight from v0.0.17 to v0.0.19, so we hit those timeouts only now.

I'm not sure, but it seems that the Alpha architecture is long dead, so it's not super important. Still, it wouldn't take much to fix those tests, just add a tcase_set_timeout, I guess.

Complete build logs at https://buildd.debian.org/status/fetch.php?pkg=xrdp&arch=alpha&ver=0.9.19-1&stamp=1648818829&raw=0

Patch is on the way

Please tell me what you think, if you're ok with tcase_set_timeout, how long a timeout you'd like, and I'll submit a patch asap. Thanks!

EDIT: That was straightforward so I already opened a merge request: https://github.com/neutrinolabs/xrdp/pull/2214

matt335672 commented 2 years ago

Hi @elboulangero

Great you're looking at this. We've added quite a few tests recently, including the failing ones above.

It's always possible the timeouts are due to coding errors of course, or even code which could do with being optimised. On my x86_64 machine (Old 3rd Gen Xeon E3), the common tests including the RSA test take 150ms to run and the bitmap tests take 500ms to run. The RSA test in particular seems way out - I can't see my old machine being 40 times quicker than an Alpha, unless you've got a particularly old one there.

Have you tried the timeouts for the Alpha? Do they work?

elboulangero commented 2 years ago

Hello @matt335672 ! All I tried so far is that the PR #2214 builds (no compilation error). I don't know yet if it solves the timeouts, as I can only test locally on x86_64 (ie. my own machine). However what I can do is try to push this patch in Debian, so that a new version of the Debian package gets built on all the builders for all the architectures. It's probably the simpler way to see if the patch works, so I'm going to do that and submit the patch to the Debian maintainer.

the common tests including the RSA test take 150ms to run and the bitmap tests take 500ms to run

The time it takes for the tests is not only related to the hardware, but also how busy the machine is. The machines that build Debian packages would typically build and test several packages at once, so the tests might take much longer due to that. Also for the RSA key generation, I think the time it takes to generate a key also depends on the entropy available in the system. If there's not enough entropy, I guess the call just waits until there's enough.

That's why I proposed to set a very long timeout for this one (60 seconds).

Have you tried the timeouts for the Alpha? Do they work?

Let me see if the Debian maintainer wants to accept this patch.

matt335672 commented 2 years ago

That sounds like a sensible way forward. If we've still got a problem, we can consider other ways to look into it.

I'm not super keen on increasing the timeout for all the tests in the suite however, as it may mask other problems. We can always move these particular tests to their own suite with their own timeouts if they are slow on Alpha.

elboulangero commented 2 years ago

I'm not super keen on increasing the timeout for all the tests in the suite however, as it may mask other problems

So just to clarify: that's not what I propose either. The PR #2214 is about increasing timeouts for some test cases. Each test case might contain more than one test. Apparently it's not possible to increase the timeout per test: the function tcase_set_timeout() acts on all the tests within the test case.

Cf https://libcheck.github.io/check/doc/check_html/check_4.html#Test-Timeouts:

The timeout for a specific test case, which may contain multiple unit tests, can be changed with the tcase_set_timeout() function.

Also of interest in this doc:

The default timeout used for all test cases can be changed with the environment variable CK_DEFAULT_TIMEOUT

It's good to know, but it's not what I propose. Since only a few particular tests have been seen to fail, I think it's a better approach to target those tests in particular.

Anyway, I'll let you know if/when the patch is accepted in Debian, and if it fixes the issues. Thanks for your quick feedback!

matt335672 commented 2 years ago

Sorry @elboulangero - I'm getting my terms mixed up. I think we need to separate out the unit tests which may take longer into separate cases, rather than increasing the timeouts for all the cases we currently have. So I think we're really talking about the same thing.

elboulangero commented 2 years ago

@matt335672 I understand your point now, thanks for clarifying! Indeed we're on the same line

matt335672 commented 10 months ago

[Copied from #2214]

Some more analysis.

At the moment, all the SSL calls are lumped together in one test suite which has a common timeout.

I can reproduce this by running the test suite under valgrind to simulate a slow machine. What I'm finding is that pretty much all the time is being taken up in ssl_gen_key_xrdp1(). Drilling into this, all the time is being generated in the call to EVP_PKEY_generate() (i.e. in the SSL library call).

I've got a Xeon E3-1275 v3 which is about a decade old now. On that, the call to EVP_PKEY_generate() normally takes 150ms, but amazingly takes 15-30 seconds under valgrind. The next slowest test under valgrind in this suite is test_sha1_hash_ok() at 112ms.

I'll put together a more nuanced version of this PR to split the test case up a bit and add an extended timeout to the ssl_gen_key_xrdp1() case only.