gsliepen / tinc

a VPN daemon
http://tinc-vpn.org/

Rewrite tests in Python for better OS compatibility #362

Closed. hg closed this 2 years ago.

hg commented 2 years ago

The tests depend only on the standard library of Python 3.6 or later (the version used by RHEL 7; other distributions have newer versions). If the system Python is missing, the build falls back to the Python used by meson.

The tests take a somewhat different approach than the current suite. There are no hardcoded ports: `Port 0` is used everywhere, but in exchange we have to do the dance of starting nodes in a particular order and parsing their pidfiles to learn the real port each one listens on.
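For illustration, a minimal sketch of the pidfile half of that dance, assuming the pidfile format described later in this thread (the listening port is the last item on the line); `wait_for_port` is a hypothetical helper, not the suite's actual API:

```python
import time
from pathlib import Path

def wait_for_port(pidfile: Path, timeout: float = 10.0) -> int:
    """Poll a freshly started tincd's pidfile and return its real port."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if pidfile.exists():
            items = pidfile.read_text().split()
            if items and items[-1].isdigit():
                return int(items[-1])  # the port is the last item on the line
        time.sleep(0.05)
    raise TimeoutError(f"no usable pidfile at {pidfile}")
```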

You have to pay more attention when writing tests, but after extensively testing a dozen ideas, this is the only reliable solution I found that keeps full parallel test execution.

Some other things that were tried:

- hardcoding port numbers, like in the current test suite — forces you to run all tests serially, or to drop support for `--repeat` (which I don't think is possible to disable, so it's just something you have to know about); and there can still be conflicts with other software.
- generating random ports using `random.randint()` with a shared port range (1024 to 65535) for all tests — tolerable when running 4 tests in parallel, unusable with 16 or more.
- the same, but splitting the range into smaller unique ranges, one per test — slightly better than the previous one, not worth it.
- the same, but with hacks that try to exclude ports typically used for `connect()` — hopefully fewer conflicts with "client" ports, but not much difference in practice.
- binding a throwaway socket to port 0 and letting the OS choose a port, then binding tincd to that port (see the sketch below) — significantly more conflicts than trying ports randomly; Linux seems to use a very short port range, and further halves it to leave the other half for "client" ports, and this really shows in practice.
- using random ports and binding tincd to randomly generated addresses in `127.0.0.0/8` — good stability and clear test code, but requires configuring the system on everything that's not Linux or Windows.
- using a separate server which test scripts can ask to pick a port using one of the previous methods (so port allocation happens in a single place instead of being spread throughout the tests, and it can make sure it never hands out the same port twice) — not bad in practice, but so convoluted I didn't push it past a simple prototype.
- locking each test on a separate lock file to prevent it from running multiple copies of itself in parallel.
- and a few other hacks.
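The throwaway-socket idea is worth sketching anyway, because it shows exactly where the race lives (plain sockets, no tinc involved):

```python
import socket

def throwaway_port() -> int:
    """Ask the OS for a free port by binding to port 0, then release it."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.bind(("127.0.0.1", 0))
        return sock.getsockname()[1]

# Racy by design: between this call returning and tincd binding the port,
# any other process (or another parallel test) can grab it.
print(throwaway_port())
```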

If you'd rather keep the tests simpler, I don't see any other solution than to completely disable parallel test execution on everything other than Linux and Windows (where we can use the full `127.0.0.0/8` subnet and randomly generate both addresses for `ListenAddress` and ports for `Port`), or to write an alternative test harness that doesn't run the same test multiple times in parallel (meson gives you no way to disable that).
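For reference, a sketch of that Linux/Windows-only scheme (the names here are mine, not the suite's):

```python
import random

def random_listen_endpoint():
    """Pick a random loopback address in 127.0.0.0/8 plus a random port.

    A collision now requires both values to match, which is why this was
    the most stable variant; the catch is that only Linux and Windows
    accept the whole 127.0.0.0/8 range without extra system configuration.
    """
    addr = "127.{}.{}.{}".format(
        random.randint(0, 255), random.randint(0, 255), random.randint(1, 254)
    )
    return addr, random.randint(1024, 65535)
```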

Basic architecture

I intentionally kept things simple so there are fewer layers to unwrap when something gets in the way in more complicated tests.

  1. meson starts the test script
  2. test imports the required pieces of the testing library
  3. it starts a notification server which receives event notifications from tincd scripts (the same idea we had in the old test suite, but using UNIX sockets/Windows named pipes instead of tail + grep; a minimal sketch follows this list)
  4. you create a test context (a class named `Test`) and use it to get instances of a class named `Tinc` (the context remembers the `Tinc` instances it has issued and uses them as described below)
  5. you initialize the nodes using that `Tinc` (a thin wrapper around subprocess) and start issuing commands to them using one of these methods:
    • `tinc()` — start tinc with the specified arguments and return a Popen
    • `tincd()` — same, but for tincd
    • `cmd()` — a convenience wrapper around `tinc()` which can pass text snippets to stdin, checks the return code for you, and returns (stdout, stderr)
    • and a few more; it's all commented.
  6. when a test context exits (leaves the with statement), it stops every tinc and tincd started from it, and kills everything that refuses to stop.

If you initialize instances of the `Tinc` class directly, they won't be stopped automatically.
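A minimal sketch of the notification mechanism from step 3, under assumed names (the real server also has to speak Windows named pipes, which this omits):

```python
import asyncio
import os
import tempfile

events = asyncio.Queue()

async def handle(reader, writer):
    # a tincd script connects and writes a single event name
    events.put_nowait((await reader.readline()).decode().strip())
    writer.close()

async def main(path):
    server = await asyncio.start_unix_server(handle, path=path)
    # a tinc-up script could now run: printf 'tinc-up\n' | nc -U "$path"
    print("got event:", await events.get())
    server.close()

path = os.path.join(tempfile.mkdtemp(), "notify.sock")
loop = asyncio.get_event_loop()  # 3.6-compatible, matching the suite's floor
loop.run_until_complete(main(path))
```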

Tinc can also register tincd scripts, returning objects that let you wait for notifications from those scripts. Only one event is needed right now (and thus only one is supported), but it's easy to add more.
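Putting the pieces together, a hedged sketch of what a test can look like; the names follow the description above, but the exact signatures are my assumptions, not the suite's verbatim API:

```python
from testlib.test import Test  # assumed import path; testlib/ is real (see the tracebacks below)

with Test("demo") as context:
    foo = context.get_tinc()                   # hypothetical accessor returning a Tinc
    foo.cmd("init", foo.name)                  # cmd() checks the exit code for us
    notification = foo.add_script("tinc-up")   # register a tincd script
    foo.tincd()                                # start the daemon (returns a Popen)
    notification.wait()                        # block until tinc-up reports in
# leaving the with-block stops foo's tincd, killing it if it refuses to stop
```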

There are assertion helpers (use them instead of the built-in `assert`, which doesn't print the compared values on failure), a global logger, and a few helper functions for more complicated commands like `exchange`, so that logic isn't repeated in several places.
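The assertion helpers look roughly like this (the body is reconstructed from the testlib/check.py traceback quoted later in this thread):

```python
# testlib/check.py-style helper: unlike a bare assert, a failure
# reports both the expected and the actual value.
def equals(expected, actual):
    if expected != actual:
        raise ValueError(f'expected "{expected}", got "{actual}"')
```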

Known issues

scripts.py

Port checks were dumbed down to checking that a non-empty value was passed. tincd on Windows behaves really strangely with dynamically allocated ports on both sides of the connection, and I couldn't make it work after spending half a day on it.

We have three choices here:

I don't think any other test functionality was lost. invite-offline.test and invite-online.test are rolled into invite.py.

Crashes on Alpine

Example.

SIGSEGV in the Python process that's running the test. Since there's no unsafe code in the tests, I suspect this is because of musl.

These are pretty rare; I just wanted you to be aware of them.

Failures on macOS

Same issue we have with the current test suite:

```
Cannot read greeting from control socket: No such file or directory
```

although tinc should definitely be able to connect to tincd, since tinc is only started after we get the go-ahead from tinc-up.

So it's likely a tinc bug. I don't have access to macOS, and the failure isn't easy to trigger, which makes it hard to debug on CI machines. I can't help much here.


c1c5263e8d8d67aec9e06a5b680d956a19d88229 is a quick and dirty fix for #361; don't pay much attention to it. I'll look into it while you're reviewing the tests.

hg commented 2 years ago

> I'd disable the test on Windows then, and add an issue to sort this out later.

OK, it's a bit worse than that. I think we're forced to generate a port on our own and then pass it to bar.

There's a small window of time before bar is started where the port can be taken by something else, so this test will be flaky and there will be false positive failures.

If `Port 0` is used, `REMOTEPORT` on foo's side contains not the bar server port you'd connect to, but the client port which bar used to connect to foo, which is different from the one we expect. This is what I was fighting yesterday, and it's not just Windows.
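To make the distinction concrete, the same effect with plain sockets, no tinc involved:

```python
import socket

# "bar" listens on its own server port...
bar_listener = socket.socket()
bar_listener.bind(("127.0.0.1", 0))
bar_listener.listen()
bar_server_port = bar_listener.getsockname()[1]

# ...but dials out to "foo" from a separate socket,
# which gets an ephemeral source port.
foo = socket.socket()
foo.bind(("127.0.0.1", 0))
foo.listen()

bar_out = socket.socket()
bar_out.connect(foo.getsockname())
conn, addr = foo.accept()

# addr[1] is the analogue of $REMOTEPORT: bar's ephemeral client port,
# not bar_server_port, the port you could actually connect back to.
print("client port", addr[1], "!= server port", bar_server_port)
```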

It feels like an accidental omission not to cover the case of `Port 0`. Why would you need the client port of the remote node? What would you do with that information? Or is it just me being stupid again?

gsliepen commented 2 years ago

>> I'd disable the test on Windows then, and add an issue to sort this out later.
>
> OK, it's a bit worse than that. I think we're forced to generate a port on our own and then pass it to bar.
>
> There's a small window of time before bar is started where the port can be taken by something else, so this test will be flaky and there will be false positive failures.

True. I guess the chance of this might be high when running lots of tests in parallel. And you're also right that manually assigning ports is not ideal either, although it did work relatively well, especially inside containers, where you can be sure no ports are in use anyway.

> If `Port 0` is used, `REMOTEPORT` on foo's side contains not the bar server port you'd connect to, but the client port which bar used to connect to foo, which is different from the one we expect. This is what I was fighting yesterday, and it's not just Windows.
>
> It feels like an accidental omission not to cover the case of `Port 0`. Why would you need the client port of the remote node? What would you do with that information? Or is it just me being stupid again?

It is not what you want here, but the information in `$REMOTEPORT` and `$REMOTEADDRESS` is useful for dynamically creating firewall rules, for example. So I wouldn't change their behaviour.

You can actually find the port that a tincd with `Port 0` is listening on in its PID file (it's the last item on the line). What's missing is a way for tincctl to report that information. It shouldn't be that hard to add, since tincctl is already parsing the PID file and getting the port; currently it just discards it after creating the control socket.

gsliepen commented 2 years ago

I think scripts.py is missing an important thing that scripts.test did: the latter also checked that all the scripts ran in a specific order.

hg commented 2 years ago

Changelog:

While everything should be finished now, I'll have access to an unused 32-core server for the next couple of days and would like to loop the test suite there as an additional precaution before marking this PR as ready.

I'll look at #361 now, since c1c5263e8d8d67aec9e06a5b680d956a19d88229 is unlikely to work correctly.

hg commented 2 years ago

With the latest changes, I think it's about as stable as it's going to get.

Tests were run on:

- a 32-core Intel-something, `--num-processes` at its default of 32
- EL8, kernel 4.18.0-348.12.2.el8_5.x86_64
- `gcc` 8.5.0
- `lz4-devel` 1.8.3
- `lzo-devel` 2.08
- `meson` 0.55.3
- `ncurses-devel` 6.1
- `pkgconf` 1.4.2
- `readline-devel` 7.0
- `zlib-devel` 1.2.11

Hours of `meson test --repeat 11000` produced this:

```
Ok:                 175945
Expected Fail:      0
Fail:               55
Unexpected Pass:    0
Skipped:            0
Timeout:            0
```

54 of them were in security.py, one in invite.py.

Some details:

All `security.py` failures look the same:

```
ERROR:security.py:Uncaught exception
Traceback (most recent call last):
  File "/home/user/tinc/test/integration/security.py", line 128, in <module>
    loop.run_until_complete(run_tests(context))
  File "/usr/lib64/python3.6/asyncio/base_events.py", line 484, in run_until_complete
    return future.result()
  File "/home/user/tinc/test/integration/security.py", line 115, in run_tests
    await test_id_timeout(foo)
  File "/home/user/tinc/test/integration/security.py", line 45, in test_id_timeout
    data = await send(foo.port, "0 bar 17.7", delay=TIMEOUT * 1.5)
  File "/home/user/tinc/test/integration/security.py", line 39, in send
    raise RuntimeError("test should not have reached this line")
RuntimeError: test should not have reached this line
```

and for the other one (I've seen this two or three times during 16-17k test runs in total, so it's pretty difficult to catch):

```
INFO:invite.py:join second node with localhost:33583/SvRsLgU7son8FIsLTEdwVj1BgtO9j0m0BsoZlFTPWCiSQy3q
DEBUG:invite.py:starting tinc rnmig0il6q: "/home/user/tinc/build/src/tinc --net rnmig0il6q --config /home/user/tinc/build/test/integration/wd/invite.py/data/rnmig0il6q --pidfile /home/user/tinc/build/test/integration/wd/invite.py/data/rnmig0il6q/pid join localhost:33583/SvRsLgU7son8FIsLTEdwVj1BgtO9j0m0BsoZlFTPWCiSQy3q"
DEBUG:invite.py:tinc rnmig0il6q: PID 1460029, in "None", want code 0
DEBUG:invite.py:tinc rnmig0il6q: code 1, out "", err "Both netname and configuration directory given, using the latter...
Connected to localhost port 33583...
Peer has an invalid key!
dehGocM/aONofz70p4IGYNqw6jCLF6j0fQEgNZ28pRJ
"
…
ERROR:invite.py:Uncaught exception
Traceback (most recent call last):
  File "/home/user/tinc/test/integration/invite.py", line 89, in <module>
    run_invite_test(context, start_before_invite=False)
  File "/home/user/tinc/test/integration/invite.py", line 40, in run_invite_test
    bar.cmd("join", foo_invite)
  File "/home/user/tinc/test/integration/testlib/proc.py", line 225, in cmd
    check.equals(code, res)
  File "/home/user/tinc/test/integration/testlib/check.py", line 26, in equals
    raise ValueError(f'expected "{expected}", got "{actual}"')
ValueError: expected "0", got "1"
```

So security.py has a ~0.5% failure rate. Since it relies on timeouts, I don't see how to substantially improve that (increasing them doesn't help).
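For context, the failing check has roughly this shape (loosely reconstructed from the traceback above; the real `send()` surely differs in its details): connect, sleep past tincd's handshake timeout, then verify that tincd has dropped the connection. Under heavy parallel load the sleep and the subsequent I/O can themselves be delayed, which is the kind of slack no timeout value can fully absorb:

```python
import asyncio

TIMEOUT = 5  # assumed handshake timeout for this sketch

async def send(port, data, delay):
    # connect, then stall well past tincd's handshake timeout
    reader, writer = await asyncio.open_connection("127.0.0.1", port)
    await asyncio.sleep(delay)
    # by now tincd should have hung up; if our stale handshake line still
    # draws a response, the test declares failure
    writer.write((data + "\n").encode())
    await writer.drain()
    if await reader.read(1):
        raise RuntimeError("test should not have reached this line")
    writer.close()
```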

Very rarely, invite.py runs into what I believe to be a tinc bug where the node that created an invitation doesn't respond to `join`, even if you retry manually after the test fails. Both sides just hang with zero output. It's difficult to reproduce, so I'm not reporting anything concrete yet.

Word to the wise: avoid using large values for `--repeat`. meson wasn't expecting such folly from its users and gobbled up 12 GB of RAM closer to the finish line. It may have been fixed in later versions, though.


re #361: `close()` (aliased to `CloseHandle()`) was replaced with `closesocket()` where appropriate, since:

> Do not use the CloseHandle function to close a socket. Instead, use the closesocket function, which releases all resources associated with the socket including the handle to the socket object.

Other uses look fine.

hg commented 2 years ago

Drive-by question: I remember there was some interest in moving away from C. Do you consider slowly moving the project to C++ to be worth it (like GCC has been doing since 2010)? The way they did it, IIRC, was first making all headers includable from C++ files, then renaming and fixing C files one by one, and then improving from there.

Although, if we want to keep support for ancient routers, there will be no C++11 for the foreseeable future, since GCC 4.2 is needed for those, and relatively complete C++11 support only became available in GCC 4.8.

gsliepen commented 2 years ago

> Drive-by question: I remember there was some interest in moving away from C. Do you consider slowly moving the project to C++ to be worth it (like GCC has been doing since 2010)? The way they did it, IIRC, was first making all headers includable from C++ files, then renaming and fixing C files one by one, and then improving from there.
>
> Although, if we want to keep support for ancient routers, there will be no C++11 for the foreseeable future, since GCC 4.2 is needed for those, and relatively complete C++11 support only became available in GCC 4.8.

I created #364 so we can discuss things there. I don't think anything before C++11 is desirable.

hg commented 2 years ago

Probably not. security.py could see some improvement, but I have no idea how to go about it yet.

I ran some additional tests on local VMs, and it's been fine (not as extensive, though: only 4 CPUs, 1000 iterations on Windows, and 500 on each of the three BSDs and Solaris).

gsliepen commented 2 years ago

Ok, I've been running 6 test suites in parallel for the last hour (about 2000 invocations in total), with no errors. This was definitely not possible with the old suite. So I'll merge it now, and we can make improvements afterwards.