Closed hg closed 2 years ago
I'd disable the test on Windows then, and add an issue to sort this out later.
OK, it's a bit worse than that. I think we're forced to generate a port on our own and then pass it to bar
.
There's a small window of time before bar
is started where the port can be taken by something else, so this test will be flaky and there will be false positive failures.
If Port 0
is used, REMOTEPORT
on foo
's side contains not the bar
server port you'd connect to, but the client port which bar
used to connect to foo
, which is different from the one we expect. This is what I was fighting yesterday and it's not just Windows.
It feels like an accidental omission to cover the case of Port 0
. Why would you need a client port of the remote node? What would you do with that information? Or is it just me being stupid again?
I'd disable the test on Windows then, and add an issue to sort this out later.
OK, it's a bit worse than that. I think we're forced to generate a port on our own and then pass it to
bar
.There's a small window of time before
bar
is started where the port can be taken by something else, so this test will be flaky and there will be false positive failures.
True. I guess the chance of this might be high when running lots of tests in parallel. And you're also right that manually assigning ports is not ideal either, although it did work relatively well, especially inside containers where you are sure no ports are used anyway.
If
Port 0
is used,REMOTEPORT
onfoo
's side contains not thebar
server port you'd connect to, but the client port whichbar
used to connect tofoo
, which is different from the one we expect. This is what I was fighting yesterday and it's not just Windows.It feels like an accidental omission to cover the case of
Port 0
. Why would you need a client port of the remote node? What would you do with that information? Or is it just me being stupid again?
It is not what you want, but the information in $REMOTEPORT
and $REMOTEADDRESS
is useful for dynamically creating firewall rules for example. So I wouldn't change their behaviour.
You can actually find the port that a tincd that has Port 0
is listening on in its PID file (it's the last item on the line). What's missing is a way for tincctl
to report that information. It should not be that hard to add it, since it's already parsing the PID file and getting the port, but currently it's just discarding it after creating the control socket.
I think scripts.py
is missing an important thing that scripts.test
did: the latter also checked that all the scripts ran in a specific order.
Changelog:
scripts.py
: check that scripts get called in the correct order.scripts.py
on Windows.lint.py
. Since it's doing almost exactly the same what reformat.py
was doing, reformat has been merged into lint. Use lint.py --fix
or the old ninja command (ninja -C build reformat
)..astylerc
for better readability.While everything should be finished now, I'll have access to an unused 32-core server for the next couple of days and would like to loop the test suite there as an additional precaution before marking this PR as ready.
I'll look at #361 now, since c1c5263e8d8d67aec9e06a5b680d956a19d88229 is unlikely to work correctly.
With latest changes, I think it's about as stable as it's going to get.
Hours of meson test --repeat 11000
produced this:
Ok: 175945
Expected Fail: 0
Fail: 55
Unexpected Pass: 0
Skipped: 0
Timeout: 0
54 of them in security.py
, one in invite.py
.
So security.py
has \~0.5% failure rate. Since it relies on timeouts, I don't see how to substantially improve that (increasing them doesn't help).
Very rarely, invite.py
runs into what I believe to be a tinc bug where the author of an invitation doesn't respond to join
, even if you do it manually after the test fails. Both sides just hang with zero output. It's difficult to reproduce, so I'm not reporting anything concrete yet.
Word to the wise: avoid using large values for --repeat
. meson wasn't expecting such folly from its user and gobbled up 12 GBs of RAM closer to the finish line. It may have been fixed in later versions, though.
re #361: close()
(aliased to CloseHandle()
) was replaced with closesocket()
where appropriate, since:
Do not use the CloseHandle function to close a socket. Instead, use the closesocket function, which releases all resources associated with the socket including the handle to the socket object.
Other uses look fine.
Drive-by question: I remember there was some interest in moving away from C. Do you consider slowly moving the project to C++ to be worth it (like GCC has been doing since 2010)? The way they did it IIRC is first making all headers include
able into C++ files, then renaming and fixing C files one by one, and then improving from there.
Although, if we want to keep support for ancient routers, there will be no C++11 for the foreseeable future, since GCC 4.2 is needed for those, and relatively complete C++11 support became available in GCC 4.8.
Drive-by question: I remember there was some interest in moving away from C. Do you consider slowly moving the project to C++ to be worth it (like GCC has been doing since 2010)? The way they did it IIRC is first making all headers
include
able into C++ files, then renaming and fixing C files one by one, and then improving from there.Although, if we want to keep support for ancient routers, there will be no C++11 for the foreseeable future, since GCC 4.2 is needed for those, and relatively complete C++11 support became available in GCC 4.8.
I created #364 so we can discuss things there. I don't think anything before C++11 is desirable.
Probably not. security.py
could see some improvement, but I have no idea how to go about it yet.
I ran some additional tests on local VMs, and it's been fine (not as extensive, though: only 4 CPUs, 1000 iterations on Windows, and 500 on each of the three BSDs and Solaris).
Ok, I've been running 6 test suites in parallel for the last hour (about 2000 invocations in total), and no errors. This definitely was not possible with the old suite. So I'll merge it now and we can do improvements afterwards.
These depend only on the standard library of Python 3.6 or later (used by RHEL 7; other distributions have newer versions). If system Python is missing, it falls back to the one used by meson.
Tests are using a somewhat different approach compared to the current suite. There are no hardcoded ports and
Port 0
is used everywhere, but now we have to do the dance of starting nodes in a particular order and parsing their pidfiles to know the real port they listen on.You have to pay more attention when writing tests, but it's the only reliable solution I could find from extensively testing a dozen ideas that allows us to keep full parallel test execution.
Some other things that were tried
- hardcoding port numbers, like in the current test suite — forces you to run all tests serially, or drop support for `--repeat` (which I don't think is possible to disable, so it's just something you have to know about); and there can still be conflicts with other software. - generating random ports using `random.randint()` with shared port range for all tests (1024 to 65535) — tolerable when running 4 tests in parallel, unusable with 16 or more. - same, but splitting it into smaller unique ranges, one for each test (slightly better than the previous one, not worth it). - same, but using hacks to try to exclude ports typically used for `connect()` (hopefully less conflicts with "client" ports, not much difference in practice). - binding a throwaway socket to port 0 and letting the OS choose a port, then binding tincd to that port (significantly more conflicts than trying ports randomly, [Linux seems][linux] to use a very short port range, and further halves it to leave the other half for "client" ports, and this really shows in practice). - using random ports and binding tincd to randomly generated addresses in `127.0.0.0/8` — good stability and clear test code, but requires configuring the system on everything that's not Linux or Windows. - using a separate server which can be asked by test scripts to pick a port using one of the previous methods (so port allocation happens in a single place instead of spreading it throughout tests — it can make sure it doesn't give the same port twice) — not bad in practice, but is so convoluted I didn't push it past a simple prototype. - locking each test on a separate lock file to prevent it from running multiple copies in parallel. - and a few other hacks.If you'd rather keep tests simpler, I don't see any other solution than to disable parallel test execution completely on everything other than Linux and Windows (where we can use the full subnet
127.0.0.0/8
and randomly generate both addresses forListenAddress
and ports forPort
), or write an alternative test harness that doesn't run the same test multiple times in parallel (you can't disable this with meson).Basic architecture
I intentionally kept things simple so there are fewer layers to unwrap if it gets in the way in more complicated tests.
tail
+grep
)Test
), and use it to get instances of a class namedTinc
(the context remembersTinc
s it has issued and uses them as described below)Tinc
(it's a thin wrapper aroundsubprocess
), and start issuing commands to them using one of these methods:tinc()
— start tinc with the specified arguments and returnPopen
tincd()
— same, but for tincdcmd()
— a convenience wrapper aroundtinc()
which can pass text snippets intostdin
, checks the return code for you, and returns(stdout, stderr)
with
statement), it stops every tinc and tincd started from it, andkill
s everything that refuses to stop.If you initialize instances of the
Tinc
class directly, they won't be stopped automatically.Tinc
can also register tincd scripts and returns objects that let you wait on notifications from those scripts. Only one event is needed now (and thus only one is supported), but it's easy to add more.There are some assertions (which should be used instead of the built-in
assert
because it doesn't print values on failure), a global logger, and a few helper functions for more complicated commands likeexchange
so the logic isn't repeated in several places.Known issues
scripts.py
Port checks were dumbed down to checking that a non-empty value was passed. tincd on Windows behaves really strange with dynamically allocated ports on both sides of the connection, and I couldn't make it work after wasting half a day on it.
We have three choices here:
Port 0
(and get intermittent test failures because of port conflicts)I don't think any other test functionality was lost.
invite-offline.test
andinvite-online.test
are rolled intoinvite.py
.Crashes on Alpine
Example.
SIGSEGV
in a Python process that's running the test. Since there's no unsafe code in tests, I suspect this is because of musl.These are pretty rare, just wanted you to be aware of them.
Failures on macOS
Same issue we have with the current test suite:
although tinc should definitely be able to connect to tincd because it is only started after we get a go-ahead from
tinc-up
.So it's likely a tinc bug. I don't have access to macOS, and it's not easy to trigger to debug on CI machines. Can't help much here.
c1c5263e8d8d67aec9e06a5b680d956a19d88229 is a quick and dirty fix for #361. Don't pay much attention to it. I'll look into it while you're reviewing tests.