matrix-org / mjolnir

A moderation tool for Matrix
Apache License 2.0
330 stars 56 forks source link

Tests should not be fragile glue. #286

Open Gnuxie opened 2 years ago

Gnuxie commented 2 years ago

Last week we had a systemic CI failure because of https://github.com/matrix-org/mjolnir/issues/281 and several tests (abuseReportTest, makeAdminCommandTest, roomMemberManagerTest (for the since command)) failed in very unclear ways.

This was because not only did the rely on the implicit behaviour of Synapse making new rooms public, but the event listeners further along in the tests that are used to detect certain events doing a round trip (e.g. power level changes after invoking a command) also relied on an "implicit temporal link". I don't know how to describe this formally, but essentially we weren't waiting for the event listeners to detect the round trip before carrying out seemingly unrelated assertions. The general quality of the tests involved was also just poor for straight forward reasons (massive lexical scope, masses of imperative glue code)

the since command was by far the worst for this and I was close to reverting it entirely.

I still don't understand how the roomMemberTest is supposed to work, at all. https://github.com/matrix-org/mjolnir/blob/main/test/integration/roomMembersTest.ts#L286-L295

https://github.com/matrix-org/mjolnir/pull/219/files#diff-4e8bbc9dde21b7b895e0c081d2e3375c8958c51a10442497dfffb677f0d59a1aR1-R107 https://github.com/matrix-org/mjolnir/pull/238 https://github.com/matrix-org/mjolnir/pull/238/files#diff-865e52ce48a7c93d2dd6886810e49e5b01d3d723f06e5c7abae54913579f3014R114 https://github.com/matrix-org/mjolnir/blob/main/test/integration/abuseReportTest.ts#L28-L197

One way of solving some of these problems (especially in regards to less trivial things such as "what happens when a test times out", "how should i wait for Mjolnir to send some event?") would be to create a testing guide that could be referred to vigorously by both the author of tests and the reviewer.

ShadowJonathan commented 2 years ago

Interestingly, the https://github.com/matrix-org/mjolnir/issues/281-related CI failings did not exit the test runner with 1, but let it hang instead. When running a few smaller tests locally, i observed the test runner only exiting after finishing a sync. It is possible that we're not properly cleaning up the test runner after some time, or not have a timeout on some operations, that allows it to hang like this.

For example, it caused some tests to keep running for 6 hours before github terminated the runners themselves.

Gnuxie commented 2 years ago

I believe this was because of loops like this. The loop can just be read more easily as something like:

       while(!roomIds.each(roomId => this.mjolnir.protectedRooms[roomId])) {
           await new Promise(resolve => setTimeout(resolve, 1_000));
       }

basically the protectedRooms would never include all of the roomIds because Mjolnir was never going to be able to join them. When a test times out in mocha, the test doesn't stop running (and I'm not sure whether it's even possible to change that?) which led to the hanging. I think this probably happened with several tests.