matrix-org / sytest

Black-box integration testing for Matrix homeservers
Apache License 2.0
73 stars 55 forks source link

tests/50federation/10query-profile.pl flaky "Gateway Time-out" #362

Open leonerd opened 7 years ago

leonerd commented 7 years ago

10query-profile.pl seems to be somewhat unreliable, sometimes fails with

Running tests/50federation/10query-profile.pl...
  Testing if: Outbound federation can query profile data... FAIL:
 | HTTP Request failed (504 Gateway Time-out) at ./run-tests.pl line 542.
 +----------------------

Moreover, if it fails in this manner it fails immediately claiming that timeout, rather than starting up, waiting for a bit, then times out.

Further investigation shows that if you run that one test on its own it seems to quite reliably pass:

$ ./run-tests-in-tox --dendron=../dendron/bin/dendron --haproxy tests/50federation/10query-profile.pl 
Running tests/50federation/10query-profile.pl...
...
All tests PASSED

Moreover, various combinations of tests before it seem to indicate that it is server-startup timing related. By including or omitting various other parts of the test suite before it, if it manages to get to this test file within 1m40 of starting sytest, this test script always passes. If however it hasn't got there before at least 2m06, then this test script always fails.

This also might suggest why this test seems to be dendron/haproxy-related, at least on my laptop, because it has always taken longer than this time limit to get there (and hence fails) if running via haproxy+postgres, but gets there much sooner (and hence passes) if running via in-memory sqlite.

leonerd commented 7 years ago

A most exciting "smoking gun" from this failure is the hyphen in the spelling of Gateway Time-out. That exact spelling doesn't appear anywhere in sytest, and only appears once of interest in the synapse codebase:

leo@shy:~/src/matrix/synapse [git]
$ findpython | grepin "Gateway Time-out"
./env/lib/python2.7/site-packages/twisted/web/_responses.py:109:    GATEWAY_TIMEOUT: "Gateway Time-out",

Though it could potentially be coming from haproxy also.

I may have a go inserting some long sleeps in places to see if I can trigger it from a minimal test case.

leonerd commented 7 years ago

An interesting appearance of the time 2 minutes (being 120 seconds) appears at

./synapse/federation/federation_client.py:83:            expiry_ms=120 * 1000,

However, having adjusted that to be 5 minutes (300 seconds) it made no difference. So I don't think it's that.

leonerd commented 7 years ago

Additionally, adding a skip guard into the test which just avoids it if it runs later than 2 minutes after server startup just moves the problem; doing that causes 50federation/11query-directory.pl to fail in the same manner, which previously would have passed if this one failed.

leonerd commented 7 years ago

Also this appears to be haproxy-related. If I run in single-process mode without dendron/haproxy but still using postgres, it takes (e.g.) 152 seconds to get to that test but it passes, whereas via haproxy it would have failed even if it got there sooner than that.

leonerd commented 7 years ago

Having added a silly "sleep 2m10" test before the flaky one and reducing it to a minimal test case, I get:

Via haproxy:

$ ./run-tests-in-tox tests/10apidoc/03events-initial.pl tests/50federation/10query-profile.pl  --dendron=../dendron/bin/dendron --haproxy
Running tests/10apidoc/03events-initial.pl...
 # Clearing Pg database synapse on 'localhost'
...
Running tests/50federation/10query-profile.pl...
  Testing if: Sleeping for 2m10... PASS
 # Presuming ability 'can_get_displayname'
 ** Running test after 146.469178915024 seconds from startup
  Testing if: Outbound federation can query profile data... FAIL:
 | HTTP Request failed (504 Gateway Time-out) at ./run-tests.pl line 542.
 +----------------------
 # Presuming ability 'can_set_displayname'
  Testing if: Inbound federation can query profile data... FAIL:
 | fixture failed - HTTP Request failed (504 Gateway Time-out)
 +----------------------
 # Killing synapse servers 

Whereas directly:

$ ./run-tests-in-tox tests/10apidoc/03events-initial.pl tests/50federation/10query-profile.pl 
Running tests/10apidoc/03events-initial.pl...
 # Clearing Pg database synapse on 'localhost'
 ** DBI connect('dbname=sytest_template;host=localhost','leo',...) failed: FATAL:  database "sytest_template" does not exist at lib/SyTest/Homeserver.pm line 88.
 ** NOTICE:  drop cascades to constraint application_services_regex_as_id_fkey on table application_services_regex
 # Generating config for port 8800
 # Starting server for port 8800
 # Connecting to server 8800
 # Connected to server 8800
  Testing if: GET /events initially... PASS
  Testing if: GET /initialSync initially... PASS
Running tests/50federation/10query-profile.pl...
  Testing if: Sleeping for 2m10... PASS
 # Presuming ability 'can_get_displayname'
 ** Running test after 134.930176973343 seconds from startup
  Testing if: Outbound federation can query profile data... PASS
 # Presuming ability 'can_set_displayname'
  Testing if: Inbound federation can query profile data... PASS
 # Killing synapse servers 

Which suggests this is haproxy-related.

(The 10apidoc/03events-initial.pl test is required in order to trigger sytest into creating the SPYGLASS_USER fixture early before that 2 minute sleep, otherwise that is the part that fails due to a timeout.)