Rewrite e2e tests into 'unittest' format

(This is a refactoring/cleanup task.)

The end to end tests currently send off a set of HTTP requests in a particular order, and each test depends on the previous one succeeding.

This has a couple of issues:

Tests aren't very readable

It's not super obvious what a line like this does:

        ['retrieve programs after deleting saved program', 'get', '/programs_list', {}, {}, 200, retrieveProgramsBefore],

For example, what do the {} in there mean? What does retrieveProgramsBefore do? Why is it there?

Tests aren't self-contained

For example, the following 2 tests:

        ['valid signup', 'post', '/auth/signup', {}, {'username': username, 'password': 'foobar', 'email': username + '@e2e-testing.com'}, 200, successfulSignup],
        # ...
        # 10 unrelated lines follow
        # ...
        ['valid login', 'post', '/auth/login', {}, {'username': username, 'password': 'foobar'}, 200, setSentCookies],

Depend on each other. The first one creates a user, and the second one tests that a login with the user that was created before. The fact that this relationship exists is not clear, however. The following would be a more readable of the same test:

def test_login_works():
  # GIVEN
  user_exists(username, 'foobar')

  # THEN
  login(username, 'foobar').assert_successful()

(Or something similar, just making up an API here)

Because tests aren't self-contained, we also can't choose to run a single individual test--we must always run the entire suite.

Progress and error reporting isn't standardized

If tests were in a standardized format, we could use standardized test runners like py.test to run individual tests, and have nicer error and progress reporting.

The purpose of supporting code will become more clear

The e2e_tests.py file currently has def getProfile1() through def getProfile5(), with what looks like complicated logic inside. It's not clear what their purpose is, because the logic is far away from the actual test case that it supports.

Here's a good thing about the current test setup that we should probably try to maintain (or at least evaluate if it's necessary or not): they don't duplicate work, and will therefore be fast and there are some assumptions that can be made about state.

A user that gets created at the start of the test run and can thereafter be assumed to always exist, doesn't need to be created twice or recreated. Compare:

def test_login_works():
  # GIVEN
  user_exists(username, 'foobar')

  # THEN
  login(username, 'foobar').assert_successful()

def test_saving_programs_works():
  # GIVEN
  user_exists(username, 'foobar')
  login(username, 'foobar').assert_successful()

  # WHEN
  save_program('myprogram', 5, 'print hallo')

  # THEN
  this.expectEquals(list_programs(), ['myprogram')

May have some issues that need to be resolved:

Do we create the same user for both tests? If so, the test that goes second will fail, as the user already exists.
Do we create a new user for each test? If so, then first of all the tests will be slower since every test is going to have to create a user, and also we will accumulate a whole bunch of test users in our database, which will all need to be cleaned up.
If, on the other hand, we try to save user accounts and test_saving_programs_works reuses an existing user account, and some other test gets added that ALSO tests something around saving programs, then list_programs() may all of a sudden start returning 2 or more programs instead of the one the test is expecting.

So there are things that the solution to this would need to take into account and have a good answer for.

Unfortunately, I don't know the best solution myself yet, but I'm sure someone can come up with something clever here 😊

Hi @rix0rrr ! Thank you for bringing up this topic. These are all valid points. Since I'm the one that got those e2e tests written in place, I might have some explaining to do 😅

The existing test structure is quite unorthodox, but it has a certain structure. The basic outline is that each test maps to a single HTTP request, and is written as a list. The shape is [label, method, path, headers, body, expectedCode, afterFunction], the afterFunction being optional.

Despite this (undocumented) regularity, all your criticisms are valid. I've used opaque names for many of the afterFunctions (such as getProfile5) and that's just plain lazy. Looking back on these particular functions, their operations are probably too long/specific to be put in a camel cased name, but they could be summarized in a comment at their top. For example, getProfile5 checks for the modification of programming experience in the profile after a used has marked that (s)he has no programming experience. At the very least I could put those comments at the top of each afterFunction. In general, almost all of the code in those functions validates the bodies returned by the server. They also sometimes set state to be used by following functions.

Functionally speaking, I also believe it's a must to have a modular test suite, rather than a whole indivisible. I'm actually now in the process of converting a monolithic suite of e2e tests into a modular one, and it's been totally worth it. Some things I learned that we could apply here:

It's OK to create and then destroy users at the beginning and end of each module. For running that module itself, there's no performance penalty. And when combining test suites, the only cost is that the tests take longer to run.
In general, all modules should do their own cleanup after they're done, so that the side effects they generate do not affect other modules. This does start to feel less like a E2E and more like a unit test of a section of the API, and it may well be. In practice, the carrying down of state/side effects from a test to another one in the distant future just makes for murky tests and it's easy to lose track of what you're really testing.
An alternative is to make an exception just for auth and create a bunch of users at the beginning of the test, and then make each module require the auth module of the tests. However, this should include storing the cookies for each login, in order to reuse them. With the clean-slate approach, most of the time spent by auth is actually spent in logins, not signups, because some interactions require a lot of back and forth (for an example of this, see the teacher/student interaction regarding classes).

I think that, as you mention, it could vastly improve readability if we could group tests in functions that contain a set of requests. Since Python is synchronous by default, this is a much more palatable proposal than in Javascript. We would still need several functions to test (say) the auth functionality, but they could be grouped logically. Whether to consider a function as a test module, or to consider a bunch of functions a module instead, is a good question.

There's two existing features from the current test I do suggest we keep: 1) the stopping of the test suite at the first error, to ensure we don't ignore warnings or even serious errors; 2) a preservation of the exclusiveness of interaction with the API, without using any backdoors to put things in the database. Sometimes it's really annoying to do everything through the API, but in the end it ends up being worth it.

If you have a particular test runner in mind, I could start by creating an alternative suite of tests for a particular module of the app (perhaps auth would be a good one to start) and we could continue the conversation over some refactored code. Would we still be using the existing code for triggering HTTP requests, or do you envision replacing that as well?

Cheers!

I agree with what you're saying.

1) the stopping of the test suite at the first error,

This is typically a feature of the test runner, and we'll get this for free. For example, if we use pytest as the test runner, we can pytest -x e2e_tests.py and the -x makes it stop at the first error.

2) a preservation of the exclusiveness of interaction with the API, without using any backdoors to put things in the database

Yep, I agree.

If you have a particular test runner in mind

We can separate out "test runner", "test definition framework" and "assertions library". The "test runner" is the least interesting of the bunch, since it is only a "UI" over the rest, and people can pick their own. They can skip the test runner in favor of their IDE or a different runner of their choice. The choice we'll make here will affect what the build server uses (and how it prints output and when it stops on error), but that's about it.

So I would say:

Test runner: pytest

For test definition framework (how do you define a test to run) and assertions library (how do you write assertions): pytest does come with some new capabilities of its own, but I'm thinking that for maximum understandability with other people who might be working on this code base, we should probably just be sticking with the unittest library that comes with Python 3 by default. It's not the most succinct, and it doesn't have the most elaborate set of assertions, but it's definition good enough and it will be familiar to most people. So I would say:

Test definitions/assertions: unittest

The biggest challenge is going to be coming up with a set of helper functions that make it easy to express what is happening in a test, while hiding away irrelevant details. The goal should be: any function should take the bare minimum of arguments necessary to express an action from the USER's point of view, and implicitly take care of all the rest.

We should pay extra close attention to lifetime of state and variables. For example: a browser session is going to maintain cookies. We should take care to start with a clean cookie jar at the start of each test, so that for every new test it will look like a fresh browser session and they don't (accidentally or otherwise) depend on tests that have run before and left a cookie in there. If we don't pay attention, what's going to happen is that the test is going to succeed when run as part of the suite, and fail when run individually (or vice versa), and that's going to drive us insane.

Would we still be using the existing code for triggering HTTP requests, or do you envision replacing that as well?

The mechanisms in the code are good! It's just that there is now one API which needs to handle all cases, which makes it intimidating to work with. I'm thinking we should probably have different APIs for different cases.

I think for maximum understandability, we should encourage tests to be written in given/when/then style.

Given: from the point of view of the user, describes the context in which the test is valid (assumptions). This describes a state.
When: describes the action the user takes. This describes an action.
Then: describes what needs to be true after the action has happened. This describes a state.

For example, a given-when-then test would look like this:

class TestLoggingIn(HedyTestSuite):
  def test_logged_in_user_can_load_my_programs_page(self):
    # GIVEN
    self.assume_logged_in()

    # WHEN
    webpage = self.load_page('/profile/programs')

    # THEN
    webpage.assert_success()

(* In this code, the lines are so trivial that the # GIVEN comment lines maybe are superfluous, and if so we can leave them out).

This is reads very clearly: "assume we are logged in. We can then successfully load the 'programs' page.", but how do we even implement this? Assumptions aren't executable. The simple trick is that even though assume_logged_in is named from the point of view of the test and describes an assumption, it would execute the necessary steps to make the assumption true. So in this case, it would go through a login action (and perhaps even a register_user action if necessary). After the function finishes execution, the system is now in the state that the function name describes, and the test can proceed.

I'm guessing the class that makes this possible looks somewhat like this:

class HedyHttpTestSuite:
  """Base class for API tests."""
  def setUp(self):
    self.username = None
    self.http = requests.Session()

  def assume_logged_in(self):
    self.username = 'testuser-' + str(random.random())
    self.register_user(self.username, 'password').assert_success()
    self.login(self.username, 'password').assert_success()

  def login(self, username, password);
    return self.send_form('/auth/login', {
      "username": username,
      "password": password,
    })

  def load_page(self, url):
    return Webpage(self.http.get(url), self)

  def send_form(self, url, form_data=None):
    return Webpage(self.http.post(url, data=form_encode(form_data))

class Webpage:
  """A wrapper around a requests.Response with convenience methods."""
  def __init__(self, response, test):
    self.response = response

    # self.test only exists so that we can access 'test.assertEquals()'
    self.test = test

  def assert_success(self):
    self.test.assertEquals(200, self.response.http_code)

Obviously, the actual creation of users might be different: we might remember ones that we've already created and not make a new one for each teest.

By the way, just found out that we can also test Flask apps without going over ACTUAL http: https://flask.palletsprojects.com/en/2.0.x/testing/

Hi @rix0rrr ! Thank you for the super clear guidelines. I'll be working on it and keep you and @Felienne updated. I'll start with the auth, since we need those functions anyway and try to open a PR as early as possible, to get feedback on the overall structure and approach. Cheers!

HI @rix0rrr & @Felienne ! I just committed an initial version of the refactored E2E tests, for early feedback. The code is here: https://github.com/Felienne/hedy/blob/e2e-refactor/tests_e2e.py

Some notes:

You can test it by checking out the e2e-refactor branch and running pytest -q tests_e2e.py.
The file is currently at the root of the repo for convenience but I plan to move it to tests soon.
I used pytest as a runner and unittest for assertions.
I separated the functionality for performing HTTP requests and assumed as little as possible on that function.
I created helpers for auth. I also used a global dict for holding user information that can be reused (if necessary or convenient) from request to request.
The existing refactored tests cover signup, login and verify.
I would next tackle something that requires already auth but is not strictly related to auth, to see how that can leverage the auth helpers.

All feedback is appreciated!

hedyorg / hedy