Add unit tests to the project

lkollar commented 5 years ago

At the moment a large portion if the code base is not covered with unit tests. There are quite a few integration or functional tests which run the entire application to validate the overall behaviour, but ideally we should have full unit test coverage around the whole codebase.

Once this is done we can enable a coverage service on the repo to make sure the coverage level doesn't drop during contributions.

I plan to keep this issue to track the PRs needed to add unit tests. It's probably best to break this into separate contributions to ease reviews.

Files mentioned in #126 should be covered by the vendored projects' own unit tests.

lkollar commented 5 years ago

Let's reopen this as I intend to keep for the rest of the unit tests.

lkollar commented 5 years ago

I wanted to address @takluyver's request on the overall plan for testing in this repo as part of this issue.

I should have probably better explained what my intention was with the plan to add unit test coverage. When I first started contributing to auditwheel I found it difficult to validate the changes I made as the test suite is very high-level, for the most part exercising the entire application -- it is quite difficult to make a small change to a part of the application and test just that part in isolation, rather than spinning up Docker containers and installing stuff with pip. It should be much easier to validate simple changes and I think that adding unit tests will help contributions as a result.

I'm sure the other maintainers are also aware of concept of the testing pyramid but just for completeness I wanted to sum up the idea here: separating higher level tests which exercise the entire application from unit tests which validate individual components provide multiple benefits. The most important one in my view is that it speeds up the feedback cycle when working on the code. Given that a large part of auditwheel's test suite runs in Docker and already takes a long time to run I think it is very important to start creating unit tests which run faster and make debugging test failures easier.

How exactly unit tests should be implemented is very subjective and I agree with @takluyver that we should all agree on the approach (assuming we can agree that we need unit tests in the first place). I don't think that 100% code coverage is necessary, but I think that the program's behaviour as observed by the user should be validated, e.g. if the tool is executed on a non-Linux system, a unit test should validate that auditwheel fails with an error. While I agree that the test can become coupled with the implementation, especially when this is simply a platform check, this will also document the behaviour and enable future contributors to retain the behaviour as observed by users today. One could say that this is easy to achieve by looking at the code, but I think this is a subjective choice and following this principle we will make bad decisions where a missing test will lead to bugs. By not making this choice and always testing these observable behaviours, we have a few extra lines of test code we have to maintain, but we can be confident that we won't introduce bugs as a result of a subjective decision whether it makes sense to add tests or not.

As an additional data point, we can already see contributors adding unit tests as seen in @di's recent PR. I found this test much more useful to help me understand the change than the usual tests which involve an entire auditwheel invocation inside Docker (which is also useful on its own of course).

So to sum up I think we have the following decisions to make:

Should we keep separate unit and integration tests?
Should we aim for 100% code coverage? If not, what should be the criteria when deciding what needs a test and we can be skipped?

Calling @takluyver @mayeut @auvipy to the discussion. Given that we are relatively new maintainers on the project @njsmith might want to weight in as well.

takluyver commented 5 years ago

Thanks @lkollar. I do think that it's worth having both unit and integration tests, although I see them more as labels for a gradient of tests than distinct categories.

My proposed way of thinking about when tests are useful is like this:

A test or a test suite is a prediction tool. We want to know "does this code have bugs?" (All code has bugs, so read that as 'bugs we care about'). Ideally the test suite fails if there are bugs, and passes if not. But it can also have false negatives - passing when there are bugs - and false negatives - failing when there aren't, either because of randomness or because the test is wrong.

There are lots of tools and techniques for reducing false negatives. Coverage measurements are one: your test can't find bugs in code it doesn't run. But a good prediction tool should also have a low rate of false positives. I.e. when your tests fail, that should be because there's a bug, not because the tests are broken.

So, when considering a new test, I try to imagine what could make the test fail. These could be bugs that are introduced which the test identifies, or problems which need a change in the test, or random external conditions (e.g. the test tries to make a network connection to a server which could go down). A useful test is one where the true positives are at least as plausible as the false positives.

Taking the proposed platform check test in #169 as a specific case:

False positives (test needs changing)
- auditwheel gains support for running on Mac
- platform detection switched to another mechanism, e.g. the platform module.
True positives (bug found)
- ?

I'm genuinely trying to come up with a plausible true positive case, but I can't see one. As far as I can tell, if that test fails, there's a nearly 100% chance that we change the test, not the code it's testing. So it doesn't meet my criteria of a useful test.

You might argue that it's very unlikely to ever fail, so what's the problem? Maybe this bit of the code will never change. But if we add hundreds of tests asserting on every detail of what the code currently does, then changing anything will cause test failures. We could update failing tests whenever we change anything, but makes every change more work. And the tests can't be a good indicator of what we care about preserving if we're always having to change them.

As one rule of thumb, I think any test which uses mocking/monkeypatching has plausible false positives - e.g. if you're mocking out argparse objects, the test would fail if the code switched to another CLI tool like click. So tests which need that have a higher bar to clear in terms of plausible true positives.

mayeut commented 5 years ago

I do think that it's worth having both unit and integration tests, although I see them more as labels for a gradient of tests than distinct categories.

Same here

Should we aim for 100% code coverage? If not, what should be the criteria when deciding what needs a test and we can be skipped?

Keeping coverage at 100% whatever the cost is counterproductive for sure. In the end, it might be a subjective criteria where we have to weight pros and cons of a given test (true positives and false positives are just a part of pros and cons). If I take the same example as @takluyver, I'd add one true positive:

the check has been removed for no good reason, went unnoticed during review.

If I take a look at the false positives mentioned,

auditwheel gains support for running on Mac
platform detection switched to another mechanism, e.g. the platform module.

my subjective opinion is that a test should be added. The true-positive outweighs the false positive in this case even if one can argue that it's not a 'bug we care about'. That being said, I wouldn't write it the way it's written parsing a human readable string (as already mentioned in the comments). Things 'we don't care about much' should have very simple tests and the way the test is written should be as little intrusive as possible and should weigh in the decision to include the test or not.

This is my 2 cents and I'm sure that doesn't help much to define a clear target.

lkollar commented 5 years ago

I'll try to sum up my conclusions here from the discussion:

100% coverage is not useful enough to offset the churn due to the extra tests which need constant updating. With #170 we will have Codecov integration so we can easily track how the coverage numbers move and see what parts of the code base miss tests. I think this is a step in the right direction and as long as we can make sure we don't have large and important parts of the code base missing tests, we don't need 100% coverage.
It seems that don't have clear consensus on what should be tested and this will probably remain a subjective choice. I still think that attempting to predict true positive test failures is futile (if humans were good at this we would have few or no bugs at all), but I do agree that we should thrive to keep the tests decoupled from the implementation as much as possible.

So on my part I will not attempt to keep the 100% coverage in future PRs and try to make the unit tests decoupled from the implementation as much as possible. I'd still like to make sure that unit tests do not depend on their environment, so some amount of mocking will be still necessary.

I also think that refactoring the code base will help reduce coupling in tests, but I wanted to first build up the unit test coverage and only start refactoring afterwards.

pypa / auditwheel

Add unit tests to the project #163