mrc-ide / covid-sim

This is the COVID-19 CovidSim microsimulation model developed by the MRC Centre for Global Infectious Disease Analysis hosted at Imperial College, London.
GNU General Public License v3.0
1.23k stars · 257 forks

Add unit test suite #166

Closed: Polarisation closed this issue 4 years ago

Polarisation commented 4 years ago

A model of such importance should have a suite of unit tests to ensure each component of the model is robust and free of bugs.

Unit tests are best practice in software development. Whilst they are sometimes skipped due to limited resources, they should never be skipped when the software in question plays a role in determining whether people live or die.

As I understand, an earlier version of this software model has influenced public health policy. That public health policy determines how many (and when) people will die from COVID-19, and impacts the quality of life for many millions of people.

Unit tests should be mandatory for projects of this significance.

Please add a unit test suite.

cjgdev commented 4 years ago

If the project maintainers nominate a test suite and/or give the go-ahead to add one, then I'm more than happy to undertake this work.

spiderkat99 commented 4 years ago

A model of such importance must follow tried and tested software engineering techniques, with an associated underlying process, not just be coded.

The addition of unit tests is insufficient in that the tests need to be based on documented and peer reviewed software requirements.

Aerospace software must show compliance with DO-178B/C; this software should have been developed to these or another set of applicable guidelines to ensure an acceptable level of quality.

Feynstein commented 4 years ago

This software is not used outside of academia when there is no pandemic going on. Also, in most scientific algorithms, unit tests are useless. What you need are automated functional tests, in order to prevent regressions. There is no need to add exceptions in this kind of code, because it is intended to be used by experts. There are no edge cases to control, as the user is assumed to know what he's doing. So validating edge cases through unit tests is a waste of time, considering the science is the most important value of this kind of algorithm. Best case scenario, you use asserts.

spiderkat99 commented 4 years ago

Feynstein - I think that we are in broad agreement here. Tests against software requirements are by their very nature 'functional' and are generally referred to as requirements-based tests. This set of tests will be automated and initially verify that the software meets the requirements by having pass/fail criteria. As they are automated, they can be re-run after each release, thereby providing the 'regression' testing to ensure that nothing has broken.

I'm not sure why you included exceptions and asserts above, as these are coding issues, and this thread is concerned with testing, which should really be expanded to software verification, as this includes a combination of testing and peer reviews.

Feynstein commented 4 years ago

Well, in my scientific software experience, unit tests are mostly used to cover edge cases in production code, like checking all the throws. Maybe I haven't seen the full extent of their use.

phelps-sg commented 4 years ago

Unit tests are still vital in scientific-computing work, and especially for simulation models like this. The entities in this simulation are supposed to behave in some way like their counterparts in the real world. For example, if we shut down the schools, then people should not go to school, and should not become infected at school. Due to a bug, school attendance and school infections could still occur even after we shut the schools down. Therefore we should test that if we shut the schools, then nobody goes to school, and nobody is infected at school.

spiderkat99 commented 4 years ago

I think that we are in agreement that there should be some type of testing but the argument surrounds what type of testing is most appropriate.

I cannot comment on testing in scientific computing, as I have never worked on scientific applications, but I can first give you some background on why I am suggesting what I called above requirements-based tests, or more informally, functional testing.

For nearly 40 years I have worked as a developer/tester/manager of airborne safety-critical software systems. I may be telling you something that you already know, but if an airborne safety-critical software system fails then there is a very high probability that the aircraft will crash and many will die, and therefore I have always taken my work very seriously.

Here are a couple of examples of the types of systems that I have worked on. When I was working in the UK, some years ago, I was involved in the development of the software for the European Fighter Aircraft, as it then was (now the Eurofighter Typhoon). I was the technical lead for 4 systems on the aircraft, and also wrote the software standards and managed the software toolset.

Since moving to the US I worked on various systems on the A380 - FMS, Secondary Power System and the Thrust Reverser. Also many other systems, an interesting one to look up on Wikipedia is a FADEC.

The point of the above is to say that I am speaking from experience but not the same type of experience as you guys.

Back to unit vs requirements-based tests. From my perspective, unit tests are written against the code and simply confirm that what you have written in the code is what you wanted in the code. This may be OK, but it is not sufficient for me, as I want to know that the software actually does what I 'required' it to do. There is a subtle but very significant difference here.

I'll be frank here: unit tests as I know them are easy to write but provide little gain, as you still don't know if the software meets its requirements. Requirements is the key word here, as you need a detailed set of software requirements. This is actually the hardest task in airborne software development. The coding task is somewhat trivial in comparison and will generally take less than 15% of the whole development effort; in fact, the testing normally takes more effort than the coding.

Once there is a set of software requirements, the testing can begin, as all test cases are based solely on the requirements and not the code. In fact, the tester does not look at the code. Tools are used to ensure that code coverage at the statement, branch and condition level has been achieved, and also that tests trace back up to all requirements and that code traces back to requirements.

Not to put too fine a point on it, this is a huge task for a large system. For example, I managed the test team that tested an FMS system that was in excess of 2 million lines of source code.

This software is not safety critical, but there is an argument that it falls into the mission-critical category, as whatever the UK government says, the results made it change tack as far as the lockdown was concerned.

I can understand if you do not wish to follow this path, as it will be costly in terms of hours. I can't put an actual cost on it, as that would require a detailed exercise.

In a nutshell, the difference is: do you just want confirmation that the code as written is what you intended to write, or do you want to bite the bullet and see if the software really works, in that it meets its requirements?

In conclusion, the aircraft certifying authorities such as the FAA and EASA would not accept unit testing as an acceptable method to use; the accompanying system would never be certified, and so you can be assured that you are safe to fly after the lockdown.

xerocrypt commented 4 years ago

Unit testing is essential if the software is to play any role in determining public policy, and when lives depend on the correctness of the software. The problem is that this software cannot be unit tested without extensive refactoring, since the whole point is that the software should consist of discrete units with defined and testable behaviour.

insidedctm commented 4 years ago

> Unit testing is essential if the software is to play any role in determining public policy

Unit tests make it easier and less error-prone to maintain software. They are neither necessary nor sufficient to deliver reliable software; alternatives are code review, higher-level tests as outlined by @spiderkat99 and, in the case of software that makes predictions, evaluation of forecasts against actuals. I think unit tests are important, but you've overstated your case.

spiderkat99 commented 4 years ago

Steve - I think that some of the misunderstanding we have here comes from our different interpretations of what a unit test is, as what you describe is a perfectly acceptable functional test .....

The requirement that we are testing is, informally:

'Req 001 - If we close all the schools then no kids shall go to school.'

That requirement was pretty easy, have you got some more?

spiderkat99 commented 4 years ago

insidedctm - These are not actually what I would call high-level tests, but then again it's a question of semantics. I also agree that all artifacts that are developed should be peer reviewed, as I indicated above.

Also I am quite happy to overplay my hand as I would always err on the conservative side of software development.

As an example in the past I have analyzed the output from a compiler to ensure that it doesn't generate any machine code that is not traceable to the source code.

I don't advise attempting this here !!

spiderkat99 commented 4 years ago

xerocrypt - I agree that the software as it is has too much complexity. To demonstrate this I plan to run a static analysis tool against it that will give us a McCabe complexity measure. I'll post the results.

Feynstein commented 4 years ago

Hi again, I will leave this here to further your discussion. I find it a very good tool for testing: https://github.com/catchorg/Catch2

spiderkat99 commented 4 years ago

COVID Complexity.zip

I ran the static analysis tool and here are the program complexity metrics that it generated.

The output only works properly on the PC that the tool generated it on. The links to files won't work, as they are absolute path names on my PC.

Unzip and then click on one of the HTML files.

The alphabetical index links (&ABCDEFGHIJKLMNOPQRSTUVWXYZ) will work ....

A bit of an explanation to get started:

Cyclomatic is the metric I was interested in; this is the McCabe complexity.

According to the literature, anything more than 10 is too complex.

When I used it, it was part of the Coding Standards and we set it to 12 to give us a bit of leeway. Anything over 12 had to be analyzed and documented.

Example from COVID code:

| Entity | Cyclomatic | Modified | Strict | Essential | Nesting | Path Count | Path Count Log |
| --- | --- | --- | --- | --- | --- | --- | --- |
| AssignPeopleToPlaces | 112 | 112 | 132 | 20 | 15 | 999999999 | 16 |

The McCabe complexity is 112; I've never seen anything > 25, as far as I can remember.

This function has also effectively busted the tool, as it gave up when the path count got to 999999999.

If there were some form of QA audit on this code that looked at McCabe, this code would just be busted.

I don't know if any of you guys in the UK know of Martyn Thomas; he has advised the UK Government and the Commission of the European Union on policy in the fields of software engineering and VLSI design. If he took a look at these metrics, he'd have a seizure.

Good night for now ...

xerocrypt commented 4 years ago

> Unit tests make it easier and less error-prone to maintain software. They are neither necessary nor sufficient to deliver reliable software; alternatives are code review, higher-level tests as outlined by @spiderkat99 and, in the case of software that makes predictions, evaluation of forecasts against actuals. I think unit tests are important, but you've overstated your case.

A unit test suite with adequate coverage would itself provide useful documentation, and would make it far easier for reviewers to understand the steps involved in generating the simulation. That matters especially in software like this, where a minor undiscovered defect could lead to results that are way off.

spiderkat99 commented 4 years ago

Getting 'adequate coverage' is impossible with the current code; look at and understand the metrics ...

xerocrypt commented 4 years ago

> Getting 'adequate coverage' is impossible with the current code; look at and understand the metrics ...

Yes, that's why making the program modular would be (should have been) a priority. Breaking the program down into testable units might be doable, but creating the tests themselves would be extremely time consuming.

spiderkat99 commented 4 years ago

OK, so give us some details about your unit test strategy so that it can be evaluated. Just saying let's do unit testing doesn't buy us anything.

xerocrypt commented 4 years ago

Well, I was playing with the idea of adding a unit test for each function I add as I shift duplicate code into generic functions; there seems to be a lot of that in CalcInfSusc.cpp, for example. Instead of having multiple duplicate blocks, we'd have a single testable function that could be called as needed. That's just one very small example. At the very least, we'd know the behaviour of whatever functions are being added, and they'd be easier to document too.

spiderkat99 commented 4 years ago

I go back to my previous comment: you're just ensuring that the function performs as you think it should, but you still have no idea whether the results are 'correct'. What are the tests based on? In my suggestion, they are based on documented requirements, that is, what we expect the software to do. I've given a very simplistic example of a requirement in another comment, so you need to give a concrete example of how a function will be tested and how that demonstrates that the software gives the correct results.

spiderkat99 commented 4 years ago

Also, how do you know that you haven't missed any functions? That is, if required functionality is not in the code, you won't test it, as you haven't got any software requirements. Where did all these functions come from? They can't just be plucked out of thin air.

xerocrypt commented 4 years ago

I'm referring to duplicate code that's shifted into generic functions to tidy things up, not stuff that's been added for no reason. It's just refactoring what's already there.

I'm going to email Boris Johnson the unit test scripts, so this modelling business will be clear as day. He will understand me.

spiderkat99 commented 4 years ago

Yes, I understand: what you describe is effectively chopping up the existing code into a number of smaller functions, which is the way it should have been written. But it's still not clear how they are tested to show that, at the end of the day, the correct functionality has been implemented correctly.

I know this is a repeat, but how do you know that some functionality that is required has not been implemented? This is a really basic defect: if it's not in the code, then you won't test it.

spiderkat99 commented 4 years ago

I fear for Boris ... he may agree with you as he hasn't had much idea about the pandemic so far.

xerocrypt commented 4 years ago

My concern is that we have sod all in the way of documentation, and a set of test scripts would at least be a step in that direction. It would provide a reference for the actual behaviour of whatever modules are tested, even if that's nowhere near as useful as it would have been had the tests been set up from the beginning.

spiderkat99 commented 4 years ago

I agree with you there, in that whatever we actually do is too little, too late.

I won't actually state publicly what I think about the quality of this software, but from what I've written above I think you can get an appreciation of my view.

I have enjoyed this discussion, but it's time for bed here in Arizona. Keep it coming, as I like it.

hipparchus2000 commented 4 years ago

It would be easiest to write a functional test that enables a test mode in which the race condition in setup is eliminated by a single-thread mode, and the RNG determinism issue is eliminated by a fixed number of threads and fixed seeds for those RNGs. Add a subset of test data input with expected output. The test output itself could be validated against a sigmoid (Gompertz) curve, and maybe some sigmoid projections where the exponential expansion and exponential slowing constants change a few times over the represented time series.

cjgdev commented 4 years ago

@dlaydon, @weshinsley: Based on the comments here, I believe we have some questions for the project maintainers which would be constructive in moving the discussion along.

  1. There appears to be a consensus that additional testing above the existing regression tests would be a constructive way of improving the code. Do the maintainers have a strong opinion for or against?

  2. A more rigorous approach to behavioural testing has been proposed as a maximally productive way of improving the code by testing existing behaviour, however this requires documenting it such that this can be translated into a list of requirements. Given that the maintainers are currently operating under unprecedented pressure, is there bandwidth available to be able to produce this sort of documentation?

  3. A tool for measuring code quality metrics has revealed that there is an issue with code complexity, and if tackled in the right way then it potentially addresses issues around modularity and the ease with which the project may be maintained. Are the project maintainers in favour of the community undertaking this degree of refactoring work for the reasons given?

  4. Introducing a unit-testing framework (such as Catch2, as proposed by @Feynstein) and a suite of unit-tests is quite a large chunk of work. I propose that a good starting point might be to add a placeholder unit-test suite to CTest, and then integrate that into the CI system such that the unit-tests are run as Github Actions as a pre-requisite to accepting pull requests, plus changes pushed to any branch. This way we can break down the work of lowering complexity and improving code coverage into smaller tasks which is simpler to prioritise and delegate to individual contributors. Are the project maintainers in favour of adding a unit-test suite, in the proposed manner?
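
If the maintainers are in favour of point 4, the CI wiring might start from a workflow along these lines. This is a sketch under stated assumptions: the file name, triggers and build commands are guesses, not the project's actual CI configuration.

```yaml
# Hypothetical .github/workflows/unit-tests.yml; names, triggers and
# build commands are assumptions, not the project's actual CI setup.
name: unit-tests
on: [push, pull_request]
jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v2
      - name: Configure
        run: cmake -S . -B build -DBUILD_TESTING=ON
      - name: Build
        run: cmake --build build
      - name: Run unit tests
        run: cd build && ctest --output-on-failure
```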

Martin-Spamer commented 4 years ago

> This software is not used outside of academia when there is no pandemic going on. Also, in most scientific algorithms, unit tests are useless. What you need are automated functional tests, in order to prevent regressions. There is no need to add exceptions in this kind of code, because it is intended to be used by experts. There are no edge cases to control, as the user is assumed to know what he's doing. So validating edge cases through unit tests is a waste of time, considering the science is the most important value of this kind of algorithm. Best case scenario, you use asserts.

While unit tests are not appropriate for a changing model, they are very much appropriate for supporting code: IO, reused data structures, maths functions.

> Well, in my scientific software experience, unit tests are mostly used to cover edge cases in production code, like checking all the throws. Maybe I haven't seen the full extent of their use.

Boundary/edge cases are just one area where they are useful, but unit tests are also beneficial for establishing that the happy path produces the expected results, especially for automated regression testing. Automated software testing is my specialism, and I always start with the happy path.

zebmason commented 4 years ago

I've added unit tests in #213, although they aren't built out of the box, as I don't know how to get CI to do a git submodule update.

Feynstein commented 4 years ago

@cjgdev I don't think it would be very hard to add a Catch2 suite of tests parallel to the main project that calls its methods. If you add a /tests folder parallel to /src and start from scratch in there, it only requires adding the Catch header. This is to be considered.
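
A minimal sketch of that parallel /tests folder, assuming CMake's FetchContent is acceptable for pulling in Catch2; the target name and pinned tag are illustrative, not project decisions.

```cmake
# Hypothetical tests/CMakeLists.txt for a Catch2 suite living parallel
# to src/; the target name and pinned tag are illustrative.
include(FetchContent)
FetchContent_Declare(
  Catch2
  GIT_REPOSITORY https://github.com/catchorg/Catch2.git
  GIT_TAG        v2.13.0
)
FetchContent_MakeAvailable(Catch2)

add_executable(covidsim_tests test_main.cpp)
target_link_libraries(covidsim_tests PRIVATE Catch2::Catch2)
add_test(NAME covidsim_tests COMMAND covidsim_tests)
```

With Catch2 v2, test_main.cpp then needs only `#define CATCH_CONFIG_MAIN` and `#include <catch2/catch.hpp>` before the first `TEST_CASE`.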

mattwynne commented 4 years ago

Perhaps you could use approval / golden master tests for the time being, to give others some confidence about making changes to the code without breaking things?

Basically big chunky input/output tests that use existing data to firm up what it does today.

insidedctm commented 4 years ago

> @dlaydon, @weshinsley: Based on the comments here, I believe we have some questions for the project maintainers which would be constructive in moving the discussion along. [...]

Was there any further discussion on this?

zebmason commented 4 years ago

@insidedctm Have been pulling various bits out of #213, and the GoogleTest stuff has morphed into #258. Some VMs need updating to a later version of CMake before this.

As to tools, I like to make my own. On testing, I actually prefer BDD; I don't like the overhead of Cucumber. As to code quality metrics, I have my own views, which I've implemented.

insidedctm commented 4 years ago

Thanks I’ll have a look at those items