Closed: jMyles closed this issue 4 years ago.
Wow. Tests are just comparing hashes of the results??
How is it possible that country-wide policies are based on that?
As a software engineer, I'm appalled at the quality of this code and the role it's played in public policy. The deficits in testing and quality assurance need to be addressed immediately to assure that the claims made from its data are valid.
At a time when faith in scientific models is more important than ever, it's truly disheartening to see that such widely used models are based on such faulty testing logic.
Replication is essential for robust validation. It's one thing for a model's assumptions to be incorrect; however, if a model used for life-and-death policy decisions cannot replicate its calculations, it goes from being scientifically useless to being dangerously negligent.
:+1:
👍
This will be a devastating hit to the public perception of science, and I'm sorely disappointed. It has taken serious effort to get the public to trust scientists and evidence-based policy, and code like this threatens to ruin that trust. Any software engineer can get a sense of the code quality here in less than a minute. Non-deterministic results (even with the same seed!), poor documentation, and almost zero testing? I expected better from such a world-class team.
Also, any government that based its predictions on this model should openly share the data so it can be re-evaluated.
Right, but this is still an obvious step up from no science influencing public policy. A lack of tests doesn't ensure or even imply incorrect code.
While you're right that no science influencing public policy is a terrifying and ever-more-realistic thought, having a model such as this - one that will inevitably come out as faulty and poorly designed - will allow those in public policy to make faulty (or calculated, to push an agenda) generalizations in the vein of "See what happened with this model? This is why we can't trust any scientific models!"
This is not inherently a step backwards, but could potentially put the public's trust in science (something I still have trouble believing is even a concept) in jeopardy, especially if capitalized upon by politicians with ulterior motives.
A lack of tests does ensure that the code is incorrect. This may not be important for a game, but the code must be deemed incorrect when it guides policies affecting people's lives. It is no more a step up from having no science influence policy than witchcraft would be.
This is an embarrassing piece of work. Non-reproducible experiments are NOT science. The completely unscientific nature of this project explains very well why politicians were eager to jump on board.
Proper automated testing is one of the top indicators of a professional compared to a hobbyist.
This issue doesn't reference any specific issue with the software or explain why, in the poster's opinion, it is a problem. I suggest this issue be closed.
I totally agree. As a software engineer myself, I know we've all written bad code like this at least once. But I cannot condone the use of such poor-quality software for policy-making. Even though I can sympathize with the effort to make the original code more legible and usable, we still NEED the original source code for proper auditing, AND we need the original input data used to produce the results that have been heavily publicized in the media and used by governments. We must be sure that we can reproduce the same results from the same data, using the same code.
This is some of the worst code I have ever seen; there is no way of knowing what it is doing because of the giant chunks of code with bad variable names and no tests. It is on par with some of the worst first-year code I used to mark.
@insidedctm As I have pointed out, the tests don't support the actions taken by the operator of this repository, the MRC Center, in publishing this study and advocating for particular public policy.
The action to fix this Issue is to retract the studies based on it, which is within the power of the operator of the repository.
The circumstances of disposition are both unambiguous and within the reach of the repository operator. What other details of this bug are necessary for it to be well-formed?
@jMyles I can't recall any open source software that operates that way. My opinion remains - file specific issues, open PRs to improve. This issue doesn't in any way assist that, so it should be closed.
Can we have a list of the work that has been based on this code?
Not feature complete. It doesn’t simulate violating lockdowns to sleep with married women.
As a scientist myself, but also with an MSc in computer science, I understand that software engineers will have a different approach here. For software engineers, tests are usually automated, numerous and included in the repository.
However, bear in mind that scientists will perform many tests of their code, including many visualisations of different aspects of their implementations. These are usually thorough, performed and analyzed by multiple people. However, they may not necessarily be automated. Plausibility and cross-checking with other people's implementations are also important techniques, as are Monte Carlo simulations with known inputs (created with a separate implementation), to test whether the software recovers them correctly (a form of end-to-end test).
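As a sketch of that kind of recovery test (every name and number here is invented for illustration; nothing is taken from this repository): generate synthetic case counts with a deliberately simple, independent generator at a known growth rate, then check that the fitting code recovers that rate within tolerance.

```cpp
// Sketch of a parameter-recovery ("end-to-end") test: draw synthetic outbreak
// data from a known growth rate with an independent toy generator, then check
// that the fitting routine recovers it. simulate_outbreak and fit_growth_rate
// are hypothetical stand-ins, not functions from this repository.
#include <cassert>
#include <cmath>
#include <random>
#include <vector>

// Independent "truth" generator: noisy exponential case counts.
std::vector<double> simulate_outbreak(double growth_rate, int days, unsigned seed) {
    std::mt19937 rng(seed);
    std::vector<double> cases(days);
    for (int t = 0; t < days; ++t) {
        std::poisson_distribution<int> noise(10.0 * std::exp(growth_rate * t));
        cases[t] = noise(rng);
    }
    return cases;
}

// Toy stand-in for the code under test: least-squares slope of log(cases).
double fit_growth_rate(const std::vector<double>& cases) {
    double sx = 0, sy = 0, sxy = 0, sxx = 0;
    int n = 0;
    for (size_t t = 0; t < cases.size(); ++t) {
        if (cases[t] <= 0) continue;
        double x = static_cast<double>(t), y = std::log(cases[t]);
        sx += x; sy += y; sxy += x * y; sxx += x * x; ++n;
    }
    return (n * sxy - sx * sy) / (n * sxx - sx * sx);
}

int main() {
    const double true_rate = 0.15;
    for (unsigned seed = 1; seed <= 20; ++seed) {
        double recovered = fit_growth_rate(simulate_outbreak(true_rate, 60, seed));
        assert(std::fabs(recovered - true_rate) < 0.03);  // recovered within tolerance
    }
    return 0;
}
```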
You cannot conclude from the lack of tests included in this repository that no tests are being performed. All we know is that such tests are not included in this repository.
Aggressive language like the title implies is frequently heard as a reason why scientists do not release their code. I suspect this is not the outcome you are aiming for here.
As a positive way forward, perhaps a more charitable issue would be to ask for expanding the test base to include the other tests the authors performed, to add more unit tests, etc. Software engineers like yourself can help tremendously here. It has also been the case in other open-source science projects that because of outside help, testing and contributions, code bases become more robust, and thus trusted, by the scientific community. So your contributions can make a huge difference.
Not feature complete. It doesn’t simulate violating lockdowns to sleep with married women.
Stop. Seriously.
People should stop thinking that this simulation is the only factor in determining a country's anti-COVID-19 policy. In fact, there could be multiple groups with their own simulations competing with each other, but this is the only one whose authors dared to publish their code.
Let me stress a part of an excellent comment above:
"Aggressive language like the title implies is frequently heard as a reason why scientists do not release their code. I suspect this is not the outcome you are aiming for here."
Billions of lives have been disrupted worldwide on the basis that the study produced by the logic contained in this codebase is accurate, and since there are no tests to show that, the findings of this study (and any others based on this codebase) are not a sound basis for public policy at this time.
I don't think you understand how modelling works.
You don't test your model against the real world in some unit tests. The tests are to check you are loading and storing data correctly, moving it around the model, etc.
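As a minimal sketch of that kind of "plumbing" test (the key-value parameter format and the Params struct here are hypothetical, not this repository's actual input format): it checks only that inputs are parsed into the right fields, not that the model matches reality.

```cpp
// Minimal sketch of a data-loading unit test: verify a parameter file is
// parsed into the right fields. The file format and the Params struct are
// hypothetical illustrations, not this repository's actual format.
#include <cassert>
#include <cmath>
#include <istream>
#include <map>
#include <sstream>
#include <string>

struct Params {
    double r0 = 0.0;
    int incubation_days = 0;
};

// Parse "key value" pairs from a stream into a Params struct.
Params load_params(std::istream& in) {
    std::map<std::string, std::string> kv;
    std::string key, value;
    while (in >> key >> value) kv[key] = value;
    Params p;
    p.r0 = std::stod(kv.at("R0"));
    p.incubation_days = std::stoi(kv.at("incubation_days"));
    return p;
}

int main() {
    std::istringstream file("R0 2.4\nincubation_days 5\n");
    Params p = load_params(file);
    assert(std::fabs(p.r0 - 2.4) < 1e-9);
    assert(p.incubation_days == 5);
    return 0;
}
```

A suite of such checks guards the data handling without claiming anything about real-world accuracy.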
The real-world modelling comes from the range of parameters you start off with. They are often politically or socially led, and in any case will only give a rough approximation of what might happen in the real world.
Really, this issue does not help the cause.
You can trust programmers to be this arrogant...
You're trying to apply software development techniques to a discipline where testing is usually done in a much different way. Yes, we should all try to be perfect and write the most magnificent code with the most bulletproof testing methods, but reality is different.
How about you write tests that clearly prove that the results from this simulation are absolutely wrong? The code is right there! And you're a "software engineer"! Then start talking about retracting papers.
I think @JohannesBuchner hit the nail on the head. I think it's valuable to call this out, but rather than tear it down, maybe we should promote a culture of closer collaboration across communities? Make some of the testing that's been done more visible, refactor the code, iterate on the documentation and automate some of what is likely being done manually.
As a software engineer, I’m absolutely elated that the source code for a simulation playing a huge role in public policy has been open sourced. The deficits in testing and quality assurance can be rectified by an enormous open source community. Truly amazing times
This issue doesn't reference any specific issue with the software or explain why, in the poster's opinion, it is a problem. I suggest this issue be closed.
A lack of tests is itself a problem. In software engineering, we wouldn't trust this kind of code to run a wall clock, let alone to guide a pandemic response.
Wow, what a thread for a github issue. I will try and address a couple of these points, and then for the peace of us all, I will close this thread, since there's nothing in it we can directly action, and I doubt my comments will satisfy some very much. If anyone feels aggrieved by this, please open another issue, keeping it courteous, constructive, relevant to some particular code, and to the point please.
The tests in this repo are (obviously) just regression tests. They ensured that the major refactoring that went on in this code most recently did not cause deviation from a reference set of results. When any binary differences arose, they were investigated and resolved before merging. I agree, unit testing is always a good thing, and we are moving towards more of that, but right now more pressing issues are at hand.
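For readers unfamiliar with the pattern, here is a generic sketch of such a regression check, assuming a checksum-over-output-file approach; the hash choice, file names and harness are illustrative, not this project's actual test scripts.

```cpp
// Generic illustration of a regression test: checksum the new output and
// compare it against a stored reference checksum, so a refactor cannot
// silently change results. The hash and file layout are illustrative only.
#include <cstdint>
#include <fstream>
#include <iostream>
#include <string>

// Simple FNV-1a checksum of a file's bytes (stand-in for whatever hash is used).
uint64_t checksum_file(const std::string& path) {
    std::ifstream in(path, std::ios::binary);
    uint64_t h = 1469598103934665603ULL;
    char c;
    while (in.get(c)) {
        h ^= static_cast<unsigned char>(c);
        h *= 1099511628211ULL;
    }
    return h;
}

int main(int argc, char** argv) {
    if (argc != 3) {
        std::cerr << "usage: regress <new_output_file> <reference_checksum_file>\n";
        return 2;
    }
    uint64_t actual = checksum_file(argv[1]);
    uint64_t expected;
    std::ifstream ref(argv[2]);
    ref >> expected;
    if (actual != expected) {
        std::cerr << "regression: output differs from reference\n";
        return 1;   // any binary difference needs investigation before merging
    }
    std::cout << "output matches reference\n";
    return 0;
}
```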
As @JohannesBuchner and others accurately and helpfully pointed out, the vast majority of science tests cannot be expressed in code in a repo. Those who want to know more about this, or see the work over the years using this model should read the peer-reviewed papers listed in the readme, to see the kind of world of testing and calibration these algorithms go through in different settings, and the challenges faced in doing so.
To those who find the code ugly, untidy or difficult to understand: we have hardly claimed this to be a thing of beauty, and code in academia rarely is. The refactor has helped significantly, but the variable-name style at times shows the age of the code, often meaning something more obvious to an epidemiology-familiar audience. This is a work in progress, and the code is here to invite constructive criticism and improvement, carefully, over time. The current state of the code has not prevented those users who have so far explored it with some time and effort from understanding it sufficiently to provide us with meaningful feedback and discussion. We welcome this, but note we are also under great time pressure at the moment, as I hope everyone will understand.
Finally, we are scientists and programmers. We provide a viewpoint, among many viewpoints, and others make policy decisions on the collection of those viewpoints, among a group of scientists far wider than us. I trust some of you are endeavouring to find out all the other viewpoints being presented in open-source forums, and provide similar critique for those.
I am a professional modeller. For ten years I headed PwC’s modelling group and model review service and then for 5 years I headed IBM’s European modelling group. I must stress, however, that my comments represent my personal opinions and in no way represent the views of the organisations mentioned.
I must agree with the range of comments about the programming of the model.
First, the model has been developed with a very poor standard of documentation, whether external documentation, internal comments within the code, or self-documenting code through the use of appropriate variable names. Seeing variables defined as:
int i, j, k, l, lc, lc2, b, c, n, nf, i2;
without any indication of what they represent, is the worst form of programming practice.
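For contrast, a purely hypothetical example of self-documenting declarations; these names are invented and do not reflect what the repository's variables actually represent.

```cpp
// Purely hypothetical contrast, for illustration only: these names are invented
// and do NOT correspond to what i, j, k, lc, etc. mean in this repository.
// The point is simply that intent becomes readable without a comment.
int place_index, person_index, household_index;
int susceptible_count, infectious_count, new_infections_today;
```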
Second, there appears to be a lack of testing of the model. The inability to produce the same results when seeding the random number generators with the same values is particularly alarming, and suggests some fundamental issues with the program's development. The absence of any documentation of an appropriate testing regime for a model of this importance is also alarming.
Third, there is the choice of language and the monolithic programming style (the main simulation module consists of over 5,400 lines of code). This is not an appropriate choice for a model that will be used for making real-world decisions and will need to be altered often to reflect changing requirements (such as trying out new forms of intervention).
I appreciate that the model was written 15 years ago, but it is being used for making decisions today, and it should have been updated to reflect best practice. (For an example of best practice, I recommend you look at this model https://github.com/neherlab/covid19_scenarios from the Biozentrum of the University of Basel.)
In addition to the programming issues, I would also like to comment on the modelling issues associated with this model. The model consists of over 280 parameters plus a population density file for each run. I know from (bitter) experience that creating consistent datasets of 280 parameters is extremely difficult, and rigorous procedures have to be followed to ensure that appropriate sets of data are created and used, and that they are comprehensively documented and linked both to the version of the model used and to the outputs that are produced. Best practice would use data curators for each run who are responsible for the collection and documentation of the inputs used. I have no idea whether such procedures are followed, but the lack of documentation associated with the dataset supplied does not engender confidence.
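As a sketch of what such run documentation might look like (the file names and manifest format are assumptions for illustration, not anything this repository provides): record the code version alongside checksums of every input and output of a run, so any published figure can be traced back to exact data and code.

```cpp
// Sketch of a run manifest: tie one run's inputs and outputs to a specific
// code version via checksums. File names are illustrative; the commit hash
// is passed in by the caller (e.g. the output of `git rev-parse HEAD`).
#include <cstdint>
#include <fstream>
#include <iostream>
#include <string>
#include <vector>

// Simple FNV-1a checksum of a file's bytes (illustrative stand-in for a real hash).
uint64_t checksum_file(const std::string& path) {
    std::ifstream in(path, std::ios::binary);
    uint64_t h = 1469598103934665603ULL;
    char c;
    while (in.get(c)) {
        h ^= static_cast<unsigned char>(c);
        h *= 1099511628211ULL;
    }
    return h;
}

int main(int argc, char** argv) {
    if (argc < 2) {
        std::cerr << "usage: make_manifest <code-version-or-commit-hash>\n";
        return 2;
    }
    // Hypothetical input and output files for one model run.
    std::vector<std::string> inputs  = {"params.txt", "population_density.txt"};
    std::vector<std::string> outputs = {"results.csv"};

    std::ofstream manifest("run_manifest.txt");
    manifest << "code_version " << argv[1] << "\n";
    for (const auto& f : inputs)
        manifest << "input  " << f << " checksum " << checksum_file(f) << "\n";
    for (const auto& f : outputs)
        manifest << "output " << f << " checksum " << checksum_file(f) << "\n";
    return 0;
}
```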
This leads to a wider question: why are professional modellers not represented on the SAGE modelling group? https://www.gov.uk/government/publications/scientific-advisory-group-for-emergencies-sage-coronavirus-covid-19-response-membership/list-of-participants-of-sage-and-related-sub-groups
There are a number of professional organisations, such as the OR Society https://www.theorsociety.com/ or the Institute and Faculty of Actuaries https://www.actuaries.org.uk/, that could have been approached to provide modellers whose professional experience consists of building models on which critical real-world decisions are made. The group seems to consist entirely of academics, who are, perhaps, unused to their models being used to make real-world decisions and the attendant requirements on the model's development and use. This lack of diversity is disappointing.
Google "define: rigour"
@JohannesBuchner I understand your point and I think you're right; we should help improve things rather than just criticise. One of the pillars of the scientific method is the ability to replicate results, and perhaps this is not possible with a "classic" (to a developer) unit-test or integration-test approach. But as far as I am aware, no study or model is accepted by the scientific community unless it is first peer reviewed. Continuing the parallel between scientific literature and computer science, I like to see peer review as the QA of a scientific publication (based on a mathematical model in this case). As of now there has been no peer review (unless I'm behind), so I understand the initial criticism of @jMyles. This is not a project that was submitted to the open source community to be improved for subsequent scientific modelling; this code has had a huge impact on millions of people's lives and was then submitted to the open source community. Therefore, as the author of such a project, you should expect criticism.
The inability to produce the same results when seeding the random number generators with the same values is particularly alarming, and suggests some fundamental issues with the program’s development.
(At the risk of sounding repetitive), submit an issue reporting your observations so we can work on what you have observed. The code, as we run and test it, is deterministic for given seeds when run single-threaded. Post-initialisation, it is deterministic for a given number of threads. There is one section of a function in SetupModel.cpp, well discussed in other issues, which can produce statistically equivalent stochastic realisations, but not binary-identical ones, when using multi-threading. If determinism is required there, commenting out one line can achieve it.
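For illustration only, a minimal sketch of such a same-seed determinism check against a toy stand-in model; run_model here is hypothetical and is not this repository's entry point.

```cpp
// Sketch of a same-seed determinism check: run a toy simulation twice with the
// same seed (and, implicitly, the same thread count) and require identical
// output. run_model is a hypothetical stand-in for a real simulation run.
#include <cassert>
#include <random>
#include <vector>

std::vector<long> run_model(unsigned seed, int days) {
    std::mt19937 rng(seed);                          // single seeded RNG
    std::vector<long> daily_new_cases(days);
    long infectious = 100;
    for (int t = 0; t < days; ++t) {
        std::binomial_distribution<long> new_cases(infectious, 0.2);
        long added = new_cases(rng);
        infectious += added;
        daily_new_cases[t] = added;
    }
    return daily_new_cases;
}

int main() {
    const unsigned seed = 42;
    std::vector<long> first  = run_model(seed, 60);
    std::vector<long> second = run_model(seed, 60);
    assert(first == second);   // same seed, same configuration => identical results
    return 0;
}
```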
Please state why you believe otherwise, or retract those comments as stating it in this way is not helpful to anyone.
Carmack approves https://twitter.com/ID_AA_Carmack/status/1254872368763277313
Hey GitHub devs: it would be great if I could just tag everyone who reacted in certain ways (not sure I trust current sentiment analysis, so limiting it to people clearly in support of bullshit positions might be fine) to some issue as, like, people whose opinion means nothing to me. Thanks, Chris.
We are extremely grateful for John Carmack's input to the project, and his presence in many of our discussions. We view his comments on Twitter about the code as encouraging testimony, in which he is rather more positive about the code than some of the commenters on this thread.
(Own up, who has just googled for who John Carmack is ;-) )
Regarding peer reviews, again, at risk of sounding repetitive, read the peer reviewed papers - there is a section on the front page readme, and they are very well known prominent papers in epidemiology, covering the scientific algorithms you will want to know about, to understand what this code does and why.
As a positive way forward, perhaps a more charitable issue would be to ask for expanding the test base to include the other tests the authors performed, to add more unit tests, etc. Software engineers like yourself can help tremendously here.
There is very little I could contribute to this code base now that would undo the potentially negligent side effects (if that's the case) of its data on public policy. I'm sure there are plenty of engineers who would have been willing to contribute solid deliverables had they known they could have helped. Instead, what we have here is a small unit of academia working autonomously, and the product it uses to assist public-policy decision-making clearly lacks the quality assurance we would expect of any organization placed in a role of this importance.
Statements such as:
Aggressive language like the title implies is frequently heard as a reason why scientists do not release their code
and
To those who find the code ugly/untidy difficult to understand, we have hardly claimed this to be a thing of beauty, and code in academia rarely is
are being used to grotesquely hand-wave away rigor and professional responsibility.
I find this to be exceptionally sad. Why do scientists (as you call yourselves) get to be free of criticism for not meeting industry expectations? All these statements say to me is that these scientists could do better, but aren't attempting to. Don't place the blame on engineers offering critiques of your work; there are plenty of engineers who would be willing to work with scientists in open-source formats to ensure the quality of this work, especially in the context of public policy.
This project clearly and evidently lacks the rigor that the public deserves when they are told this information is being used in policy.
The inability to produce the same results when seeding the random number generators with the same values is particularly alarming, and suggests some fundamental issues with the program’s development.
(At the risk of sounding repetitive), submit an issue reporting your observations so we can work on what you have observed. The code, as we run and test it, is deterministic for given seeds when run single-threaded. Post-initialisation, it is deterministic for a given number of threads. There is one section of a function in SetupModel.cpp, well discussed in other issues, which can produce statistically equivalent stochastic realisations, but not binary-identical ones, when using multi-threading. If determinism is required there, commenting out one line can achieve it.
Please state why you believe otherwise, or retract those comments as stating it in this way is not helpful to anyone.
That was a comment on issue 116 raised by the Edinburgh team. I agree it does not apply when run single threaded, and retract the comment for the single threaded case.
That was a comment on issue 116 raised by the Edinburgh team. I agree it does not apply when run single threaded, and retract the comment for the single threaded case.
Issue 116 was a regression that was introduced, spotted by private collaborators, and fixed within a four-day cycle, with the issue closed 12 days ago, prior to the public release. This is all very clear from the issue, so I remain confused as to why you felt it necessary to raise it with such alarm today.
Your comment still remains misleading, since multi-threaded determinism holds for given seeds and thread counts, except for one well-discussed caveat in how initialisation network files are created. See issue 161 for a detailed discussion that includes the exact combinations of line numbers where that can occur, and how to avoid it if a deterministic instance of a synthetic population is required. It is a clearly understood issue. If your personal observation differs, then please submit a constructive issue.
Aggressive language like the title implies is frequently heard as a reason why scientists do not release their code. I suspect this is not the outcome you are aiming for here.
Or maybe it will lead to scientists not writing this worthless code in the future, which is an excellent outcome. Would you not be outraged if this was the code running nuclear safety software? Scientists are not magical beings that are above the rest of us. Their code must be held to the same standard as anyone else's, and if they refuse to meet those standards, then maybe we should question whether they are in fact engaging in science at all.
The reason I raised it today was that this was an issue that had not previously been identified by the development team. My concern was not whether it has been corrected, but why it had not previously been identified in testing. I stand by the view that the testing regime should have identified such an issue, and the fact that it did not raises concerns about the testing that was applied.
I am happy to help in any way possible, although I suspect my particular skillset will be of most use in reviewing the modelling and data management processes used in the production of the outputs rather than detailed code inspection.
The testing regime did identify and fix the issue.
The tests in this project, being limited to broad, "smoke test"-style assertions, do not support an assurance that the equations are being executed faithfully in discrete units of logic, nor that they are integrated into the application in such a way that the accepted practices of epidemiology are being modeled in accordance with the standards of that profession.
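To make concrete what a test of a discrete unit of logic could look like, here is a sketch using a textbook deterministic SIR update as a stand-in; this equation and these numbers are not taken from this codebase, and the expected values are hand-computed only for the example.

```cpp
// Example of testing a discrete unit of logic, using a textbook deterministic
// SIR step as a stand-in; this is NOT an equation from this codebase, just an
// illustration of the kind of check meant by "discrete units of logic".
#include <cassert>
#include <cmath>

struct SIR { double s, i, r; };

// One forward-Euler step of dS/dt = -beta*S*I/N, dI/dt = beta*S*I/N - gamma*I, dR/dt = gamma*I.
SIR sir_step(SIR x, double beta, double gamma, double dt) {
    double n = x.s + x.i + x.r;
    double new_infections = beta * x.s * x.i / n * dt;
    double new_recoveries = gamma * x.i * dt;
    return {x.s - new_infections, x.i + new_infections - new_recoveries, x.r + new_recoveries};
}

int main() {
    SIR x{990.0, 10.0, 0.0};
    SIR y = sir_step(x, 0.3, 0.1, 1.0);
    // Population is conserved by the update.
    assert(std::fabs((y.s + y.i + y.r) - 1000.0) < 1e-9);
    // Hand-computed expectations: 0.3*990*10/1000 = 2.97 new infections, 0.1*10*1 = 1.0 recovery.
    assert(std::fabs(y.s - 987.03) < 1e-9);
    assert(std::fabs(y.i - 11.97) < 1e-9);
    assert(std::fabs(y.r - 1.0) < 1e-9);
    return 0;
}
```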
Billions of lives have been disrupted worldwide on the basis that the study produced by the logic contained in this codebase is accurate, and since there are no tests to show that, the findings of this study (and any others based on this codebase) are not a sound basis for public policy at this time.
I want to be clear that this Issue is not meant to denigrate the authors of this code - we've all written code that isn't our best work and code that is untested. But when a codebase is used to craft scholarly publications that are in turn used to influence public policy, the authors of those publications (and ultimately policy) need to ensure that the science is verifiable in a public sense. The lack of tests makes that an impossibility. So closure of this Issue, by retraction of studies based on it, is meant as a critique of the publication and policy authors, not the contributors to this repo.
(Note: in addition to the :+1: emoji, it's probably sensible to "sign" with a comment containing your username, as it becomes impossible to see who reacted above a certain threshold.)