mrc-ide / covid-sim

This is the COVID-19 CovidSim microsimulation model developed by the MRC Centre for Global Infectious Disease Analysis hosted at Imperial College, London.
GNU General Public License v3.0

Some thoughts on I/O robustness #197

Open StevenWhitehouse opened 4 years ago

StevenWhitehouse commented 4 years ago

I've had a quick look through the I/O code for this project, and there are a few things that I'd recommend looking at in order to improve robustness. These mostly revolve around missing error checks in a number of locations.

Looking at binio.cpp, the routines there (fwrite_big and fread_big) appear to date from a time when the C library couldn't cope with 64-bit I/O. Those loops are harmless, but shouldn't be necessary unless there is some 32-bit platform that you still need to support. That is ok in itself of course, but the bigger issue is the missing error checking. If fwrite or fread returns 0, that zero is simply added to the count and returned, so a failed transfer looks the same as a short one.

A quick grep of the calling locations shows that at no point is the return value from fwrite_big or fread_big checked. So if there is an I/O issue, this will not be caught or reported, and it will have to be deduced from the effect that it has on the generated results, which is not an ideal situation!

To avoid clogging up the callers of fread_big/fwrite_big with duplicated error checking code, I'd suggest adding the error checking in these functions themselves, making sure that zero return values are caught and reported with an appropriate error message. Also, the fread/fwrite_big functions might as well return void, since none of the callers check the return code anyway.
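To sketch what I mean (the actual signature in binio.cpp may differ, and aborting on error is just one reasonable policy):

```cpp
#include <cassert>
#include <cstddef>
#include <cstdio>
#include <cstdlib>

// Hypothetical error-checked version of fwrite_big. On a short write it
// reports the error and aborts, and it returns void since no caller
// inspects the count anyway.
void fwrite_big(const void* buffer, size_t size, size_t count, FILE* dest)
{
    size_t written = std::fwrite(buffer, size, count, dest);
    if (written != count)
    {
        std::perror("fwrite_big");
        std::fprintf(stderr, "fwrite_big: wrote %zu of %zu items\n",
                     written, count);
        std::exit(EXIT_FAILURE);
    }
}
```

fread_big would get the same treatment, checking the fread return value against the requested count.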

In addition, the FILE* I/O functions are buffered, which means that on writes, depending on the buffer size and the size of the write, the I/O may not happen immediately, but is temporarily buffered by the C library. As such, error checking for writes also needs to include the subsequent fflush/fclose return codes in order to catch any errors when the actual write syscall occurs. One example I found was in the end of LoadPeopleToPlaces() in SetupModel.cpp. In fact that fflush call is redundant, but harmless, since fclose will do the fflush anyway.

Note also that fclose/fflush don't actually force the data to be written to disk; all they do is flush it from the application to the OS. You'll need to add an fsync to be sure that the data has really been written. Again, best practice is to report any errors from fsync.
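A minimal sketch of the whole sequence, assuming a POSIX platform (on Windows the fsync equivalent is _commit on _fileno):

```cpp
#include <cassert>
#include <cstdio>
#include <unistd.h>   // fsync, fileno (POSIX)

// Flush, sync and close a stream, reporting any failure. Errors deferred
// by C-library buffering or the OS page cache surface here: at fflush,
// fsync and fclose respectively.
bool close_checked(FILE* fp, const char* name)
{
    bool ok = true;
    if (std::fflush(fp) != 0)   { std::perror(name); ok = false; }
    if (fsync(fileno(fp)) != 0) { std::perror(name); ok = false; }
    if (std::fclose(fp) != 0)   { std::perror(name); ok = false; }
    return ok;
}
```

The explicit fflush before fsync isn't strictly required (fclose flushes too), but ordering it this way means fsync sees all the data before the stream is gone.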

While on the robustness theme, there are also a number of calls to sscanf where there is no checking to ensure that all of the arguments (or however many are expected) have been correctly assigned. Lines 75, 82, 849, 856, 862, 868, 872 are examples in SetupModel.cpp. Maybe this would be obvious from the output if something was slightly astray, but I've not looked into this in enough detail to understand the full consequences of any issues at this level. All I'm doing is a quick scan to see if I can spot any problems.
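The fix is cheap: sscanf returns the number of fields it successfully assigned, so comparing that against the expected count catches malformed lines. A sketch with made-up field names (not taken from SetupModel.cpp):

```cpp
#include <cassert>
#include <cstdio>

// Hypothetical two-field line parser for illustration. The point is
// checking sscanf's return value (the number of assigned fields) before
// trusting the output variables.
bool parse_line(const char* line, int* count, double* scale)
{
    if (std::sscanf(line, "%d %lf", count, scale) != 2)
    {
        std::fprintf(stderr, "Malformed input line: \"%s\"\n", line);
        return false;
    }
    return true;
}
```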

Hopefully this is useful feedback... I'm assuming also that this program is cpu bound, so that I/O speed is not critical to the overall performance. If it is more critical, then I could take a closer look and see if I can suggest some improvements there. If it is not critical then keeping things relatively simple as they are at the moment is no doubt the best policy.

Feynstein commented 4 years ago

I am working on that in my fork. Since there are thousands of parameters, I was thinking of putting every parameter into context-specific unordered maps of string to std::any, and adding documentation describing the keys and their definitions.

weatherhead99 commented 4 years ago

I was browsing this also, and it seems to me a simple (possibly regex-based) parser for the various file types, which runs in one pass and puts things into an unordered_map that can then be looked up and queried for required parameters, would be much cleaner and more maintainable than the existing ~1500 lines of code which are there just to read parameters from files.
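Roughly along these lines, assuming the "[Key name]" header-line-then-value-lines layout the parameter files appear to use (I haven't verified this grammar against every input file):

```cpp
#include <cassert>
#include <regex>
#include <sstream>
#include <string>
#include <unordered_map>

// One-pass parser sketch: each "[Key name]" line starts a new entry, and
// subsequent non-empty lines are accumulated as that key's value string.
// Callers would then look up and convert the values they need.
std::unordered_map<std::string, std::string> parse_params(std::istream& in)
{
    std::unordered_map<std::string, std::string> params;
    static const std::regex key_re(R"(\[(.+)\]\s*)");
    std::string line, current_key;
    std::smatch m;
    while (std::getline(in, line))
    {
        if (std::regex_match(line, m, key_re))
        {
            current_key = m[1];
            params[current_key];  // create the entry; values may follow
        }
        else if (!current_key.empty() && !line.empty())
        {
            std::string& v = params[current_key];
            if (!v.empty()) v += ' ';
            v += line;
        }
    }
    return params;
}
```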

Likely it would also be more performant (though clearly this is not a huge part of the program's run time), since the current approach seems to do a lot of backtracking and fseek calls.

Feynstein commented 4 years ago

I laid it all down in #193

StevenWhitehouse commented 4 years ago

Yes, I'd spotted that you were working on what sounds like a significant update. My only caution there is that for this kind of project, I suspect there will be a requirement for changes to be easily reviewable, and made in small steps to enable that. I had also tried to read through a few of the previously opened issues, in order to avoid duplicating things already reported.

There are clearly a lot of things that could be done, but in the interests of making the most impact in the least time, I wonder if there is some kind of prioritised list of things with which particular help is needed. I'd looked at the RNG code too, but it is not clear whether that is a sensible thing to dig into, or whether others are already looking at it.

I started looking at the I/O side of things partly because that is something I do regularly in my day job, so I have a reasonable chance of spotting issues there fairly quickly, and also because nobody else had yet mentioned it. It would be good, though, to figure out what is most important, so that those of us who are looking at the code can do a few things in parallel without (hopefully!) too much duplication of each other's work.

Feynstein commented 4 years ago

Yes, you are right. What I think, however, is that this code is in dire need of a complete C++ overhaul. What I intend to do is simply lay the basis for the scalability of this code. I don't think I'm going to change a lot of the backend-type stuff for now, only the data handling and input loading.

Feynstein commented 4 years ago

And work on the repeatability issue. If I find a fix so that the diff between runs is lower, I think it will be a great step forward.