numenta / nupic.core-legacy

Implementation of core NuPIC algorithms in C++ (under construction)
http://numenta.org
GNU Affero General Public License v3.0
272 stars 276 forks source link

Corrupted model images result from many routines that write/flush/save to file and don't propagate exception on error #868

Open vitaly-krugl opened 8 years ago

vitaly-krugl commented 8 years ago

I/O errors when saving to files and loading from files must trigger exceptions that users of the code can catch and react to. Users include python nupic clients that access nupic.core through swig bindings.

For example, htmengine (from numenta-apps) has logic that saves the model to a "scratch" directory first, and if the save is successful, then links to it from the permanent model checkpoint entry. This mechanism is designed to prevent a bad model checkpoint from replacing a previous good model checkpoint. However, if the underlying code (e.g., nupic.core) doesn't raise an exception when an error occurs during saving of the model, it undermines htmengine's ability to protect the integrity of saved model checkpoints.

If you git grep close in nupic.core/src/nupic/, you will find a number of functions and classes that save the respective object to a file as part of serialization of one form or another. What most of them have in common is that they don't propagate write/flush/close errors as exceptions. This leads to corrupted images of serialized objects that are silently ignored.

Also, some of them only close the files in destructors, so that even if the close (which performs the final flush) were to raise an exception, it would be suppressed by the C++ runtime. So, flushing/closing needs to be performed before the destructor in order to detect and communicate the error to the user.

One example of when this happens is when a disk is running out of space. Subsequent attempts to load from such images result in errors due to the corrupted images. You might get errors like these when loading from such corrupted images:

...te-packages/nupic-0.3.4-py2.7.egg/nupic/research/TP10X2.py", line 256, in loadFromFile
    self.cells4.loadFromFile(filePath)
  File "/opt/numenta/anaconda/lib/python2.7/site-packages/nupic.bindings-0.2.1-py2.7-linux-x86_64.egg/nupic/bindings/algorithms.py", line 2108, in loadFromFile
    return _algorithms.Cells4_loadFromFile(self, *args, **kwargs)
RuntimeError: ASSERTION FAILED: "invariants()"
...ite-packages/nupic-0.3.4-py2.7.egg/nupic/research/TP10X2.py", line 256, in loadFromFile
    self.cells4.loadFromFile(filePath)
  File "/opt/numenta/anaconda/lib/python2.7/site-packages/nupic.bindings-0.2.1-py2.7-linux-x86_64.egg/nupic/bindings/algorithms.py", line 2108, in loadFromFile
    return _algorithms.Cells4_loadFromFile(self, *args, **kwargs)
RuntimeError: CHECK FAILED: "marker == "out""

In nupic.core, some code uses the C++ file streams and other code uses the legacy C files streams (fopen/fwrite/fflush/fclose). These need to be dealt with differently. The C++ file streams need to be configured to raise an exception on certain errors. The legacy C stream errors need to be handled by checking for error after each of the fopen/fread/fwrite/fflush/fclose/etc. operations and raising an appropriate exception. To make it easier, some utility code should be created and used throughout, so that everyone wouldn't need to be responsible to writing correct error-handling logic for these operations. Also, it's important to note again that exceptions raised from destructors would likely be suppressed by the C++ runtime, so the flush/close operations need to be performed explicitly outside of the scope of any destructors. Here is an example of code that uses C++ file streams in a manner that is compatible with this discussion:

From https://github.com/numenta/nupic.core/blob/f316d6a6061667729fe0a24f41f517cd0ef8c74e/src/nupic/algorithms/Cells4.cpp#L1937-L1948:

/**
 * Save the state to the given file 
 */
void Cells4::saveToFile(std::string filePath) const
{
  OFStream outStream(filePath.c_str(), std::ios_base::out | std::ios_base::binary);
  // Request std::ios_base::failure exception upon logical or physical i/o error
  outStream.exceptions(std::ifstream::failbit | std::ifstream::badbit);

  outStream.precision(std::numeric_limits<double>::digits10 + 1);
  save(outStream);

  // Explicitly close the stream so that we may get an exception on error
  outStream.close();
}
rhyolight commented 8 years ago

Very well-reported and thoroughly investigated issue report, @vitaly-krugl!

africanamerican_girl_thumbs_up23