Saving and loading a model repeatedly causes it to break

melon3r commented 6 years ago

Hi!

I'm feeding data to a model in small batches, saving the model to disk at the end of each batch, and loading it again for the next one. After a few batches, the model stops working and throws the following error when calling model.run(input):

Traceback (most recent call last):
  File "./anomalies.py", line 63, in <module>
    result = model.run(input)
  File "/home/dani/.local/lib/python2.7/site-packages/nupic/frameworks/opf/htm_prediction_model.py", line 448, in run
    inferences = self._anomalyCompute()
  File "/home/dani/.local/lib/python2.7/site-packages/nupic/frameworks/opf/htm_prediction_model.py", line 696, in _anomalyCompute
    self._getAnomalyClassifier().compute()
  File "/home/dani/.local/lib/python2.7/site-packages/nupic/engine/__init__.py", line 433, in compute
    return self._region.compute()
  File "/home/dani/.local/lib/python2.7/site-packages/nupic/bindings/engine_internal.py", line 1499, in compute
    return _engine_internal.Region_compute(self)
  File "/home/dani/.local/lib/python2.7/site-packages/nupic/bindings/regions/PyRegion.py", line 184, in guardedCompute
    return self.compute(inputs, DictReadOnlyWrapper(outputs))
  File "/home/dani/.local/lib/python2.7/site-packages/nupic/regions/knn_anomaly_classifier_region.py", line 326, in compute
    self._classifyState(record)
  File "/home/dani/.local/lib/python2.7/site-packages/nupic/regions/knn_anomaly_classifier_region.py", line 405, in _classifyState
    self._addRecordToKNN(state)
  File "/home/dani/.local/lib/python2.7/site-packages/nupic/regions/knn_anomaly_classifier_region.py", line 490, in _addRecordToKNN
    knn.learn(pattern, category, rowID=rowID)
  File "/home/dani/.local/lib/python2.7/site-packages/nupic/algorithms/knn_classifier.py", line 537, in learn
    inputPattern = numpy.dot(self._vt, inputPattern - self._mean)
ValueError: operands could not be broadcast together with shapes (65536,) (0,)

Here's the code used to load and store the model:

with open(model_file, 'r') as f:
    model = HTMPredictionModel.readFromFile(f)

with open(model_file, 'w') as f:
    model.writeToFile(f)

I've tried using a model generated from a previous batch and skipping some batches of data, to find out if it was the data that was somehow generating a bad model, but after the same number of batches, no matter their contents, I get to a broken model again. Thus, I suspect a bug is being triggered at readFromFile or writeToFile (or maybe I'm just doing it wrong).

This is with Python 2.7.9, and nupic 1.0.3 from pypi.
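
For anyone reading the traceback above: the final ValueError is an ordinary NumPy broadcasting failure, and the message format suggests it comes from the subtraction `inputPattern - self._mean` rather than from `numpy.dot`. A minimal sketch that reproduces the same message, assuming (from the order of the reported shapes) that the pattern has 65536 elements and `self._mean` is the empty operand; the variable names are stand-ins, not NuPIC internals:

```python
import numpy

# Stand-in for the 65536-element input pattern from the traceback.
input_pattern = numpy.zeros(65536)
# Stand-in for a zero-length self._mean (assumed to be the empty operand).
empty_mean = numpy.zeros(0)

message = None
try:
    # Shapes (65536,) and (0,) cannot be broadcast against each other.
    input_pattern - empty_mean
except ValueError as error:
    message = str(error)
    print(message)
```

If that reading is right, the thing to look for is why the classifier ends up with an empty mean vector at that point, not anything about the input data.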

rhyolight commented 6 years ago

Hey @lscheinkman and @scottpurdy, this might be another report similar to #3783.

@melon3r Can you perhaps attach some code we can run to replicate this?

ghost commented 6 years ago

@melon3r Can you try this...it's working fine for our project. We also found that you can compress the binary data here quite a bit...

from nupic.frameworks.opf.htm_prediction_model import HTMPredictionModel

def serialize_htm(htm_model):
    # Create an empty message from the model's schema and let the model fill it.
    proto = HTMPredictionModel.getSchema()
    builder = proto.new_message()
    htm_model.write(builder)
    return builder.to_bytes_packed()  # returns binary data of htm_model

def deserialize_htm(htm_buffer):
    # Parse the packed bytes with the same schema and rebuild the model.
    proto = HTMPredictionModel.getSchema()
    reader = proto.from_bytes_packed(htm_buffer)
    return HTMPredictionModel.read(reader)  # returns htm_model from binary data

Also, there is a minor bug in NuPIC right now (https://github.com/numenta/nupic/issues/3805): if you attempt to serialize and deserialize without processing any samples in between, it will error out.

melon3r commented 6 years ago

Hey @kyle-sorensen, thank you for the tip, but it didn't work out for me. The model breaks at the exact same point.

> @melon3r Can you perhaps attach some code we can run to replicate this?

@rhyolight I'll try to build a small script to reproduce it and share it ;)

rhyolight commented 6 years ago

Thanks @melon3r. Numenta engineer @lscheinkman is working on updating our regression test suite so that we serialize our models in the middle of running the NAB data set, then continue after de-serialization. We hope to see this test fail so we can fix the issue and update the source code. Your script might still be helpful, so please continue with it if you can.

melon3r commented 6 years ago

I found the "issue". :man_facepalming:

Trying to replicate it I found it was always failing at the same record, the 2184th, with this config in the model parameters: 'autoDetectWaitRecords': 2184

I just copied it from the HotGym example, so I don't even understand it... Can you help?
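
For reference, this is roughly how the failing record can be pinned down independently of batch size. It is only a sketch: `StandInModel` is a hypothetical placeholder that fails once a record counter reaches a threshold, and `copy.deepcopy` stands in for the real save/load round trip. With a real model you would pass your own `writeToFile`/`readFromFile` wrapper as `roundtrip`.

```python
import copy


class StandInModel(object):
    """Hypothetical placeholder: raises once `wait_records` records are seen."""

    def __init__(self, wait_records):
        self.wait_records = wait_records
        self.seen = 0

    def run(self, record):
        self.seen += 1
        if self.seen >= self.wait_records:
            raise ValueError("stand-in failure at record %d" % self.seen)
        return record


def find_first_failure(model, records, checkpoint_every, roundtrip):
    """Return (1-based index, error) of the first record whose run() raises.

    The model is passed through `roundtrip` (save + load) after every
    `checkpoint_every` records. Returns (None, None) if nothing fails.
    """
    for index, record in enumerate(records, start=1):
        try:
            model.run(record)
        except Exception as error:
            return index, error
        if index % checkpoint_every == 0:
            model = roundtrip(model)
    return None, None


# Two different checkpoint intervals should report the same failing record
# if the failure depends on the record count and not on batch boundaries.
index_a, _ = find_first_failure(StandInModel(2184), range(5000), 100, copy.deepcopy)
index_b, _ = find_first_failure(StandInModel(2184), range(5000), 7, copy.deepcopy)
print(index_a, index_b)
```

Getting the same index for different checkpoint intervals points at a record-count threshold rather than at the data or at where the save happens.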

rhyolight commented 6 years ago

@melon3r Can you try either removing it from the configuration or (if that doesn't work) making it extremely large? Then try again? If it works at least we know what to fix.

melon3r commented 6 years ago

Hi @rhyolight,

Removing it from the configuration gave it a default value of 4000. I could configure it to be very high, but I don't think that's how it's supposed to be run in production? Are models not supposed to run indefinitely?

What's this configuration actually doing? Debugging the error I found that after processing this number of records, flow changes and it starts doing something with a knn anomaly classification region, which it didn't before. What's the difference between the process before and after this threshold is reached?

rhyolight commented 6 years ago

It has to do with something unrelated to HTM. It is a legacy setting that is just causing trouble, and we should remove it. It is not affecting how the HTM runs, it's just expressing a bug. Set it to 999999999.
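
A minimal sketch of that override, assuming the setting sits under `modelParams` -> `anomalyParams` as it does in the HotGym anomaly example's parameters. The helper name `with_anomaly_wait` and the trimmed dictionary are made up for illustration; adjust the nesting if your parameters file is laid out differently.

```python
import copy


def with_anomaly_wait(model_params, wait_records=999999999):
    """Return a copy of model_params with autoDetectWaitRecords overridden.

    Assumes the key lives at modelParams -> anomalyParams; missing or
    None levels are created as empty dicts.
    """
    params = copy.deepcopy(model_params)
    model_section = params.setdefault("modelParams", {})
    anomaly_section = model_section.get("anomalyParams") or {}
    anomaly_section["autoDetectWaitRecords"] = wait_records
    model_section["anomalyParams"] = anomaly_section
    return params


# Trimmed, illustrative parameters; a real file has many more entries.
hotgym_like_params = {
    "modelParams": {
        "anomalyParams": {
            "anomalyCacheRecords": None,
            "autoDetectThreshold": None,
            "autoDetectWaitRecords": 2184,
        },
    },
}

patched_params = with_anomaly_wait(hotgym_like_params)
print(patched_params["modelParams"]["anomalyParams"]["autoDetectWaitRecords"])
```

The helper copies the dictionary first, so the original parameters stay untouched and the override can be applied just before the model is created.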

melon3r commented 6 years ago

Alright, thanks. 999999999 makes for about 1900 years of records at one record per minute, so I guess it'll be good :)
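
The back-of-the-envelope check, assuming one record per minute and a 365.25-day year:

```python
# Minutes in an average year (365.25 days).
MINUTES_PER_YEAR = 60 * 24 * 365.25

wait_records = 999999999  # the value suggested for autoDetectWaitRecords

# At one record per minute, this many records spans this many years.
years = wait_records / MINUTES_PER_YEAR
print("%.1f years" % years)  # prints "1901.3 years"
```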

rhyolight commented 6 years ago

@lscheinkman found that this was still happening when he started writing more tests for https://github.com/numenta/nupic/issues/3808.

numenta / nupic-legacy

Saving and loading a model repeatedly causes it to break #3820