numenta / nupic-legacy

Numenta Platform for Intelligent Computing is an implementation of Hierarchical Temporal Memory (HTM), a theory of intelligence based strictly on the neuroscience of the neocortex.
http://numenta.org/
GNU Affero General Public License v3.0

Swarm fails when swarm config uses extra metrics #3804

Closed: JonnoFTW closed this issue 6 years ago

JonnoFTW commented 6 years ago

When I run a swarm with an extra metric, for example, by adding this to my swarm config:

"metrics": [{
      "field": "flow",
      "metric": "multiStep",
      "inferenceElement": "multiStepBestPredictions",
      "params": {
        "errorMetric": "rmse",
        "window": 100,
        "steps": [1,2]
      }
}]

I get the following error:

Traceback (most recent call last):
  File "./swarm.py", line 10, in <module>
    model_params = permutations_runner.runWithConfig(swarm_config, {'maxWorkers': 2, 'overwrite': True},  verbosity=1)
  File "/scratch/mack0242/.pyenv/versions/2.7.12/lib/python2.7/site-packages/nupic/swarming/permutations_runner.py", line 257, in runWithConfig
    _generateExpFilesFromSwarmDescription(swarmConfig, outDir)
  File "/scratch/mack0242/.pyenv/versions/2.7.12/lib/python2.7/site-packages/nupic/swarming/permutations_runner.py", line 188, in _generateExpFilesFromSwarmDescription
    "--outDir=%s" % (outDir)])
  File "/scratch/mack0242/.pyenv/versions/2.7.12/lib/python2.7/site-packages/nupic/swarming/exp_generator/experiment_generator.py", line 2026, in expGenerator
    claDescriptionTemplateFile = options.claDescriptionTemplateFile)
  File "/scratch/mack0242/.pyenv/versions/2.7.12/lib/python2.7/site-packages/nupic/swarming/exp_generator/experiment_generator.py", line 190, in _handleDescriptionOption
    claDescriptionTemplateFile = claDescriptionTemplateFile)
  File "/scratch/mack0242/.pyenv/versions/2.7.12/lib/python2.7/site-packages/nupic/swarming/exp_generator/experiment_generator.py", line 1573, in _generateExperiment
    _generateMetricsSubstitutions(options, tokenReplacements)
  File "/scratch/mack0242/.pyenv/versions/2.7.12/lib/python2.7/site-packages/nupic/swarming/exp_generator/experiment_generator.py", line 1632, in _generateMetricsSubstitutions
    metricList, optimizeMetricLabel = _generateMetricSpecs(options)
  File "/scratch/mack0242/.pyenv/versions/2.7.12/lib/python2.7/site-packages/nupic/swarming/exp_generator/experiment_generator.py", line 1676, in _generateMetricSpecs
    metricSpecStrings.extend(_generateExtraMetricSpecs(options))
  File "/scratch/mack0242/.pyenv/versions/2.7.12/lib/python2.7/site-packages/nupic/swarming/exp_generator/experiment_generator.py", line 1866, in _generateExtraMetricSpecs
    for propertyName in _metricSpecSchema['properties'].keys():
NameError: global name '_metricSpecSchema' is not defined

Additionally I'd like to:

rhyolight commented 6 years ago

Looks like a bug in this function:

https://github.com/numenta/nupic/blob/master/src/nupic/swarming/exp_generator/experiment_generator.py#L1859-L1882

_metricSpecSchema is declared global but never defined outside the _generateExtraMetricSpecs function itself. In fact, I don't see that variable name used anywhere else in the entire codebase, but it is expected to be a dict-like object with a properties key.
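
For context, here is a minimal standalone sketch (hypothetical names, not the NuPIC source) of the failure pattern: the bad reference lives inside the function body, so Python only raises the NameError when that code path actually runs, not at import time.

# Minimal reproduction of the pattern (hypothetical names, not NuPIC code).
def broken():
    global _undefinedSchema            # declared global, but the module never defines it
    return _undefinedSchema['properties'].keys()

try:
    broken()
except NameError as e:
    print(e)  # "global name '_undefinedSchema' is not defined" under Python 2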

Stupid fix might be to add to top of this file:

global _metricSpecSchema
_metricSpecSchema = {}
_metricSpecSchema['properties'] = {}

I dunno. @scottpurdy any suggestions?

scottpurdy commented 6 years ago

I'm not sure but it is certainly a bug. I think your proposed fix will likely work. @JonnoFTW do you have a branch that you can share that exhibits this problem so we can validate the change?

JonnoFTW commented 6 years ago

It fixes that bug, but now I hit the issue where jobInfo.results is None. My code looks like this:

from nupic.swarming import permutations_runner
import json
import logging
logging.basicConfig()
if __name__ == "__main__":
    with open('swarm_config.json', 'r') as conf:
        swarm_config = json.load(conf)
    model_params = permutations_runner.runWithConfig(swarm_config, {'maxWorkers': 1, 'overwrite': True},  verbosity=2)
    print(json.dumps(model_params))

I get the error:

Generating experiment files in directory: /tmp/tmp3Lmcl3...
Writing  314 lines...
Writing  114 lines...
done.
None
json.loads(jobInfo.results) raised an exception.  Here is some info to help with debugging:
jobInfo:  _jobInfoNamedTuple(jobId=1022, client=u'GRP', 
clientInfo=u'', clientKey=u'', 
cmdLine=u'$HYPERSEARCH', 
params=u'{"hsVersion": "v2", 
  "maxModels": null, "persistentJobGUID": "e8c4caf6-1d07-11e8-afb1-3417ebcbdfa4", 
   "useTerminators": false, 
   "description": {
     "inferenceArgs": {"predictionSteps": [1, 2], "predictedField": "flow"}, 
    "iterationCount": -1,
    "swarmSize": "medium",
    "metrics": [{"field": "flow", "metric": "multiStep", "inferenceElement": "multiStepBestPredictions", "logged": true, "params": {"window": 100, "steps": [1, 2], "errorMetric": "rmse"}}], 
   "includedFields": [{"fieldName": "datetime", "fieldType": "datetime"},
                {"minValue": 0, "fieldName": "cycle_time", "fieldType": "int", 
"maxValue": 190}, {"minValue": 0, "fieldName": "flow", "fieldType": "int", "maxValue": 100}], 
"streamDef": {"info": "StrategicInput flow per phase", "version": 1,
 "streams": [{"info": "Traffic flow through intersection 113 SI 108", "source": "file:///scratch/Dropbox/PhD/htm_models_adelaide/engine/sm_model/swarm_data/113_108.csv", 
"columns": ["datetime", "flow", "cycle_time"], "last_record": 500}]}, 
"inferenceType": "TemporalMultiStep",
 "customErrorMetric": {"field": "flow", "metric": "multiStep", "inferenceElement": "multiStepBestPredictions", "logged": true,
 "params": {"window": 100, "steps": [1, 2], "errorMetric": "rmse"}}}}', 
jobHash='\xe8\xc4\xd1\xd6\x1d\x07\x11\xe8\xaf\xb14\x17\xeb\xcb\xdf\xa4', status=u'notStarted', 
completionReason=None, completionMsg=None, workerCompletionReason=u'success', 
workerCompletionMsg=None, cancel=0, startTime=None, endTime=None, results=None, 
engJobType=u'hypersearch', minimumWorkers=1, maximumWorkers=1, priority=0, 
engAllocateNewWorkers=1, engUntendedDeadWorkers=0, numFailedWorkers=0, 
lastFailedWorkerErrorMsg=None, engCleaningStatus=u'notdone', genBaseDescription=None, 
genPermutations=None, engLastUpdateTime=datetime.datetime(2018, 3, 1, 4, 20, 44), 
engCjmConnId=None, engWorkerState=None, engStatus=None, engModelMilestones=None)
jobInfo.results:  None
EXCEPTION:  expected string or buffer
Traceback (most recent call last):
  File "/scratch/Dropbox/PhD/htm_models_adelaide/engine/sm_model/swarm.py", line 12, in <module>
    model_params = permutations_runner.runWithConfig(swarm_config, {'maxWorkers': 1, 'overwrite': True},  verbosity=2)
  File "/scratch/nupic/nupic/src/nupic/swarming/permutations_runner.py", line 271, in runWithConfig
    return _runAction(runOptions)
  File "/scratch/nupic/nupic/src/nupic/swarming/permutations_runner.py", line 212, in _runAction
    returnValue = _runHyperSearch(runOptions)
  File "/scratch/nupic/nupic/src/nupic/swarming/permutations_runner.py", line 155, in _runHyperSearch
    metricsKeys=search.getDiscoveredMetricsKeys())
  File "/scratch/nupic/nupic/src/nupic/swarming/permutations_runner.py", line 822, in generateReport
    results = json.loads(jobInfo.results)
  File "/home/CSEM/mack0242/.pyenv/versions/2.7.12/lib/python2.7/json/__init__.py", line 339, in loads
    return _default_decoder.decode(s)
  File "/home/CSEM/mack0242/.pyenv/versions/2.7.12/lib/python2.7/json/decoder.py", line 364, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
TypeError: expected string or buffer
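
The crash itself is just json.loads being handed None: generateReport reads jobInfo.results before any model ever wrote results. As a sketch, a defensive guard (hypothetical, not what permutations_runner does today) would look something like:

import json

def safeLoadResults(jobInfo):
    # jobInfo.results stays None when the hypersearch produced no models,
    # e.g. because the worker process never actually ran.
    if jobInfo.results is None:
        raise RuntimeError("Hypersearch job %s finished with no results; "
                           "check that the worker actually started." % jobInfo.jobId)
    return json.loads(jobInfo.results)
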
rhyolight commented 6 years ago

OMG I hate this bug. I thought I had fixed it several times in the past.

@JonnoFTW Is there any way you can isolate this and share your swarm def and script so I can try to replicate it?

JonnoFTW commented 6 years ago

I installed nupic from the source folder with python setup.py develop (using install doesn't fix it either), with this fix applied to experiment_generator.py at line 1861:

 _metricSpecSchema = {'properties': {}}

My swarm_config.json is:

{
  "includedFields": [
    {
      "fieldName": "datetime",
      "fieldType": "datetime"
    },
    {
      "fieldName": "cycle_time",
      "fieldType": "int",
      "maxValue": 190,
      "minValue": 0
    },
    {
      "fieldName": "flow",
      "fieldType": "int",
      "maxValue": 100,
      "minValue": 0
    }
  ],
  "inferenceType": "TemporalMultiStep",
  "inferenceArgs": {
    "predictionSteps": [
      1, 2
    ],
    "predictedField": "flow"
  },
  "streamDef": {
    "info": "flow per phase",
    "streams": [
      {
        "columns": [
          "datetime",
          "flow",
          "cycle_time"
        ],
        "last_record": 500,
        "info": "Traffic flow through intersection 113 SI 108",
        "source": "file:///scratch/Dropbox/PhD/htm_models_adelaide/engine/sm_model/swarm_data/113_108.csv"
      }
    ],
    "version": 1
  },
  "customErrorMetric":{
      "field": "flow",
      "metric": "multiStep",
      "logged": true,
      "inferenceElement": "multiStepBestPredictions",
      "params": {
        "errorMetric": "rmse",
        "window": 100,
        "steps": [1,2]
      }
},
  "iterationCount": -1,
  "swarmSize": "medium",
  "metrics": [{
      "field": "flow",
      "metric": "multiStep",
      "logged": true,
      "inferenceElement": "multiStepBestPredictions",
      "params": {
        "errorMetric": "rmse",
        "window": 100,
        "steps": [1,2]
      }
}]
}
rhyolight commented 6 years ago

Thanks, I'll take a look at this.

rhyolight commented 6 years ago

What command are you running to start the swarm, and from what directory?

JonnoFTW commented 6 years ago

I'm using the following script from my /scratch/Dropbox directory:

from nupic.swarming import permutations_runner
import os
import json
import logging
logging.basicConfig()
if __name__ == "__main__":
    with open('swarm_config.json', 'r') as conf:
        swarm_config = json.load(conf)
    model_params = permutations_runner.runWithConfig(swarm_config, {'maxWorkers': 1, 'overwrite': True},  verbosity=2)
rhyolight commented 6 years ago

I also need a sample of your data file file:///scratch/Dropbox/PhD/htm_models_adelaide/engine/sm_model/swarm_data/113_108.csv.

JonnoFTW commented 6 years ago

I sent you a link to the data on discourse because it is private.

rhyolight commented 6 years ago

My swarm test has run 48 models so far with no problems. Still waiting for the error. Did you get it right away?

The only change I made was this:

global _metricSpecSchema
_metricSpecSchema = {}
_metricSpecSchema['properties'] = {}
JonnoFTW commented 6 years ago

I only used this, since there's no reason to declare it global:

_metricSpecSchema = {'properties': {}}
rhyolight commented 6 years ago

I ran your script, changing only the data file path, and it was successful:

Field Contributions:
{   u'cycle_time': 100.0,
    u'datetime_dayOfWeek': 100.0,
    u'datetime_timeOfDay': 100.0,
    u'datetime_weekend': 100.0,
    u'flow': 100.0}

Best results on the optimization metric multiStepPredictions:multiStep:errorMetric='custom_error_metric':errorWindow=1000:field=flow:inferenceElement=multiStepBestPredictions:logged=True:metric=multiStep:params={u'window': 100, u'steps': [1, 2], u'errorMetric': u'rmse'}:steps=[1, 2]:field=flow (maximize=False):
[0] Experiment _NupicModelInfo(jobID=1001, modelID=1006, status=completed, completionReason=eof, updateCounter=16, numRecords=500) (modelParams|tmParams|minThreshold_11.modelParams|tmParams|activationThreshold_14.modelParams|tmParams|pamLength_3.modelParams|clParams|alpha_0.05005.modelParams|spParams|synPermInactiveDec_0.05015.modelParams|sensorParams|encoders|_classifierInput|n_275.modelParams|sensorParams|encoders|cycle_time:n_272):
  multiStepPredictions:multiStep:errorMetric='custom_error_metric':errorWindow=1000:field=flow:inferenceElement=multiStepBestPredictions:logged=True:metric=multiStep:params={u'window': 100, u'steps': [1, 2], u'errorMetric': u'rmse'}:steps=[1, 2]:field=flow:  0

Total number of Records processed: 30500

Total wall time for all models: 626

Generating description files for top 1 models...
Generating description file for model 1006 at /Users/mtaylor/nta/nupic/scratch/model_0
Generating model params file...

Report csv saved in /Users/mtaylor/nta/nupic/scratch/default_Report.csv
Elapsed time (h:mm:ss): 0:10:32
Hypersearch ClientJobs job ID:  1001

I'll try with only the change you just mentioned...

rhyolight commented 6 years ago

Jonathan, what version of NuPIC are you running? I'm using the tip of master.

rhyolight commented 6 years ago

This seemed to work for me as well. Are you sure your local swarm description file is properly formed?

JonnoFTW commented 6 years ago

I'm running from master with that change to the metric schema. Reformatting my swarm_config does not fix it. I set verbosity=3, and got the following:

/home/CSEM/mack0242/.pyenv/versions/2.7.12/bin/python /scratch/Dropbox/PhD/htm_models_adelaide/engine/sm_model/swarm.py
/scratch/nupic/nupic/
Generating experiment files in directory: /scratch/Dropbox/PhD/htm_models_adelaide/engine/sm_model...
Writing  314 lines...
Writing  114 lines...
done.
None
_NupicJob: 
_jobInfoNamedTuple(jobId=1040, client=u'GRP', clientInfo=u'', clientKey=u'', cmdLine=u'$HYPERSEARCH', params=u'{"hsVersion": "v2", "maxModels": null, "persistentJobGUID": "ef845d1e-219b-11e8-a5fc-3417ebcbdfa4", "useTerminators": false, "description": {"inferenceArgs": {"predictionSteps": [1, 2], "predictedField": "flow"}, "iterationCount": -1, "swarmSize": "medium", "includedFields": [{"fieldName": "datetime", "fieldType": "datetime"}, {"minValue": 0, "fieldName": "cycle_time", "fieldType": "int", "maxValue": 190}, {"minValue": 0, "fieldName": "flow", "fieldType": "int", "maxValue": 100}], "streamDef": {"info": "StrategicInput flow per phase", "version": 1, "streams": [{"info": "Traffic flow through intersection 113 SI 108", "source": "file:///scratch/Dropbox/PhD/htm_models_adelaide/engine/sm_model/swarm_data/113_108.csv", "columns": ["datetime", "flow", "cycle_time"], "last_record": 500}]}, "inferenceType": "TemporalMultiStep"}}', jobHash='\xef\x84d0!\x9b\x11\xe8\xa5\xfc4\x17\xeb\xcb\xdf\xa4', status=u'notStarted', completionReason=None, completionMsg=None, workerCompletionReason=u'success', workerCompletionMsg=None, cancel=0, startTime=None, endTime=None, results=None, engJobType=u'hypersearch', minimumWorkers=1, maximumWorkers=1, priority=0, engAllocateNewWorkers=1, engUntendedDeadWorkers=0, numFailedWorkers=0, lastFailedWorkerErrorMsg=None, engCleaningStatus=u'notdone', genBaseDescription=None, genPermutations=None, engLastUpdateTime=datetime.datetime(2018, 3, 7, 0, 10, 25), engCjmConnId=None, engWorkerState=None, engStatus=None, engModelMilestones=None)
Successfully submitted new HyperSearch job, jobID=1040
Each worker executing the command line: python -m nupic.swarming.hypersearch_worker --jobID=1040
JobStatus: 
_jobInfoNamedTuple(jobId=1040, client=u'GRP', clientInfo=u'', clientKey=u'', cmdLine=u'$HYPERSEARCH', params=u'{"hsVersion": "v2", "maxModels": null, "persistentJobGUID": "ef845d1e-219b-11e8-a5fc-3417ebcbdfa4", "useTerminators": false, "description": {"inferenceArgs": {"predictionSteps": [1, 2], "predictedField": "flow"}, "iterationCount": -1, "swarmSize": "medium", "includedFields": [{"fieldName": "datetime", "fieldType": "datetime"}, {"minValue": 0, "fieldName": "cycle_time", "fieldType": "int", "maxValue": 190}, {"minValue": 0, "fieldName": "flow", "fieldType": "int", "maxValue": 100}], "streamDef": {"info": "StrategicInput flow per phase", "version": 1, "streams": [{"info": "Traffic flow through intersection 113 SI 108", "source": "file:///scratch/Dropbox/PhD/htm_models_adelaide/engine/sm_model/swarm_data/113_108.csv", "columns": ["datetime", "flow", "cycle_time"], "last_record": 500}]}, "inferenceType": "TemporalMultiStep"}}', jobHash='\xef\x84d0!\x9b\x11\xe8\xa5\xfc4\x17\xeb\xcb\xdf\xa4', status='running', completionReason=None, completionMsg=None, workerCompletionReason=u'success', workerCompletionMsg=None, cancel=0, startTime=None, endTime=None, results=None, engJobType=u'hypersearch', minimumWorkers=1, maximumWorkers=1, priority=0, engAllocateNewWorkers=1, engUntendedDeadWorkers=0, numFailedWorkers=0, lastFailedWorkerErrorMsg=None, engCleaningStatus=u'notdone', genBaseDescription=None, genPermutations=None, engLastUpdateTime=datetime.datetime(2018, 3, 7, 0, 10, 25), engCjmConnId=None, engWorkerState=None, engStatus=None, engModelMilestones=None)
Current number of models is 0 (0 of them completed)
JobStatus: 
_jobInfoNamedTuple(jobId=1040, client=u'GRP', clientInfo=u'', clientKey=u'', cmdLine=u'$HYPERSEARCH', params=u'{"hsVersion": "v2", "maxModels": null, "persistentJobGUID": "ef845d1e-219b-11e8-a5fc-3417ebcbdfa4", "useTerminators": false, "description": {"inferenceArgs": {"predictionSteps": [1, 2], "predictedField": "flow"}, "iterationCount": -1, "swarmSize": "medium", "includedFields": [{"fieldName": "datetime", "fieldType": "datetime"}, {"minValue": 0, "fieldName": "cycle_time", "fieldType": "int", "maxValue": 190}, {"minValue": 0, "fieldName": "flow", "fieldType": "int", "maxValue": 100}], "streamDef": {"info": "StrategicInput flow per phase", "version": 1, "streams": [{"info": "Traffic flow through intersection 113 SI 108", "source": "file:///scratch/Dropbox/PhD/htm_models_adelaide/engine/sm_model/swarm_data/113_108.csv", "columns": ["datetime", "flow", "cycle_time"], "last_record": 500}]}, "inferenceType": "TemporalMultiStep"}}', jobHash='\xef\x84d0!\x9b\x11\xe8\xa5\xfc4\x17\xeb\xcb\xdf\xa4', status='completed', completionReason=None, completionMsg=None, workerCompletionReason=u'success', workerCompletionMsg=None, cancel=0, startTime=None, endTime=None, results=None, engJobType=u'hypersearch', minimumWorkers=1, maximumWorkers=1, priority=0, engAllocateNewWorkers=1, engUntendedDeadWorkers=0, numFailedWorkers=0, lastFailedWorkerErrorMsg=None, engCleaningStatus=u'notdone', genBaseDescription=None, genPermutations=None, engLastUpdateTime=datetime.datetime(2018, 3, 7, 0, 10, 25), engCjmConnId=None, engWorkerState=None, engStatus=None, engModelMilestones=None)
Current number of models is 0 (0 of them completed)
Evaluated 0 models
HyperSearch finished!
JobStatus: 
_jobInfoNamedTuple(jobId=1040, client=u'GRP', clientInfo=u'', clientKey=u'', cmdLine=u'$HYPERSEARCH', params=u'{"hsVersion": "v2", "maxModels": null, "persistentJobGUID": "ef845d1e-219b-11e8-a5fc-3417ebcbdfa4", "useTerminators": false, "description": {"inferenceArgs": {"predictionSteps": [1, 2], "predictedField": "flow"}, "iterationCount": -1, "swarmSize": "medium", "includedFields": [{"fieldName": "datetime", "fieldType": "datetime"}, {"minValue": 0, "fieldName": "cycle_time", "fieldType": "int", "maxValue": 190}, {"minValue": 0, "fieldName": "flow", "fieldType": "int", "maxValue": 100}], "streamDef": {"info": "StrategicInput flow per phase", "version": 1, "streams": [{"info": "Traffic flow through intersection 113 SI 108", "source": "file:///scratch/Dropbox/PhD/htm_models_adelaide/engine/sm_model/swarm_data/113_108.csv", "columns": ["datetime", "flow", "cycle_time"], "last_record": 500}]}, "inferenceType": "TemporalMultiStep"}}', jobHash='\xef\x84d0!\x9b\x11\xe8\xa5\xfc4\x17\xeb\xcb\xdf\xa4', status='completed', completionReason=None, completionMsg=None, workerCompletionReason=u'success', workerCompletionMsg=None, cancel=0, startTime=None, endTime=None, results=None, engJobType=u'hypersearch', minimumWorkers=1, maximumWorkers=1, priority=0, engAllocateNewWorkers=1, engUntendedDeadWorkers=0, numFailedWorkers=0, lastFailedWorkerErrorMsg=None, engCleaningStatus=u'notdone', genBaseDescription=None, genPermutations=None, engLastUpdateTime=datetime.datetime(2018, 3, 7, 0, 10, 25), engCjmConnId=None, engWorkerState=None, engStatus=None, engModelMilestones=None)
Worker completion message: None

This makes me suspect that it's an environment error on my end when the script executes:

python -m nupic.swarming.hypersearch_worker --jobID=1040
JonnoFTW commented 6 years ago

Okay, I fixed it by not running swarm.py from inside PyCharm. PyCharm didn't have my regular environment (namely pyenv), so the python -m call used the wrong Python. Running from a regular terminal worked just fine.
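
For anyone hitting the same thing, a quick sanity check is to compare the interpreter your script runs under with whatever python the worker subprocess would pick up from PATH (a throwaway sketch, not part of NuPIC):

# Compare the interpreter running this script with the `python` on PATH
# that a bare `python -m nupic.swarming.hypersearch_worker` call would use.
import subprocess
import sys

print("This script runs under: %s" % sys.executable)
print("`python` on PATH is:     %s" % subprocess.check_output(
    ["python", "-c", "import sys; print(sys.executable)"]).strip())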

Models are running fine with my updated metrics, although it would be nice to have the option of skipping the default aae and altMAPE error scores to save time. I'm happy that the generated permutations.py uses minimize = ".*custom_error_metric.*", but the metrics field in description.py will contain a duplicate if I use both customErrorMetric and metrics in my swarm description.

Perhaps it would be useful to have the stderr from the hypersearch_worker process forwarded to the user's stderr, to make debugging easier when verbosity=3?
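
For illustration, a sketch of that idea, assuming the worker is launched with subprocess (the exact launch site inside nupic.swarming is an assumption here):

import subprocess
import sys

# Hypothetical sketch: capture the worker's stderr and surface it on the
# user's own stderr so worker-side tracebacks are visible at high verbosity.
proc = subprocess.Popen(
    [sys.executable, "-m", "nupic.swarming.hypersearch_worker", "--jobID=1040"],
    stderr=subprocess.PIPE)
_, err = proc.communicate()
if err:
    sys.stderr.write(err)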

rhyolight commented 6 years ago

Perhaps it would be useful to have the stderr from the hypersearch_worker process forwarded to the user's stderr, to make debugging easier when verbosity=3?

Sounds like a good idea.