melodyguan / enas

TensorFlow Code for paper "Efficient Neural Architecture Search via Parameter Sharing"
https://arxiv.org/abs/1802.03268
Apache License 2.0
1.58k stars, 390 forks

Errors in attempt at reproducing micro search #26

Open ahundt opened 6 years ago

ahundt commented 6 years ago

I tried running micro search on TF 1.7. It made quite a bit of progress, up to 150 epochs, but then it failed as follows:

[1 2 1 1 1 3 0 2 2 0 1 1 1 1 1 4 1 4 1 4]
val_acc=0.7750
--------------------------------------------------------------------------------
[0 0 1 0 0 4 0 1 0 4 1 1 1 4 0 1 0 1 5 2]
[0 1 1 0 1 1 1 0 1 2 1 3 1 0 3 3 1 0 2 4]
val_acc=0.6813
--------------------------------------------------------------------------------
[0 1 1 0 0 0 0 0 0 0 1 1 4 0 0 0 0 0 1 1]
[1 0 1 2 1 1 1 1 1 0 1 3 3 0 2 0 1 0 1 1]
val_acc=0.7312
--------------------------------------------------------------------------------
[0 1 0 4 0 0 0 2 1 0 1 3 1 0 3 0 1 1 1 1]
[1 0 1 0 1 1 1 1 1 4 1 1 1 1 1 0 3 4 1 4]
val_acc=0.7188
--------------------------------------------------------------------------------
[0 0 0 2 1 0 1 0 1 4 0 3 0 1 1 0 0 1 4 2]
[0 4 1 1 1 4 1 1 1 1 1 0 1 0 1 2 1 1 1 2]
val_acc=0.7250
--------------------------------------------------------------------------------
Epoch 150: Eval
Eval at 42300
valid_accuracy: 0.6946
Eval at 42300
test_accuracy: 0.6842
Exception in thread QueueRunnerThread-dummy_queue-sync_token_q_EnqueueMany:
Traceback (most recent call last):
  File "/usr/lib/python2.7/threading.py", line 801, in __bootstrap_inner
    self.run()
  File "/usr/lib/python2.7/threading.py", line 754, in run
    self.__target(*self.__args, **self.__kwargs)
  File "/home/ahundt/.local/lib/python2.7/site-packages/tensorflow/python/training/queue_runner_impl.py", line 268, in _run
    coord.request_stop(e)
  File "/home/ahundt/.local/lib/python2.7/site-packages/tensorflow/python/training/coordinator.py", line 213, in request_stop
    six.reraise(*sys.exc_info())
  File "/home/ahundt/.local/lib/python2.7/site-packages/tensorflow/python/training/queue_runner_impl.py", line 252, in _run
    enqueue_callable()
  File "/home/ahundt/.local/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 1249, in _single_operation_run
    self._call_tf_sessionrun(None, {}, [], target_list, None)
  File "/home/ahundt/.local/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 1420, in _call_tf_sessionrun
    status, run_metadata)
  File "/home/ahundt/.local/lib/python2.7/site-packages/tensorflow/python/framework/errors_impl.py", line 516, in __exit__
    c_api.TF_GetCode(self.status.status))
CancelledError: TakeGrad operation was cancelled
         [[Node: sync_replicas/AccumulatorTakeGradient = AccumulatorTakeGradient[_class=["loc:@sync_replicas/conditional_accumulator"], dtype=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:CPU:0"](sync_replicas/conditional_accumulator, sync_replicas/AccumulatorTakeGradient/num_required)]]
         [[Node: sync_replicas/AccumulatorTakeGradient_2/_16859 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device_incarnation=1, tensor_name="edge_93_sync_replicas/AccumulatorTakeGradient_2", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:GPU:0"]()]]

I didn't take any steps to cancel it, such as hitting Ctrl+C, so I'm not sure why this is occurring.

MattVil commented 6 years ago

Did you solve this issue? I have the same error at the end of my search.

ahundt commented 6 years ago

I think so. Take a look at the pull request I made.

ahundt commented 6 years ago

https://github.com/melodyguan/enas/pull/29
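
For anyone reading the traceback above, one plausible reading (an assumption, not confirmed anywhere in this thread) is that the error is shutdown noise rather than a training failure. It appears right after the final eval at epoch 150. The frame in `coordinator.py` shows `request_stop` re-raising the exception inside the queue runner thread, which would be consistent with the coordinator already having been joined when the pending `TakeGrad` op was cancelled. The sketch below is a stdlib-only toy model of that sequence. `ToyCoordinator`, `Cancelled`, and `run_once` are hypothetical names made up for illustration and are not TensorFlow code.

```python
import threading


class Cancelled(Exception):
    """Hypothetical stand-in for tf.errors.CancelledError."""


class ToyCoordinator:
    """Simplified model of a thread coordinator; not TensorFlow's implementation."""

    def __init__(self, clean_stop_types=()):
        self._clean = tuple(clean_stop_types)
        self._joined = False
        self._lock = threading.Lock()

    def join(self):
        # Toy version: only records that the main thread is done with us.
        with self._lock:
            self._joined = True

    def request_stop(self, exc=None):
        with self._lock:
            if exc is not None and isinstance(exc, self._clean):
                exc = None  # treated as a normal shutdown signal
            if self._joined and exc is not None:
                raise exc  # nobody is left to report to, so raise in this thread


def run_once(clean_stop_types=()):
    """Return the exceptions that escaped the background thread."""
    coord = ToyCoordinator(clean_stop_types)
    session_closed = threading.Event()
    escaped = []

    def enqueue_loop():
        try:
            try:
                session_closed.wait()  # models an op blocked inside session.run
                raise Cancelled("pending operation was cancelled")
            except Exception as e:
                coord.request_stop(e)
        except Exception as e:
            escaped.append(e)  # a real thread would print a traceback here

    worker = threading.Thread(target=enqueue_loop)
    worker.start()
    coord.join()          # training loop finished; coordinator marked as joined
    session_closed.set()  # closing the "session" cancels the pending op
    worker.join()
    return escaped


if __name__ == "__main__":
    print(run_once())                               # the cancellation escapes the thread
    print(run_once(clean_stop_types=(Cancelled,)))  # cancellation treated as clean stop
```

In the toy model, declaring the cancellation a clean-stop type keeps it from escaping the thread. The real `tf.train.Coordinator` has a comparable `clean_stop_exception_types` constructor argument, but this thread does not say whether the linked pull request uses it or fixes the problem some other way.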