khanh101 / tiny-yolo-tensorflow

Tiny yolov3 tensorflow

I can not test. #2

Open sounansu opened 6 years ago

sounansu commented 6 years ago

Hi. I tried to train with the 'make train' command, but training did not finish. The error is below:

Traceback (most recent call last):
  File "/home/sounansu/anaconda3/envs/tiny-yolo-tensorflow/lib/python3.5/site-packages/tensorflow/python/client/session.py", line 1350, in _do_call
    return fn(*args)
  File "/home/sounansu/anaconda3/envs/tiny-yolo-tensorflow/lib/python3.5/site-packages/tensorflow/python/client/session.py", line 1329, in _run_fn
    status, run_metadata)
  File "/home/sounansu/anaconda3/envs/tiny-yolo-tensorflow/lib/python3.5/site-packages/tensorflow/python/framework/errors_impl.py", line 473, in __exit__
    c_api.TF_GetCode(self.status.status))
tensorflow.python.framework.errors_impl.InvalidArgumentError: Nan in summary histogram for: loss
         [[Node: loss = HistogramSummary[T=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:CPU:0"](loss/tag, TRAINER/add_8/_95)]]

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "./train.py", line 97, in <module>
    summary, _ , lossp, lxy, lwh, lobj, lnoobj, lp = sess.run([merge, trainer, loss, loss_xy, loss_wh, loss_obj, loss_noobj, loss_p], feed_dict = {X: Xp, Y1: Y1p, Y2:Y2p})
  File "/home/sounansu/anaconda3/envs/tiny-yolo-tensorflow/lib/python3.5/site-packages/tensorflow/python/client/session.py", line 895, in run
    run_metadata_ptr)
  File "/home/sounansu/anaconda3/envs/tiny-yolo-tensorflow/lib/python3.5/site-packages/tensorflow/python/client/session.py", line 1128, in _run
    feed_dict_tensor, options, run_metadata)
  File "/home/sounansu/anaconda3/envs/tiny-yolo-tensorflow/lib/python3.5/site-packages/tensorflow/python/client/session.py", line 1344, in _do_run
    options, run_metadata)
  File "/home/sounansu/anaconda3/envs/tiny-yolo-tensorflow/lib/python3.5/site-packages/tensorflow/python/client/session.py", line 1363, in _do_call
    raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.InvalidArgumentError: Nan in summary histogram for: loss
         [[Node: loss = HistogramSummary[T=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:CPU:0"](loss/tag, TRAINER/add_8/_95)]]

Caused by op 'loss', defined at:
  File "./train.py", line 73, in <module>
    tf.summary.histogram("loss", loss)
  File "/home/sounansu/anaconda3/envs/tiny-yolo-tensorflow/lib/python3.5/site-packages/tensorflow/python/summary/summary.py", line 193, in histogram
    tag=tag, values=values, name=scope)
  File "/home/sounansu/anaconda3/envs/tiny-yolo-tensorflow/lib/python3.5/site-packages/tensorflow/python/ops/gen_logging_ops.py", line 189, in _histogram_summary
    "HistogramSummary", tag=tag, values=values, name=name)
  File "/home/sounansu/anaconda3/envs/tiny-yolo-tensorflow/lib/python3.5/site-packages/tensorflow/python/framework/op_def_library.py", line 787, in _apply_op_helper
    op_def=op_def)
  File "/home/sounansu/anaconda3/envs/tiny-yolo-tensorflow/lib/python3.5/site-packages/tensorflow/python/framework/ops.py", line 3160, in create_op
    op_def=op_def)
  File "/home/sounansu/anaconda3/envs/tiny-yolo-tensorflow/lib/python3.5/site-packages/tensorflow/python/framework/ops.py", line 1625, in __init__
    self._traceback = self._graph._extract_stack()  # pylint: disable=protected-access

InvalidArgumentError (see above for traceback): Nan in summary histogram for: loss
         [[Node: loss = HistogramSummary[T=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:CPU:0"](loss/tag, TRAINER/add_8/_95)]]
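The error above says the loss already contained NaN when the histogram summary ran. A minimal sketch of a check that names the offending loss term, assuming the individual terms are fetched as plain numbers in a separate sess.run call that does not include the merged summary. `find_non_finite` is a hypothetical helper and the values are made up for illustration; neither is part of the repository.

```python
import math

def find_non_finite(losses):
    """Return the sorted names of loss terms whose value is NaN or infinite."""
    return sorted(
        name for name, value in losses.items()
        if not math.isfinite(float(value))
    )

# Hypothetical values standing in for what sess.run would return.
losses = {"loss_xy": 0.8, "loss_wh": float("nan"), "loss_obj": 1.2}

bad = find_non_finite(losses)
if bad:
    # Stop before writing summaries so the failing term is visible.
    print("non-finite loss terms:", bad)
```

Knowing which term goes non-finite first narrows down where to look in the loss definition.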

BTW, the checkpoint files were still written:

$ ls train_graph/
checkpoint                                tiny-yolo-10000.ckpt.meta                tiny-yolo-5000.ckpt.data-00000-of-00001  tiny-yolo-7500.ckpt.index                 tiny-yolo-final.ckpt.meta
events.out.tfevents.1538097615.ubuntu2    tiny-yolo-2500.ckpt.data-00000-of-00001  tiny-yolo-5000.ckpt.index                tiny-yolo-7500.ckpt.meta
tiny-yolo-10000.ckpt.data-00000-of-00001  tiny-yolo-2500.ckpt.index                tiny-yolo-5000.ckpt.meta                 tiny-yolo-final.ckpt.data-00000-of-00001
tiny-yolo-10000.ckpt.index                tiny-yolo-2500.ckpt.meta                 tiny-yolo-7500.ckpt.data-00000-of-00001  tiny-yolo-final.ckpt.index

So I tried to test with this checkpoint using the 'make test -i data/dog.jpg' command. (I modified test.py; the diff is below.)

12c12
< saver.restore(sess,"./train_graph/tiny-yolo-final.ckpt")
---
> saver.restore("./train_graph/tiny-yolo-final.ckpt")
33c33
<     im_out = np.zeros((1, size, size, 3 ))
---
>     im_out = np.zeros(1, size, size, 3)
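The second hunk of the diff comes down to how np.zeros takes its shape. A short sketch of the difference, assuming size is 416 to match the input shape reported in this thread:

```python
import numpy as np

size = 416  # assumed input resolution

# The shape must be passed as a single tuple.
im_out = np.zeros((1, size, size, 3))
print(im_out.shape)

# Passing the dimensions as separate positional arguments raises TypeError,
# because np.zeros does not treat extra positional arguments as dimensions.
try:
    np.zeros(1, size, size, 3)
except TypeError as err:
    print("TypeError:", err)
```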

But I still cannot get a detection image:

Traceback (most recent call last):
  File "./test2.py", line 56, in <module>
    print(detect(im))
  File "./test2.py", line 29, in detect
    return sess.run(prediction, feed_dict = {X:Xp})
  File "/home/sounansu/anaconda3/envs/tiny-yolo-tensorflow/lib/python3.5/site-packages/tensorflow/python/client/session.py", line 895, in run
    run_metadata_ptr)
  File "/home/sounansu/anaconda3/envs/tiny-yolo-tensorflow/lib/python3.5/site-packages/tensorflow/python/client/session.py", line 1104, in _run
    % (np_val.shape, subfeed_t.name, str(subfeed_t.get_shape())))
ValueError: Cannot feed value of shape (1, 416, 416, 3) for Tensor 'YOLO/input:0', which has shape '(32, 416, 416, 3)'

khanh101 commented 5 years ago

@sounansu It has been a long time since I last touched TensorFlow.

  1. The training problem can have many causes. One of them may be the dropout layer that I included. My suggestion is to read the YOLO paper to see what kinds of issues the authors ran into.
  2. The input shape is (32, 416, 416, 3), i.e. batch = 32. As far as I know, the underlying implementation of TensorFlow is heavily optimized for memory use. With a batch > 1, the CUDA code probably splits the records, feeds them to the GPU one at a time and combines the results later. So I believe that TF supports changing the input batch size.

FSet89 commented 5 years ago

I noticed the same problem. A temporary solution is to set the batch size to 1 in create_graph, but the code needs to be changed to support a dynamic batch size in the placeholder.
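In TensorFlow 1.x a placeholder dimension can be declared as None so that it accepts any batch size, but whether the rest of this graph tolerates that has not been checked here. Until the graph is changed, another possible workaround is to pad a single image up to the fixed batch of 32 and keep only the first result. A minimal sketch, assuming the first axis of the prediction is the batch axis. `pad_to_batch` is a hypothetical helper, not part of the repository.

```python
import numpy as np

def pad_to_batch(image, batch_size=32):
    """Place one HxWxC image in slot 0 of a zero-filled batch of fixed size."""
    batch = np.zeros((batch_size,) + image.shape, dtype=image.dtype)
    batch[0] = image
    return batch

# Stand-in for one preprocessed 416x416 RGB image.
image = np.ones((416, 416, 3), dtype=np.float32)

Xp = pad_to_batch(image)
print(Xp.shape)  # (32, 416, 416, 3)
```

Xp can then be fed to the input placeholder, using only the first entry of the output. This wastes computation on 31 empty images, so it is only a stopgap.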