tensorflow / tpu

Reference models and tools for Cloud TPUs.
https://cloud.google.com/tpu/
Apache License 2.0

Unable to Run EfficientNet-edgetpu on local machine in eval mode #597

Open sarahmass opened 5 years ago

sarahmass commented 5 years ago

@mingxingtan, I would appreciate your help on this. I have read through many of the issues here already and have not found an answer. I have been toiling with this for about a week now. :(

I can get the model to build, but I cannot run an evaluation on the edgetpu model. I don't think it is a model issue; I think it is an input issue. Please correct me if I am wrong. I would appreciate any guidance to get this to run.

This is what I have done and the results:

  1. eval_ckpt_main.py does not appear to accept the tfrecords that the readme says to use for main.py. I can run the eval_chkpt_example on several images and get predictions, but I want to run an evaluation on ImageNet on my local machine using the saved ckpt for the edgetpu-S model. When I feed the .jpg files from the ImageNet raw data files into the model, it runs, but the model does not recognize the n######## names as labels or the label maps. What form is the data supposed to be in?

  2. I then moved on to main.py and used mode=eval. If I use an eval_batch_size of 1024, I run out of memory. This happens both in an environment that uses only the CPU and in one that uses the GPU. When I reduce the batch size to 512, I still run out of memory. Then at 128 I get the following error:

tensorflow.python.framework.errors_impl.InvalidArgumentError: 2 root error(s) found.
  (0) Invalid argument: Input to reshape is a tensor with 128128 values, but the requested shape requires a multiple of 1000
      [[node softmax_cross_entropy_loss/xentropy/Reshape (defined at C:\Users\v-sarmas.conda\envs\tf-gpu\lib\site-packages\tensorflow_core\python\framework\ops.py:1748) ]]
      [[Identity/_1517]]
  (1) Invalid argument: Input to reshape is a tensor with 128128 values, but the requested shape requires a multiple of 1000
      [[node softmax_cross_entropy_loss/xentropy/Reshape (defined at C:\Users\v-sarmas.conda\envs\tf-gpu\lib\site-packages\tensorflow_core\python\framework\ops.py:1748) ]]
0 successful operations.
0 derived errors ignored.

  3. So, how do I run an evaluation on ImageNet using an edgeTPU model on my local machine to verify the published accuracy?

sarahmass commented 5 years ago

Update:
I went back to trying to run eval_ckpt_main.py.
The input needs to be .jpg. I have all the .jpg files separated into folders based on their synset labels, so I listed all the images in order and created a label file, in the same order, based on the folder each image came from. I was able to achieve an evaluation accuracy about half a percent better than the listed accuracy for edgeTPU-S.
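The folder-to-label step described above can be sketched as follows. This is a minimal illustration, not code from the repository: the function names `build_eval_lists` and `write_labels` are hypothetical, and it assumes that a class index is the position of the synset ID in sorted order and that the label file holds one integer per line. Check both assumptions against what eval_ckpt_main.py actually expects before relying on it.

```python
import os


def build_eval_lists(root_dir):
    """Return (image_paths, labels) for a folder-per-synset layout.

    Assumption: the class index of a synset is its position in the
    sorted list of synset folder names under root_dir.
    """
    synsets = sorted(
        d for d in os.listdir(root_dir)
        if os.path.isdir(os.path.join(root_dir, d))
    )
    synset_to_index = {s: i for i, s in enumerate(synsets)}

    image_paths, labels = [], []
    for synset in synsets:
        folder = os.path.join(root_dir, synset)
        # Sort file names so images and labels stay in the same order.
        for name in sorted(os.listdir(folder)):
            if name.lower().endswith((".jpg", ".jpeg")):
                image_paths.append(os.path.join(folder, name))
                labels.append(synset_to_index[synset])
    return image_paths, labels


def write_labels(labels, path):
    """Write one integer label per line (hypothetical file format)."""
    with open(path, "w") as f:
        for label in labels:
            f.write(f"{label}\n")
```

Keeping the image list and the label list in one pass is what guarantees that line i of the label file belongs to image i.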

mingxingtan commented 5 years ago

@sarahmass Sorry for causing these troubles!

Our eval_ckpt_main.py is mainly for users who don't have tfrecord (if you have tfrecord, you can directly run main.py and set runmode to 'eval').

Looks like you have already solved this issue?

sarahmass commented 5 years ago

I still can't get it to run with main.py. I don't know why I can't reduce the eval_batch_size to get around the OOM issue. I will need to fine-tune next, so I need to be able to get main.py to run. Have you seen the error I posted above?

mingxingtan commented 5 years ago

The eval batch size might be too big. Can you use a smaller batch size?

sarahmass commented 5 years ago

When I reduce the batch size to 128 I get the error below. I get similar input reshape errors with any size less than this as well. I don't know what is causing it.
tensorflow.python.framework.errors_impl.InvalidArgumentError: 2 root error(s) found.
  (0) Invalid argument: Input to reshape is a tensor with 128128 values, but the requested shape requires a multiple of 1000
      [[node softmax_cross_entropy_loss/xentropy/Reshape (defined at C:\Users\v-sarmas.conda\envs\tf-gpu\lib\site-packages\tensorflow_core\python\framework\ops.py:1748) ]]
      [[Identity/_1517]]
  (1) Invalid argument: Input to reshape is a tensor with 128128 values, but the requested shape requires a multiple of 1000
      [[node softmax_cross_entropy_loss/xentropy/Reshape (defined at C:\Users\v-sarmas.conda\envs\tf-gpu\lib\site-packages\tensorflow_core\python\framework\ops.py:1748) ]]
0 successful operations.
0 derived errors ignored.

sarahmass commented 5 years ago

Update: I know what the problem is. The edgeTPU model checkpoints were created using 1001 classes, while the rest of the models were created using 1000 classes. I mentioned this in another issue, #594.
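The numbers in the error message are consistent with this diagnosis. As a quick check, using only the batch size and the tensor size quoted in the error:

```python
batch_size = 128
logit_values = 128128  # "a tensor with 128128 values" from the error message

# Values per example: 128128 / 128 = 1001, i.e. a 1001-way output.
classes_per_example, remainder = divmod(logit_values, batch_size)
print(classes_per_example, remainder)  # 1001 0

# The loss tried to reshape to [-1, 1000], which needs a multiple of 1000.
print(logit_values % 1000)  # 128
```

So the model emits 1001 logits per image while the loss is built for 1000 classes, which is why changing the batch size alone does not make the reshape succeed.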

I have figured out a workaround:

include_background_label = (FLAGS.include_background_label)

sarahmass commented 5 years ago

Update: The code will run, but there is an offset in the class labels, so the accuracy is off. I will be working on another workaround.

In addition to the above workaround, I commented out lines 177-179 in imagenet_input.py.
Again, this is only an issue for the edgeTPU models.
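To see why a label offset ruins the reported accuracy even when the model itself is fine, consider the toy example below. It is only an illustration: it assumes the background class occupies index 0 of a 1001-way output, and it does not reproduce what lines 177-179 of imagenet_input.py do. The helper `top1_accuracy` and the sample values are made up.

```python
def top1_accuracy(predicted, labels):
    """Fraction of positions where the predicted class equals the label."""
    correct = sum(int(p == y) for p, y in zip(predicted, labels))
    return correct / len(labels)


# Hypothetical 1001-way model that is always right:
# index 0 is background, indices 1..1000 are the real classes.
true_classes = [1, 5, 42, 1000]
predictions = list(true_classes)

# Labels left in the same 1-based space as the model output.
labels_with_background = list(true_classes)
# Labels shifted down by one, as a 1000-class pipeline would want.
labels_shifted = [c - 1 for c in true_classes]

print(top1_accuracy(predictions, labels_with_background))  # 1.0
print(top1_accuracy(predictions, labels_shifted))          # 0.0
```

A perfect model scores zero once the labels are shifted by one, so the label space fed to the loss and metrics has to match the checkpoint's class layout.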

sarahmass commented 5 years ago

@mingxingtan, I'm sorry to bother you, but there is one more issue with running main.py in eval mode. After it evaluates the pretrained model on the ImageNet data, it waits for a new checkpoint to evaluate. Why does it do this? I know I can just change the eval_timeout flag to stop it from looking for other checkpoints, but I'm still curious why this is happening.

mingxingtan commented 5 years ago

@sarahmass This is for continuous eval. The common flow is like this: we start a train job and an eval job at the same time; the train job keeps running and generates checkpoints periodically; the eval job keeps reading the checkpoints, and whenever a new checkpoint is generated, it starts to eval that checkpoint. Does that make sense to you?

For your case, you only need to eval a single checkpoint, so you can either (1) change the timeout, or (2) change train_steps to be smaller than your checkpoint's global step value.
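The two exit conditions can be illustrated with a small simulation. This is a simplified model of a continuous-eval loop, not the code in main.py: `continuous_eval` is a hypothetical function, the list of checkpoint steps stands in for checkpoints appearing on disk, and running out of list entries stands in for the timeout expiring.

```python
def continuous_eval(checkpoint_steps, train_steps):
    """Simulate an eval job that consumes checkpoints as they appear.

    checkpoint_steps: global steps of checkpoints, in arrival order.
        Exhausting this iterable models the timeout expiring.
    train_steps: the step at which training is considered finished.
    Returns (evaluated_steps, reason_for_stopping).
    """
    evaluated = []
    for step in checkpoint_steps:
        evaluated.append(step)  # stand-in for running one evaluation
        if step >= train_steps:
            # Final checkpoint seen, so there is nothing left to wait for.
            return evaluated, "reached train_steps"
    return evaluated, "timed out waiting for a new checkpoint"


# A single pretrained checkpoint at step 1000, train_steps left large:
print(continuous_eval([1000], train_steps=5000))
# ([1000], 'timed out waiting for a new checkpoint')

# Same checkpoint, train_steps not larger than its global step:
print(continuous_eval([1000], train_steps=1000))
# ([1000], 'reached train_steps')
```

In the first call the job evaluates once and then idles until the timeout, which matches the waiting behavior described in the question; in the second it exits right after the evaluation.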

sarahmass commented 5 years ago

@mingxingtan, Thank you. That does make sense.