mikeyEcology / MLWIC

Machine Learning for Wildlife Image Classification

Assign requires shapes of both tensors to match. lhs shape= [5] rhs shape= [28] #11

Closed hannaboe closed 5 years ago

hannaboe commented 5 years ago

I tried to use train() to train a model with my own images, but it does not work; the issue seems to be that I have only 5 species/categories instead of 28. I resized the images to 256x256 pixels and used numbers from 0 to 4 in the image-labels csv, but I get this error:

InvalidArgumentError (see above for traceback): Restoring from checkpoint failed. This is most likely due to a mismatch between the current graph and the graph from the checkpoint. Please ensure that you have not altered the graph expected based on the checkpoint. Original error:

Assign requires shapes of both tensors to match. lhs shape= [5] rhs shape= [28]
     [[node save/Assign_1 (defined at train.py:198)  = Assign[T=DT_FLOAT, _class=["loc:@output/biases"], use_locking=true, validate_shape=true, _device="/job:localhost/replica:0/task:0/device:CPU:0"](output/biases, save/RestoreV2:1)]]

When I change the number of species to 3 or something else, I get the same error with lhs shape= [3]. Do I need to do something besides specifying the number of species in num_classes? train and classify worked fine with the example images.

Also, is it necessary to resize the pictures to 256x256 pixels, or would train() also work with a different size?

This is my input:

train(path_prefix = "~/Hanna/camera_boxes/training_test/images", 
      data_info = "~/Hanna/camera_boxes/training/image_labels2.csv", 
      model_dir = "~/Hanna/MLWIC", 
      python_loc = "/software/anaconda2/bin/", 
      num_classes = 5, 
      log_dir_train = "training_output" 
)
mikeyEcology commented 5 years ago

For your image labels, are you labeling them as 0, 1, 2, 3, 4? Or do you have a different labeling system?

hannaboe commented 5 years ago

I labeled them as 0, 1, 2, 3, 4, using 0 for the first species, 1 for the second, and so on.

mikeyEcology commented 5 years ago

It sounds like you're doing everything right, and I'm not sure what's going on. The only other potential problem I can think of is that your csv might not have Unix linebreaks. A temporary fix would be to trick it by setting num_classes = 28; as long as you keep your labeling scheme, this won't affect the results.
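For example, based on your earlier call (only num_classes changes; the labels in the csv stay 0-4):

train(path_prefix = "~/Hanna/camera_boxes/training_test/images", 
      data_info = "~/Hanna/camera_boxes/training/image_labels2.csv", 
      model_dir = "~/Hanna/MLWIC", 
      python_loc = "/software/anaconda2/bin/", 
      num_classes = 28,   # workaround: matches the 28 classes in the provided model
      log_dir_train = "training_output" 
)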

hannaboe commented 5 years ago

Thank you. Using num_classes = 28 helped. I can start training a model now, and it runs for quite a while. However, at some point (around epoch 97 or 98) it always crashes, and it looks to me like a file is missing in 'training_output'. I tried training a model several times with different numbers of images and classes, but I always end up with the same result. I also tried to run classify() with my model, but that doesn't work.

This is the last part of my output:

2019-01-12 00:26:38.287313: epoch 98, step 1110, loss = 0.01, Top-1 = 1.00 Top-5 = 1.00 (40.8 examples/sec; 1.569 sec/batch)
2019-01-12 00:27:09.185133: epoch 98, step 1120, loss = 0.11, Top-1 = 0.97 Top-5 = 1.00 (41.9 examples/sec; 1.529 sec/batch)
Traceback (most recent call last):
  File "train.py", line 339, in <module>
    main()
  File "train.py", line 335, in main
    train(args)
  File "train.py", line 281, in train
    saver.save(sess, checkpoint_path, global_step=epoch)
  File "/software/anaconda2/lib/python2.7/site-packages/tensorflow/python/training/saver.py", line 1466, in save
    meta_graph_filename, strip_default_attrs=strip_default_attrs)
  File "/software/anaconda2/lib/python2.7/site-packages/tensorflow/python/training/saver.py", line 1511, in export_meta_graph
    strip_default_attrs=strip_default_attrs)
  File "/software/anaconda2/lib/python2.7/site-packages/tensorflow/python/training/saver.py", line 1800, in export_meta_graph
    **kwargs)
  File "/software/anaconda2/lib/python2.7/site-packages/tensorflow/python/framework/meta_graph.py", line 1007, in export_scoped_meta_graph
    as_text=as_text)
  File "/software/anaconda2/lib/python2.7/site-packages/tensorflow/python/framework/graph_io.py", line 73, in write_graph
    file_io.atomic_write_string_to_file(path, graph_def.SerializeToString())
  File "/software/anaconda2/lib/python2.7/site-packages/tensorflow/python/lib/io/file_io.py", line 434, in atomic_write_string_to_file
    write_string_to_file(temp_pathname, contents)
  File "/software/anaconda2/lib/python2.7/site-packages/tensorflow/python/lib/io/file_io.py", line 314, in write_string_to_file
    f.write(file_content)
  File "/software/anaconda2/lib/python2.7/site-packages/tensorflow/python/lib/io/file_io.py", line 208, in __exit__
    self.close()
  File "/software/anaconda2/lib/python2.7/site-packages/tensorflow/python/lib/io/file_io.py", line 240, in close
    pywrap_tensorflow.Set_TF_Status_from_Status(status, ret_status)
  File "/software/anaconda2/lib/python2.7/site-packages/tensorflow/python/framework/errors_impl.py", line 528, in __exit__
    c_api.TF_GetCode(self.status.status))
tensorflow.python.framework.errors_impl.UnknownError: training_output_species4/snapshot-98.meta.tmp1b86cf9660114329b68c6c62178b1ec4; Input/output error
[1] "training of model took 1.50564605285172 days. The trained model is in training_output_species4. Specify this directory as the log_dir when you use classify(). "

dlnorman6 commented 5 years ago

I have the same issue, except that I have more than 28 species in my dataset, so I need to increase the number of classes. This is my error:

InvalidArgumentError (see above for traceback): Restoring from checkpoint failed. This is most likely due to a mismatch between the current graph and the graph from the checkpoint. Please ensure that you have not altered the graph expected based on the checkpoint. Original error:

Assign requires shapes of both tensors to match. lhs shape= [512,31] rhs shape= [512,28]
     [[node save/Assign_4 (defined at train.py:198)  = Assign[T=DT_FLOAT, _class=["loc:@output/weights"], use_locking=true, validate_shape=true, _device="/job:localhost/replica:0/task:0/device:CPU:0"](output/weights/Momentum, save/RestoreV2:4)]]

Is there somewhere else we need to amend the code to allow a different number of classes?

fabstp commented 5 years ago

Hi,

A bit off topic, but do you know if we absolutely need to resize images to 256x256? My original size is 1920x1080, and the images become very pixelated when I resize them. I worry that my model will have more difficulty distinguishing species in the resized pictures.

I tried training without resizing. I get this message:

ERROR:tensorflow:Exception in QueueRunner: Invalid JPEG data or crop window, data size 206074

but the program still continues.

dwwolfson commented 5 years ago

Yes, images should be resized to 256x256. The model architecture is optimized for the number of pixels in a 256x256 image, so far larger inputs mean the algorithm has to process much more data per image. The model may run, but I'd expect performance to suffer. I'd be interested to hear how resized vs. un-resized images compare, though.
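For reference, here is a minimal batch-resizing sketch in R using the magick package (this is not part of MLWIC; the folder paths are placeholders):

library(magick)

in_dir  <- "original_images"   # placeholder: folder with full-size photos
out_dir <- "resized_images"    # placeholder: folder for 256x256 copies
dir.create(out_dir, showWarnings = FALSE)

for (f in list.files(in_dir, pattern = "\\.jpe?g$", ignore.case = TRUE)) {
  img <- image_read(file.path(in_dir, f))
  img <- image_resize(img, "256x256!")   # "!" forces exact size, ignoring aspect ratio
  image_write(img, file.path(out_dir, f))
}

Keep the file names unchanged so they still match the labels in your csv.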

fabstp commented 5 years ago

Hi @dwwolfson. I trained a model using 2,000 photos with something in them and 40,000 photos with nothing but moving vegetation. I did not crop the pictures; training crashed at epoch 62, but I was still able to run the model on my photos (I used a subset from July to train and a subset from August to classify). More than 95% of the time, the first guess comes back with a confidence above 99%. My model returns some false positives (identifying something when the picture is empty) but very few false negatives, which is great! Picture size doesn't seem to matter except maybe for processing time: classifying 7,869 photos took 8.29 min, and training crashed after 1.5 days.

eric-fegraus commented 5 years ago

I'm getting a similar error to dlnorman6 above. Changing num_classes to anything besides 28 is causing me problems. Looking into the Python code now.

matobler commented 5 years ago

The problem is in the train() R function. The parameter "--retrain_from USDA182" is hard-coded, which means training always continues from the model provided with the MLWIC code, which has 28 classes. If you remove that part, you can train your own model from scratch. Several other parameters are either missing or hard-coded in the R code (e.g. "--batch_size 128" is hard-coded and num_epoches is missing). I had to tweak both the train() and classify() functions to make them more generic, and now training models with different numbers of classes works. If you run out of GPU memory, try making --batch_size smaller; I am using 64. Hope that helps. Happy to share my updated code; a sketch of the idea follows.
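To illustrate (this is not the actual MLWIC source; the helper below is hypothetical and only shows the shape of the change):

# Hypothetical sketch of how the R function might assemble the Python call.
# The fix: expose batch_size and make --retrain_from optional instead of
# hard-coding "--retrain_from USDA182" and "--batch_size 128".
build_train_cmd <- function(python_loc, num_classes, batch_size = 64,
                            retrain_from = NULL) {
  cmd <- paste0(python_loc, "python train.py",
                " --num_classes ", num_classes,
                " --batch_size ", batch_size)
  if (!is.null(retrain_from)) {
    cmd <- paste0(cmd, " --retrain_from ", retrain_from)  # only when retraining
  }
  cmd
}

build_train_cmd("/software/anaconda2/bin/", num_classes = 5)
# [1] "/software/anaconda2/bin/python train.py --num_classes 5 --batch_size 64"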

mikeyEcology commented 5 years ago

I updated the train function so that you can specify retrain = FALSE if you want to train from scratch. You can also now specify batch_size. Using a smaller batch size will take longer, but it will be more accurate. It is best if your batch size is a multiple of 64.
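For example, a from-scratch run along the lines of the call earlier in this thread would now look like this (argument values are illustrative):

train(path_prefix = "~/Hanna/camera_boxes/training_test/images", 
      data_info = "~/Hanna/camera_boxes/training/image_labels2.csv", 
      model_dir = "~/Hanna/MLWIC", 
      python_loc = "/software/anaconda2/bin/", 
      num_classes = 5,     # your actual number of classes
      retrain = FALSE,     # train from scratch instead of from the 28-class model
      batch_size = 64,     # reduce if you run out of GPU memory
      log_dir_train = "training_output" 
)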

eric-fegraus commented 5 years ago

Great, this looks good and works for training so far. You are right: I'll need to change classify() so it knows where to find the newly trained model.

fabstp commented 5 years ago

Hi,

I just trained a model with 2 classes (caribou vs. empty photo) with the updated code, specifying retrain = FALSE and num_classes = 2. Everything worked fine. However, when I try to classify using num_classes = 2, I get the following error:

ValueError: input must have last dimension >= k = 5 but is 2 for 'TopKV2' (op: 'TopKV2') with input shapes: [?,2], [] and with computed input tensors: input[1] = <5>.

I'm pretty sure it has something to do with the package's five-guess output structure. Since I don't have 5 or more species, is there a way to tweak the package so I can just use a top 2 with the model I trained?

matobler commented 5 years ago

There is a parameter --top_n in the eval.py code that defaults to 5. The R code does not pass that parameter on to the Python function, so you either have to edit the classify R function, call the Python script directly from the command line (a sketch follows the list below), or edit eval.py and set the default to 2 (which might be easiest). Here is the list of all the Python parameters and their default values:

--load_size default=[256,256]
--crop_size default=[224,224]
--batch_size default=512
--num_classes default=1000
--top_n default=5
--num_channels default=3
--num_batches default=-1
--path_prefix default='./'
--delimiter default=' '
--data_info default='val.txt'
--num_threads default=20
--architecture default='resnet'
--depth default=50
--log_dir default=None
--save_predictions default=None
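For the command-line route, something like this from R should work (a sketch only; eval.py must be run from the MLWIC model directory, and all paths and file names here are placeholders):

# Sketch: call eval.py directly so --top_n can be set to 2.
system2("/software/anaconda2/bin/python",
        args = c("eval.py",
                 "--num_classes", "2",
                 "--top_n", "2",
                 "--path_prefix", "path/to/images",        # placeholder
                 "--data_info", "image_labels.csv",        # placeholder
                 "--log_dir", "training_output",           # your trained model's directory
                 "--save_predictions", "predictions.txt")) # placeholder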

fabstp commented 5 years ago

I edited eval.py and it works fine.

Thank you for the quick and easy answer, @matobler.

mikeyEcology commented 5 years ago

I edited the classify function so you can now specify top_n as a parameter.
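For example (a sketch; the classify() argument names mirror those used for train() earlier in this thread, and the paths are placeholders):

classify(path_prefix = "path/to/images", 
         data_info = "image_labels.csv", 
         model_dir = "~/MLWIC", 
         python_loc = "/software/anaconda2/bin/", 
         num_classes = 2, 
         top_n = 2,                    # new parameter: request only the top 2 guesses
         log_dir = "training_output"   # directory containing your trained model
)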