matpalm / bnn

bee detection tensorflow conv net for a rasp pi on side of a hive
http://matpalm.com/blog/counting_bees/
MIT License

minimal example running on NCS #8

Closed matpalm closed 5 years ago

matpalm commented 6 years ago

cc @squeakus

finally have a version of this network running on the NCS :)

( this image was calculated from the stick )

[image: from_ncs]

currently all code is on a hacky branch ncs_poc just to prove things work, have to clean up a fair bit and merge everything back to master.

see this README for repro instructions

squeakus commented 6 years ago

Happy days! I'm delighted you have got it working! I am currently collecting another dataset but hope to get it running on the NCS soon.

squeakus commented 6 years ago

I finally got a chance to test it out tonight. It looks like it is working but I had to remove an encoding layer as my images are 640x480 and a 512 patch size is too big. I changed it to look like this:

    input (?, 256, 256, 3) #196608
    e1 (?, 127, 127, 16) #258064
    e2 (?, 63, 63, 32) #127008
    e3 (?, 31, 31, 64) #61504
    e4 (?, 15, 15, 128) #28800
    d1 (?, 31, 31, 64) #61504
    d2 (?, 63, 63, 32) #127008
    d3 (?, 127, 127, 16) #258064
    logits (?, 127, 127, 1) #16129

Is that the correct way to do it?

Also I uncommented the modelTester code (I love me some stats) and got the following error:

    full res test model...
    WARNING: ncs_hacktastic
    input (?, 480, 640, 3) #921600
    e1 (?, 239, 319, 16) #1219856
    e2 (?, 119, 159, 32) #605472
    e3 (?, 59, 79, 64) #298304
    e4 (?, 29, 39, 128) #144768
    d1 (?, 59, 79, 64) #298304
    d2 (?, 119, 159, 32) #605472
    d3 (?, 239, 319, 16) #1219856
    logits (?, 239, 319, 1) #76241
    ValueError: logits and labels must have the same shape ((?, 239, 319, 1) vs (?, 127, 127, 1))

It seems to be picking up the image size, not the patch size. How do I best mix patches with the testing code?
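The layer shapes in both dumps above are consistent with each encoder layer being a 3x3 convolution with stride 2 and no padding, where the output length along an axis is floor((n - 3) / 2) + 1. The stride and padding are inferred from the printed numbers rather than read from the repo, so treat this as a sketch; `conv_out` and `encoder_sizes` are hypothetical helpers, not functions from bnn.

```python
def conv_out(n, kernel=3, stride=2):
    """Output length of an unpadded ('VALID') convolution along one axis."""
    return (n - kernel) // stride + 1


def encoder_sizes(height, width, layers=4):
    """Spatial sizes after each of `layers` stride-2 3x3 convolutions."""
    sizes = []
    for _ in range(layers):
        height, width = conv_out(height), conv_out(width)
        sizes.append((height, width))
    return sizes


# 256x256 patch, as used for the patch-trained model
print(encoder_sizes(256, 256))
# full 480x640 (height x width) image, as used by the full res test model
print(encoder_sizes(480, 640))
```

Under these assumptions the first call gives 127, 63, 31, 15 and the second gives (239, 319) down to (29, 39), matching e1 to e4 in each dump. The logits take the spatial size of e1, which would explain why a model built on the full image emits (239, 319) while patch-sized labels are (127, 127).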

matpalm commented 6 years ago

yeah that (256,256) -> (127,127) all looks good.

with respect to the (127,127) you're sadly hitting some hard coded stuff i have in there... it's this bit of code which is an explicit slice/reshape workaround for the size/shape of the 2d output being wrong

it's clumsy i know, but that could be configurable (until there's a fix..)

squeakus commented 6 years ago

ahh, would it be quicker for me to just crop the test images to 127,127 or will it work if I change the shape of the output?

matpalm commented 6 years ago

changing the code to match your size would probably be the quickest...

squeakus commented 6 years ago

I finally got a bit of time this morning and managed to get it working from start to NCS finish! Unfortunately the results were not great. I went back to train.py and uncommented the test code to see how well the training was working. The training network was set up for a fixed patch size while the test network was being run on the full image, so I turned off the patches and changed the image shape in data.py to resize to 239x319. The network topology and labels now match up:

    patch train model...
    input (?, 480, 640, 3) #921600
    e1 (?, 239, 319, 16) #1219856
    e2 (?, 119, 159, 32) #605472
    e3 (?, 59, 79, 64) #298304
    e4 (?, 29, 39, 128) #144768
    d1 (?, 59, 79, 64) #298304
    d2 (?, 119, 159, 32) #605472
    d3 (?, 239, 319, 16) #1219856
    logits (?, 239, 319, 1) #76241

but when I run the training I get the following error:

    tensorflow.python.framework.errors_impl.InvalidArgumentError: Input to reshape is a tensor with 19200 values, but the requested shape has 76800
    [[Node: Reshape_1 = Reshape[T=DT_FLOAT, Tshape=DT_INT32](arg1, Reshape_1/shape)]]
    [[Node: IteratorGetNext = IteratorGetNextoutput_shapes=[[?,480,640,3], [?,239,319,1]], output_types=[DT_FLOAT, DT_FLOAT], _device="/job:localhost/replica:0/task:0/device:CPU:0"]]

So 240x320 is 76800, but I cannot see anywhere in the code where the tensor is being set to 19200. I am starting to realise that tensorflow is difficult to debug, to say the least!

Do you have any suggestions for finding where this is getting set, or for debugging tensorflow models in general?

squeakus commented 6 years ago

I thought it may have been the shape of my images, so I resized them to match the patch size. I also resized my labels to 64x64 with nearest neighbour interpolation. I get the same error despite the dumped shapes of the models being identical.

So it works with the patch flag but not without. This means it is either something wrong with my labels or I'm missing something in xys_iterator. I tried tfdbg but it is hard to see what is going on...
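For reference, nearest neighbour resizing of a label image can be sketched in plain Python as below. This is only an illustration of the interpolation mentioned above, not the code used in data.py; `resize_nearest` is a hypothetical helper and the 4x4 label is made-up toy data.

```python
def resize_nearest(label, out_h, out_w):
    """Resize a 2D list of label values by nearest-neighbour sampling."""
    in_h, in_w = len(label), len(label[0])
    return [
        [label[y * in_h // out_h][x * in_w // out_w] for x in range(out_w)]
        for y in range(out_h)
    ]


# Toy binary label: two single-pixel marks.
label = [
    [0, 0, 0, 0],
    [0, 1, 0, 0],
    [0, 0, 1, 0],
    [0, 0, 0, 0],
]
small = resize_nearest(label, 2, 2)
print(small)  # [[0, 0], [0, 1]]
```

Nearest neighbour never invents intermediate values, so a 0/1 label stays 0/1, but when downsampling it only samples some source pixels, so a single-pixel mark can disappear (the mark at row 1, column 1 is lost here). If labels are sparse it may be worth checking how many positive pixels survive the resize.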

matpalm commented 6 years ago

yeah, it's been a nightmare to debug... i've also made this repo more complicated than it needs to be because i've been conflating two things: 1) running a patch batched model with fixed sized inference to run on the NCS and 2) training patch based and running on arbitrary sized output for my meta learning experiments. i should really move 2) into its own repo since it requires different things than 1) on the data pipeline... but that's an aside...

are you trying to run with an output of (239,319) on the NCS? i recall having a problem where i couldn't get anything over (127,127) as output on the stick...

can you share a larger stack trace around the tensorflow.python.framework.errors_impl.InvalidArgumentError: Input to reshape is a tensor with 19200 values, but the requested shape has 76800 line ?

squeakus commented 6 years ago

It seemed to compile and run with 239,319 but the output was a mess. I will resize it if I hit the same 127,127 limitation.

Here is the stack trace and some additional information. The tensor being passed (19200 values) is 120 x 160 and the requested shape (76800 values) is 240 x 320, so I think the output of a particular layer might be the wrong size. I used the slim model analyzer to get more info but still cannot see anything wrong.

    $ ./train.py --run $RUN --steps $STEPS --train-steps 1000 --train-image-dir $DATADIR/train/ --test-image-dir $DATADIR/test/ --label-dir $DATADIR/labels/ --no-use-batch-norm --no-use-skip-connections --width 640 --height 480 --label-rescale 0.25
    opts Namespace(base_filter_size=8, batch_size=32, flip_left_right=False, height=480, label_dir='data/1850/labels/', label_rescale=0.25, learning_rate=0.001, no_use_batch_norm=True, no_use_skip_connections=True, patch_width_height=None, random_rotate=False, run='r2', secs=None, steps=2000, test_image_dir='data/1850/test/', train_image_dir='data/1850/train/', train_steps=1000, width=640)
    len(rgb_filenames) 1401
    NO CACHE
    WARNING: ncs_hacktastic
    patch train model...
    input (?, 480, 640, 3) #921600
    e1 (?, 239, 319, 16) #1219856
    e2 (?, 119, 159, 32) #605472
    e3 (?, 59, 79, 64) #298304
    e4 (?, 29, 39, 128) #144768
    d1 (?, 59, 79, 64) #298304
    d2 (?, 119, 159, 32) #605472
    d3 (?, 239, 319, 16) #1219856
    logits (?, 239, 319, 1) #76241

    Variables: name (type shape) [size]

    train_test_model/e1/weights:0 (float32_ref 3x3x3x16) [432, bytes: 1728]
    train_test_model/e1/biases:0 (float32_ref 16) [16, bytes: 64]
    train_test_model/e2/weights:0 (float32_ref 3x3x16x32) [4608, bytes: 18432]
    train_test_model/e2/biases:0 (float32_ref 32) [32, bytes: 128]
    train_test_model/e3/weights:0 (float32_ref 3x3x32x64) [18432, bytes: 73728]
    train_test_model/e3/biases:0 (float32_ref 64) [64, bytes: 256]
    train_test_model/e4/weights:0 (float32_ref 3x3x64x128) [73728, bytes: 294912]
    train_test_model/e4/biases:0 (float32_ref 128) [128, bytes: 512]
    train_test_model/d1/weights:0 (float32_ref 3x3x64x128) [73728, bytes: 294912]
    train_test_model/d1/biases:0 (float32_ref 64) [64, bytes: 256]
    train_test_model/d2/weights:0 (float32_ref 3x3x32x64) [18432, bytes: 73728]
    train_test_model/d2/biases:0 (float32_ref 32) [32, bytes: 128]
    train_test_model/d3/weights:0 (float32_ref 3x3x16x32) [4608, bytes: 18432]
    train_test_model/d3/biases:0 (float32_ref 16) [16, bytes: 64]
    train_test_model/d4/weights:0 (float32_ref 3x3x16x1) [144, bytes: 576]
    train_test_model/d4/biases:0 (float32_ref 1) [1, bytes: 4]
    Total size of variables: 194465
    Total bytes of variables: 777860
    2018-09-25 07:46:07.989531: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:898] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
    2018-09-25 07:46:07.989915: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1356] Found device 0 with properties:
    name: GeForce GTX 1080 Ti major: 6 minor: 1 memoryClockRate(GHz): 1.607
    pciBusID: 0000:01:00.0
    totalMemory: 10.91GiB freeMemory: 10.34GiB
    2018-09-25 07:46:07.989926: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1435] Adding visible gpu devices: 0
    2018-09-25 07:46:08.140031: I tensorflow/core/common_runtime/gpu/gpu_device.cc:923] Device interconnect StreamExecutor with strength 1 edge matrix:
    2018-09-25 07:46:08.140055: I tensorflow/core/common_runtime/gpu/gpu_device.cc:929] 0
    2018-09-25 07:46:08.140059: I tensorflow/core/common_runtime/gpu/gpu_device.cc:942] 0: N
    2018-09-25 07:46:08.140224: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1053] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 10009 MB memory) -> physical GPU (device: 0, name: GeForce GTX 1080 Ti, pci bus id: 0000:01:00.0, compute capability: 6.1)
    Traceback (most recent call last):
      File "/home/jonathan/.local/lib/python3.5/site-packages/tensorflow/python/client/session.py", line 1322, in _do_call
        return fn(*args)
      File "/home/jonathan/.local/lib/python3.5/site-packages/tensorflow/python/client/session.py", line 1307, in _run_fn
        options, feed_dict, fetch_list, target_list, run_metadata)
      File "/home/jonathan/.local/lib/python3.5/site-packages/tensorflow/python/client/session.py", line 1409, in _call_tf_sessionrun
        run_metadata)
    tensorflow.python.framework.errors_impl.InvalidArgumentError: Input to reshape is a tensor with 19200 values, but the requested shape has 76800
    [[Node: Reshape_1 = Reshape[T=DT_FLOAT, Tshape=DT_INT32](arg1, Reshape_1/shape)]]
    [[Node: IteratorGetNext = IteratorGetNextoutput_shapes=[[?,480,640,3], [?,239,319,1]], output_types=[DT_FLOAT, DT_FLOAT], _device="/job:localhost/replica:0/task:0/device:CPU:0"]]
    [[Node: train_test_model/d1/Shape-0-0-VecPermuteNCHWToNHWC-LayoutOptimizer/_11 = _HostRecv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device_incarnation=1, tensor_name="edge_66_train_test_model/d1/Shape-0-0-VecPermuteNCHWToNHWC-LayoutOptimizer", tensor_type=DT_INT32, _device="/job:localhost/replica:0/task:0/device:GPU:0"]()]]

During handling of the above exception, another exception occurred:

    Traceback (most recent call last):
      File "./train.py", line 96, in
        sess.run(train_op)
      File "/home/jonathan/.local/lib/python3.5/site-packages/tensorflow/python/client/session.py", line 900, in run
        run_metadata_ptr)
      File "/home/jonathan/.local/lib/python3.5/site-packages/tensorflow/python/client/session.py", line 1135, in _run
        feed_dict_tensor, options, run_metadata)
      File "/home/jonathan/.local/lib/python3.5/site-packages/tensorflow/python/client/session.py", line 1316, in _do_run
        run_metadata)
      File "/home/jonathan/.local/lib/python3.5/site-packages/tensorflow/python/client/session.py", line 1335, in _do_call
        raise type(e)(node_def, op, message)
    tensorflow.python.framework.errors_impl.InvalidArgumentError: Input to reshape is a tensor with 19200 values, but the requested shape has 76800
    [[Node: Reshape_1 = Reshape[T=DT_FLOAT, Tshape=DT_INT32](arg1, Reshape_1/shape)]]
    [[Node: IteratorGetNext = IteratorGetNextoutput_shapes=[[?,480,640,3], [?,239,319,1]], output_types=[DT_FLOAT, DT_FLOAT], _device="/job:localhost/replica:0/task:0/device:CPU:0"]]
    [[Node: train_test_model/d1/Shape-0-0-VecPermuteNCHWToNHWC-LayoutOptimizer/_11 = _HostRecv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device_incarnation=1, tensor_name="edge_66_train_test_model/d1/Shape-0-0-VecPermuteNCHWToNHWC-LayoutOptimizer", tensor_type=DT_INT32, _device="/job:localhost/replica:0/task:0/device:GPU:0"]()]]
    Exit 1
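The two value counts in the error can be related to the flags in the command above with a little arithmetic. This is a minimal sketch, assuming that `--label-rescale` simply scales the image height and width; that is a guess from the numbers, not something confirmed from data.py, and `label_shape` is a hypothetical helper.

```python
def label_shape(height, width, rescale):
    """Label height/width after scaling the image size by `rescale` (assumed behaviour)."""
    return int(height * rescale), int(width * rescale)


height, width = 480, 640  # --height / --width from the command above

for rescale in (0.5, 0.25):
    h, w = label_shape(height, width, rescale)
    print(rescale, (h, w), h * w)

# Size of the model's logits as printed in the dump, for comparison.
print("logits", (239, 319), 239 * 319)
```

Under that assumption 19200 values corresponds to a rescale of 0.25 (120 x 160) and 76800 to a rescale of 0.5 (240 x 320). Neither equals the 76241 values of a (239, 319) logits tensor, so it may be worth checking which of these three sizes the label files on disk actually have and which one the reshape in the input pipeline requests.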

matpalm commented 6 years ago

thanks for waiting jono, i still haven't had a chance to look at this yet... hopefully this afternoon the planets will align for some free time :D

squeakus commented 6 years ago

No rush! I only get a chance to look at it at the weekend atm

squeakus commented 5 years ago

I am going to try rewriting the code over the weekend to work with my images. Is the ncs_poc branch still the latest version, or should I be working off the master branch?

matpalm commented 5 years ago

Yeah. I still haven't merged it back yet sorry (since it also needs some clean up) but it demonstrates the things I needed to do. Good luck!

Luxonis-Brandon commented 4 years ago

Wow, super excited that you got this working on the NCS as well. At some point I'm going to try to get this running on our DepthAI platform so that you can know the physical location of the bees in cartesian coordinates (x, y, z) in centimeters, which would make it possible to map their 3D flight patterns.