microsoft / AutonomousDrivingCookbook

Scenarios, tutorials and demos for Autonomous Driving
MIT License

Help! training not starting! #urgent #124

Open danialvi opened 2 years ago

danialvi commented 2 years ago

My training is not starting. I am using Python 3.6 with tensorflow-gpu 1.8.0 and Keras 2.1.2. My computer has a GeForce RTX 3060, so the GPU shouldn't be the problem. I also installed Norton antivirus on this new computer. My older computer, which has a weak GPU, ran Panda Dome, and training did run there, but after more than an hour it had only reached 1%. That's why I bought a new computer with a good GPU and CPU. Some of this work will be presented in my master's thesis. I would appreciate any help soon.

danialvi commented 2 years ago

I got this error now:


InternalError Traceback (most recent call last)
C:\ProgramData\anaconda3\envs\airsim\lib\site-packages\tensorflow\python\client\session.py in _do_call(self, fn, *args)
   1321 try:
-> 1322 return fn(*args)
   1323 except errors.OpError as e:

C:\ProgramData\anaconda3\envs\airsim\lib\site-packages\tensorflow\python\client\session.py in _run_fn(feed_dict, fetch_list, target_list, options, run_metadata)
   1306 return self._call_tf_sessionrun(
-> 1307 options, feed_dict, fetch_list, target_list, run_metadata)
   1308

C:\ProgramData\anaconda3\envs\airsim\lib\site-packages\tensorflow\python\client\session.py in _call_tf_sessionrun(self, options, feed_dict, fetch_list, target_list, run_metadata)
   1408 self._session, options, feed_dict, fetch_list, target_list,
-> 1409 run_metadata)
   1410 else:

InternalError: Blas GEMM launch failed : a.shape=(30, 64), b.shape=(64, 10), m=30, n=10, k=64
[[Node: dense2/MatMul = MatMul[T=DT_FLOAT, _class=["loc:@training/Nadam/gradients/dropout_2/cond/Merge_grad/cond_grad"], transpose_a=false, transpose_b=false, _device="/job:localhost/replica:0/task:0/device:GPU:0"](dropout_2/cond/Merge, dense2/kernel/read)]]
[[Node: loss/mul/_129 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device_incarnation=1, tensor_name="edge_1107_loss/mul", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:CPU:0"]()]]

During handling of the above exception, another exception occurred:

InternalError Traceback (most recent call last)

in
      1 history = model.fit_generator(train_generator, steps_per_epoch=num_train_examples//batch_size, epochs=500, callbacks=callbacks,\
----> 2 validation_data=eval_generator, validation_steps=num_eval_examples//batch_size, verbose=2)

C:\ProgramData\anaconda3\envs\airsim\lib\site-packages\keras\legacy\interfaces.py in wrapper(*args, **kwargs)
     85 warnings.warn('Update your `' + object_name +
     86 '` call to the Keras 2 API: ' + signature, stacklevel=2)
---> 87 return func(*args, **kwargs)
     88 wrapper._original_function = func
     89 return wrapper

C:\ProgramData\anaconda3\envs\airsim\lib\site-packages\keras\engine\training.py in fit_generator(self, generator, steps_per_epoch, epochs, verbose, callbacks, validation_data, validation_steps, class_weight, max_queue_size, workers, use_multiprocessing, shuffle, initial_epoch)
   2145 outs = self.train_on_batch(x, y,
   2146 sample_weight=sample_weight,
-> 2147 class_weight=class_weight)
   2148
   2149 if not isinstance(outs, list):

C:\ProgramData\anaconda3\envs\airsim\lib\site-packages\keras\engine\training.py in train_on_batch(self, x, y, sample_weight, class_weight)
   1837 ins = x + y + sample_weights
   1838 self._make_train_function()
-> 1839 outputs = self.train_function(ins)
   1840 if len(outputs) == 1:
   1841 return outputs[0]

C:\ProgramData\anaconda3\envs\airsim\lib\site-packages\keras\backend\tensorflow_backend.py in __call__(self, inputs)
   2355 session = get_session()
   2356 updated = session.run(fetches=fetches, feed_dict=feed_dict,
-> 2357 **self.session_kwargs)
   2358 return updated[:len(self.outputs)]
   2359

C:\ProgramData\anaconda3\envs\airsim\lib\site-packages\tensorflow\python\client\session.py in run(self, fetches, feed_dict, options, run_metadata)
    898 try:
    899 result = self._run(None, fetches, feed_dict, options_ptr,
--> 900 run_metadata_ptr)
    901 if run_metadata:
    902 proto_data = tf_session.TF_GetBuffer(run_metadata_ptr)

C:\ProgramData\anaconda3\envs\airsim\lib\site-packages\tensorflow\python\client\session.py in _run(self, handle, fetches, feed_dict, options, run_metadata)
   1133 if final_fetches or final_targets or (handle and feed_dict_tensor):
   1134 results = self._do_run(handle, final_targets, final_fetches,
-> 1135 feed_dict_tensor, options, run_metadata)
   1136 else:
   1137 results = []

C:\ProgramData\anaconda3\envs\airsim\lib\site-packages\tensorflow\python\client\session.py in _do_run(self, handle, target_list, fetch_list, feed_dict, options, run_metadata)
   1314 if handle is None:
   1315 return self._do_call(_run_fn, feeds, fetches, targets, options,
-> 1316 run_metadata)
   1317 else:
   1318 return self._do_call(_prun_fn, handle, feeds, fetches)

C:\ProgramData\anaconda3\envs\airsim\lib\site-packages\tensorflow\python\client\session.py in _do_call(self, fn, *args)
   1333 except KeyError:
   1334 pass
-> 1335 raise type(e)(node_def, op, message)
   1336
   1337 def _extend_graph(self):

InternalError: Blas GEMM launch failed : a.shape=(30, 64), b.shape=(64, 10), m=30, n=10, k=64
[[Node: dense2/MatMul = MatMul[T=DT_FLOAT, _class=["loc:@training/Nadam/gradients/dropout_2/cond/Merge_grad/cond_grad"], transpose_a=false, transpose_b=false, _device="/job:localhost/replica:0/task:0/device:GPU:0"](dropout_2/cond/Merge, dense2/kernel/read)]]
[[Node: loss/mul/_129 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device_incarnation=1, tensor_name="edge_1107_loss/mul", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:CPU:0"]()]]

Caused by op 'dense2/MatMul', defined at:
  File "C:\ProgramData\anaconda3\envs\airsim\lib\runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "C:\ProgramData\anaconda3\envs\airsim\lib\runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "C:\ProgramData\anaconda3\envs\airsim\lib\site-packages\ipykernel_launcher.py", line 16, in
    app.launch_new_instance()
  File "C:\ProgramData\anaconda3\envs\airsim\lib\site-packages\traitlets\config\application.py", line 664, in launch_instance
    app.start()
  File "C:\ProgramData\anaconda3\envs\airsim\lib\site-packages\ipykernel\kernelapp.py", line 612, in start
    self.io_loop.start()
  File "C:\ProgramData\anaconda3\envs\airsim\lib\site-packages\tornado\platform\asyncio.py", line 199, in start
    self.asyncio_loop.run_forever()
  File "C:\ProgramData\anaconda3\envs\airsim\lib\asyncio\base_events.py", line 442, in run_forever
    self._run_once()
  File "C:\ProgramData\anaconda3\envs\airsim\lib\asyncio\base_events.py", line 1462, in _run_once
    handle._run()
  File "C:\ProgramData\anaconda3\envs\airsim\lib\asyncio\events.py", line 145, in _run
    self._callback(*self._args)
  File "C:\ProgramData\anaconda3\envs\airsim\lib\site-packages\tornado\ioloop.py", line 688, in
    lambda f: self._run_callback(functools.partial(callback, future))
  File "C:\ProgramData\anaconda3\envs\airsim\lib\site-packages\tornado\ioloop.py", line 741, in _run_callback
    ret = callback()
  File "C:\ProgramData\anaconda3\envs\airsim\lib\site-packages\tornado\gen.py", line 814, in inner
    self.ctx_run(self.run)
  File "C:\ProgramData\anaconda3\envs\airsim\lib\site-packages\tornado\gen.py", line 162, in _fake_ctx_run
    return f(*args, **kw)
  File "C:\ProgramData\anaconda3\envs\airsim\lib\site-packages\tornado\gen.py", line 775, in run
    yielded = self.gen.send(value)
  File "C:\ProgramData\anaconda3\envs\airsim\lib\site-packages\ipykernel\kernelbase.py", line 365, in process_one
    yield gen.maybe_future(dispatch(*args))
  File "C:\ProgramData\anaconda3\envs\airsim\lib\site-packages\tornado\gen.py", line 234, in wrapper
    yielded = ctx_run(next, result)
  File "C:\ProgramData\anaconda3\envs\airsim\lib\site-packages\tornado\gen.py", line 162, in _fake_ctx_run
    return f(*args, **kw)
  File "C:\ProgramData\anaconda3\envs\airsim\lib\site-packages\ipykernel\kernelbase.py", line 268, in dispatch_shell
    yield gen.maybe_future(handler(stream, idents, msg))
  File "C:\ProgramData\anaconda3\envs\airsim\lib\site-packages\tornado\gen.py", line 234, in wrapper
    yielded = ctx_run(next, result)
  File "C:\ProgramData\anaconda3\envs\airsim\lib\site-packages\tornado\gen.py", line 162, in _fake_ctx_run
    return f(*args, **kw)
  File "C:\ProgramData\anaconda3\envs\airsim\lib\site-packages\ipykernel\kernelbase.py", line 545, in execute_request
    user_expressions, allow_stdin,
  File "C:\ProgramData\anaconda3\envs\airsim\lib\site-packages\tornado\gen.py", line 234, in wrapper
    yielded = ctx_run(next, result)
  File "C:\ProgramData\anaconda3\envs\airsim\lib\site-packages\tornado\gen.py", line 162, in _fake_ctx_run
    return f(*args, **kw)
  File "C:\ProgramData\anaconda3\envs\airsim\lib\site-packages\ipykernel\ipkernel.py", line 306, in do_execute
    res = shell.run_cell(code, store_history=store_history, silent=silent)
  File "C:\ProgramData\anaconda3\envs\airsim\lib\site-packages\ipykernel\zmqshell.py", line 536, in run_cell
    return super(ZMQInteractiveShell, self).run_cell(*args, **kwargs)
  File "C:\ProgramData\anaconda3\envs\airsim\lib\site-packages\IPython\core\interactiveshell.py", line 2867, in run_cell
    raw_cell, store_history, silent, shell_futures)
  File "C:\ProgramData\anaconda3\envs\airsim\lib\site-packages\IPython\core\interactiveshell.py", line 2895, in _run_cell
    return runner(coro)
  File "C:\ProgramData\anaconda3\envs\airsim\lib\site-packages\IPython\core\async_helpers.py", line 68, in _pseudo_sync_runner
    coro.send(None)
  File "C:\ProgramData\anaconda3\envs\airsim\lib\site-packages\IPython\core\interactiveshell.py", line 3072, in run_cell_async
    interactivity=interactivity, compiler=compiler, result=result)
  File "C:\ProgramData\anaconda3\envs\airsim\lib\site-packages\IPython\core\interactiveshell.py", line 3263, in run_ast_nodes
    if (await self.run_code(code, result, async_=asy)):
  File "C:\ProgramData\anaconda3\envs\airsim\lib\site-packages\IPython\core\interactiveshell.py", line 3343, in run_code
    exec(code_obj, self.user_global_ns, self.user_ns)
  File "", line 24, in
    merged = Dense(10, activation=activation, name='dense2')(merged)
  File "C:\ProgramData\anaconda3\envs\airsim\lib\site-packages\keras\engine\topology.py", line 603, in __call__
    output = self.call(inputs, **kwargs)
  File "C:\ProgramData\anaconda3\envs\airsim\lib\site-packages\keras\layers\core.py", line 843, in call
    output = K.dot(inputs, self.kernel)
  File "C:\ProgramData\anaconda3\envs\airsim\lib\site-packages\keras\backend\tensorflow_backend.py", line 1057, in dot
    out = tf.matmul(x, y)
  File "C:\ProgramData\anaconda3\envs\airsim\lib\site-packages\tensorflow\python\ops\math_ops.py", line 2122, in matmul
    a, b, transpose_a=transpose_a, transpose_b=transpose_b, name=name)
  File "C:\ProgramData\anaconda3\envs\airsim\lib\site-packages\tensorflow\python\ops\gen_math_ops.py", line 4278, in mat_mul
    name=name)
  File "C:\ProgramData\anaconda3\envs\airsim\lib\site-packages\tensorflow\python\framework\op_def_library.py", line 787, in _apply_op_helper
    op_def=op_def)
  File "C:\ProgramData\anaconda3\envs\airsim\lib\site-packages\tensorflow\python\framework\ops.py", line 3392, in create_op
    op_def=op_def)
  File "C:\ProgramData\anaconda3\envs\airsim\lib\site-packages\tensorflow\python\framework\ops.py", line 1718, in __init__
    self._traceback = self._graph._extract_stack() # pylint: disable=protected-access

InternalError (see above for traceback): Blas GEMM launch failed : a.shape=(30, 64), b.shape=(64, 10), m=30, n=10, k=64
[[Node: dense2/MatMul = MatMul[T=DT_FLOAT, _class=["loc:@training/Nadam/gradients/dropout_2/cond/Merge_grad/cond_grad"], transpose_a=false, transpose_b=false, _device="/job:localhost/replica:0/task:0/device:GPU:0"](dropout_2/cond/Merge, dense2/kernel/read)]]
[[Node: loss/mul/_129 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device_incarnation=1, tensor_name="edge_1107_loss/mul", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:CPU:0"]()]]

danialvi commented 2 years ago

@mitchellspryn please help

danialvi commented 2 years ago

@adshar

danialvi commented 2 years ago

depencies.txt Here is the list of dependencies I have in my anaconda env:

mitchellspryn commented 2 years ago

I am not at MSFT currently, so I am not actively supporting this repo any more.

That said, I took a look at your stack trace. It looks like CUDA isn't installed properly. Relevant portion:

InternalError: Blas GEMM launch failed : a.shape=(30, 64), b.shape=(64, 10), m=30, n=10, k=64
[[Node: dense2/MatMul = MatMul[T=DT_FLOAT, _class=["loc:@training/Nadam/gradients/dropout_2/cond/Merge_grad/cond_grad"], transpose_a=false, transpose_b=false, _device="/job:localhost/replica:0/task:0/device:GPU:0"](dropout_2/cond/Merge, dense2/kernel/read)]]

I'd check whether you can run any Keras training operation at all, e.g. try training a linear model on some random data points and see if forward and backward propagation work properly. My guess is no, and that will help you debug what is going on with your CUDA install.
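
The check suggested above could look roughly like the following. This is only a minimal sketch, not code from the cookbook: the function name `gpu_sanity_check` and its parameters are made up for illustration, and it assumes the standalone `keras` package and `numpy` are installed in the active environment. A single `Dense` layer is used because it runs a matrix multiplication, the same kind of op (`dense2/MatMul`) that fails in the stack trace.

```python
import traceback


def gpu_sanity_check(n_samples=256, n_features=8, epochs=2):
    """Train a tiny linear model on random data and return (ok, message).

    Hypothetical helper for illustration; it only reports whether a
    forward/backward pass completes, not which device was used.
    """
    try:
        import numpy as np
        from keras.models import Sequential
        from keras.layers import Dense
    except Exception as exc:
        return False, "import failed: {!r}".format(exc)

    # Random regression data: the values are meaningless, we only care
    # whether training runs without raising.
    x = np.random.rand(n_samples, n_features).astype("float32")
    y = np.random.rand(n_samples, 1).astype("float32")

    try:
        # One Dense unit with no activation is a plain linear model.
        model = Sequential()
        model.add(Dense(1, input_shape=(n_features,)))
        model.compile(optimizer="sgd", loss="mse")
        model.fit(x, y, batch_size=32, epochs=epochs, verbose=0)
    except Exception:
        return False, "training failed:\n" + traceback.format_exc()

    return True, "training ran: forward and backward passes completed"


if __name__ == "__main__":
    ok, message = gpu_sanity_check()
    print("OK" if ok else "FAILED")
    print(message)
```

If this fails with the same `Blas GEMM launch failed` error, the problem is likely in the GPU/CUDA setup rather than in the cookbook notebook; if it succeeds, the notebook's own model or data pipeline is the next place to look.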

danialvi commented 2 years ago

Thank you for answering. I have tried reinstalling to check whether it has something to do with CUDA. I also tried installing cudatoolkit and cuDNN before installing TensorFlow, following these steps:

conda install cudatoolkit=9.0
conda install cudnn=7.1.4=cuda9.0_0
conda install -c anaconda tensorflow-gpu=1.8.0
conda install -c anaconda keras-gpu=2.1.2
python -m pip install --upgrade pip
conda update -n base conda
pip install msgpack-rpc-python
pip uninstall tornado
conda install -c conda-forge tornado=4.5.3
conda install jupyter
pip install matplotlib==2.1.2
pip install image
pip install keras_tqdm
conda install -c conda-forge opencv
conda install pandas
pip install --upgrade numpy==1.16.4
conda install scipy
pip install opencv-python
pip install --upgrade h5py==2.10.0
python -m ipykernel install --user

I still have the same problem. Do you have any idea how I can solve this? I have looked it up extensively; many people seem to have had the same problem, but none of the suggested solutions worked for me. As this is part of my master's thesis, my time is limited as well.
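
After an install sequence like the one listed above, it may help to confirm what TensorFlow itself reports before starting a long training run. The snippet below is a rough sketch under stated assumptions, not something posted in this thread: `describe_tf_devices` is a hypothetical helper name, and it relies on `tf.test.is_built_with_cuda()` and `device_lib.list_local_devices()` being available in the installed TensorFlow build.

```python
def describe_tf_devices():
    """Return text lines describing the TensorFlow build and visible devices.

    Hypothetical diagnostic helper; it never raises, so it can be run in
    an environment where TensorFlow is missing or misconfigured.
    """
    try:
        import tensorflow as tf
        from tensorflow.python.client import device_lib
    except Exception as exc:
        return ["tensorflow import failed: {!r}".format(exc)]

    lines = ["tensorflow version: {}".format(tf.__version__)]

    try:
        lines.append("built with CUDA: {}".format(tf.test.is_built_with_cuda()))
        devices = device_lib.list_local_devices()
    except Exception as exc:
        lines.append("device enumeration failed: {!r}".format(exc))
        return lines

    # One line per device TensorFlow can see (CPU, and GPU if usable).
    for dev in devices:
        lines.append("{} ({}) {}".format(
            dev.name, dev.device_type, dev.physical_device_desc))

    if not any(dev.device_type == "GPU" for dev in devices):
        lines.append("no GPU device visible to TensorFlow")
    return lines


if __name__ == "__main__":
    for line in describe_tf_devices():
        print(line)
```

Comparing the reported TensorFlow version and GPU description against the CUDA and cuDNN versions installed via conda is one way to narrow down whether the combination is a supported one.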