nnstreamer model environment setup and SqueezeNet training

yura1h commented 4 years ago

Goal: test the nnstreamer simple example (mobile_ssd_v2_coco.tflite) and train a SqueezeNet model
Deliverables: an nnstreamer environment on Ubuntu with the build and dependency issues resolved, and a SqueezeNet model that can run on nnstreamer
Due date: by 03/08

yura1h commented 4 years ago

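I'm having trouble training the model because of the error below. I can't tell whether it is a TensorFlow version problem or a cuDNN problem. I've tried changing the installation environment and rerunning, but it still doesn't work, so I'd appreciate any advice.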

nvidia-driver: 440.44
CUDA_v10.2
Cudnn_v7
Tensorflow_v2.1.0
python_v3.6.9

////////////////////////////////////////////// bash ///////////////////////////////////////////////////////////////////////

(cv) modulabs-04@modulabs-ROG-Strix-G731GW-G731GW:~/AIcollege/ondevicemodel/SqueezeNet$ sudo python3 train_squeezenet.py
[sudo] password for modulabs-04:
2020-03-08 01:15:14.741060: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libnvinfer.so.6
2020-03-08 01:15:14.742095: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libnvinfer_plugin.so.6
/usr/local/lib/python3.6/dist-packages/tensorboard/compat/tensorflow_stub/dtypes.py:541: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
  _np_qint8 = np.dtype([("qint8", np.int8, 1)])
/usr/local/lib/python3.6/dist-packages/tensorboard/compat/tensorflow_stub/dtypes.py:542: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
  _np_quint8 = np.dtype([("quint8", np.uint8, 1)])
/usr/local/lib/python3.6/dist-packages/tensorboard/compat/tensorflow_stub/dtypes.py:543: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
  _np_qint16 = np.dtype([("qint16", np.int16, 1)])
/usr/local/lib/python3.6/dist-packages/tensorboard/compat/tensorflow_stub/dtypes.py:544: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
  _np_quint16 = np.dtype([("quint16", np.uint16, 1)])
/usr/local/lib/python3.6/dist-packages/tensorboard/compat/tensorflow_stub/dtypes.py:545: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
  _np_qint32 = np.dtype([("qint32", np.int32, 1)])
/usr/local/lib/python3.6/dist-packages/tensorboard/compat/tensorflow_stub/dtypes.py:550: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
  np_resource = np.dtype([("resource", np.ubyte, 1)])
WARNING:tensorflow:From /home/modulabs-04/.local/lib/python3.6/site-packages/tensorflow_core/python/compat/v2_compat.py:88: disable_resource_variables (from tensorflow.python.ops.variable_scope) is deprecated and will be removed in a future version.
Instructions for updating:
non-resource variables are not supported in the long term
WARNING:tensorflow:From /home/modulabs-04/AIcollege/ondevicemodel/SqueezeNet/squeezenet_model.py:38: conv2d (from tensorflow.python.layers.convolutional) is deprecated and will be removed in a future version.
Instructions for updating:
Use tf.keras.layers.Conv2D instead.
WARNING:tensorflow:From /home/modulabs-04/.local/lib/python3.6/site-packages/tensorflow_core/python/layers/convolutional.py:424: Layer.apply (from tensorflow.python.keras.engine.base_layer) is deprecated and will be removed in a future version.
Instructions for updating:
Please use layer.__call__ method instead.
WARNING:tensorflow:From /home/modulabs-04/AIcollege/ondevicemodel/SqueezeNet/squeezenet_model.py:61: max_pooling2d (from tensorflow.python.layers.pooling) is deprecated and will be removed in a future version.
Instructions for updating:
Use keras.layers.MaxPooling2D instead.
WARNING:tensorflow:From /home/modulabs-04/AIcollege/ondevicemodel/SqueezeNet/squeezenet_model.py:77: dropout (from tensorflow.python.layers.core) is deprecated and will be removed in a future version.
Instructions for updating:
Use keras.layers.dropout instead.
WARNING:tensorflow:From /home/modulabs-04/AIcollege/ondevicemodel/SqueezeNet/squeezenet_model.py:81: average_pooling2d (from tensorflow.python.layers.pooling) is deprecated and will be removed in a future version.
Instructions for updating:
Use keras.layers.AveragePooling2D instead.
strating training
2020-03-08 01:15:17.023983: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcuda.so.1
2020-03-08 01:15:17.051795: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-03-08 01:15:17.052315: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1555] Found device 0 with properties:
pciBusID: 0000:01:00.0 name: GeForce RTX 2070 computeCapability: 7.5
coreClock: 1.44GHz coreCount: 36 deviceMemorySize: 7.79GiB deviceMemoryBandwidth: 417.29GiB/s
2020-03-08 01:15:17.052379: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.1
2020-03-08 01:15:17.052451: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10
2020-03-08 01:15:17.053674: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcufft.so.10
2020-03-08 01:15:17.053939: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcurand.so.10
2020-03-08 01:15:17.055084: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusolver.so.10
2020-03-08 01:15:17.055739: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusparse.so.10
2020-03-08 01:15:17.055763: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7
2020-03-08 01:15:17.055847: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-03-08 01:15:17.056359: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-03-08 01:15:17.056786: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1697] Adding visible gpu devices: 0
2020-03-08 01:15:17.057120: I tensorflow/core/platform/cpu_feature_guard.cc:142] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
2020-03-08 01:15:17.081380: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 2599990000 Hz
2020-03-08 01:15:17.082468: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x70297c0 initialized for platform Host (this does not guarantee that XLA will be used). Devices:
2020-03-08 01:15:17.082513: I tensorflow/compiler/xla/service/service.cc:176] StreamExecutor device (0): Host, Default Version
2020-03-08 01:15:17.171972: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-03-08 01:15:17.172330: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x70bf700 initialized for platform CUDA (this does not guarantee that XLA will be used). Devices:
2020-03-08 01:15:17.172345: I tensorflow/compiler/xla/service/service.cc:176] StreamExecutor device (0): GeForce RTX 2070, Compute Capability 7.5
2020-03-08 01:15:17.172474: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-03-08 01:15:17.172733: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1555] Found device 0 with properties:
pciBusID: 0000:01:00.0 name: GeForce RTX 2070 computeCapability: 7.5
coreClock: 1.44GHz coreCount: 36 deviceMemorySize: 7.79GiB deviceMemoryBandwidth: 417.29GiB/s
2020-03-08 01:15:17.172754: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.1
2020-03-08 01:15:17.172761: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10
2020-03-08 01:15:17.172772: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcufft.so.10
2020-03-08 01:15:17.172781: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcurand.so.10
2020-03-08 01:15:17.172789: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusolver.so.10
2020-03-08 01:15:17.172796: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusparse.so.10
2020-03-08 01:15:17.172802: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7
2020-03-08 01:15:17.172831: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-03-08 01:15:17.173141: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-03-08 01:15:17.173383: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1697] Adding visible gpu devices: 0
2020-03-08 01:15:17.173404: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.1
2020-03-08 01:15:17.361479: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1096] Device interconnect StreamExecutor with strength 1 edge matrix:
2020-03-08 01:15:17.361527: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1102] 0
2020-03-08 01:15:17.361534: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] 0: N
2020-03-08 01:15:17.361769: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-03-08 01:15:17.362116: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-03-08 01:15:17.362429: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1241] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 7144 MB memory) -> physical GPU (device: 0, name: GeForce RTX 2070, pci bus id: 0000:01:00.0, compute capability: 7.5)
2020-03-08 01:15:19.493788: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7
2020-03-08 01:15:20.189719: E tensorflow/stream_executor/cuda/cuda_dnn.cc:329] Could not create cudnn handle: CUDNN_STATUS_INTERNAL_ERROR
2020-03-08 01:15:20.192070: E tensorflow/stream_executor/cuda/cuda_dnn.cc:329] Could not create cudnn handle: CUDNN_STATUS_INTERNAL_ERROR
2020-03-08 01:15:20.193265: E tensorflow/stream_executor/cuda/cuda_dnn.cc:329] Could not create cudnn handle: CUDNN_STATUS_INTERNAL_ERROR
2020-03-08 01:15:20.194263: E tensorflow/stream_executor/cuda/cuda_dnn.cc:329] Could not create cudnn handle: CUDNN_STATUS_INTERNAL_ERROR
2020-03-08 01:15:20.195279: E tensorflow/stream_executor/cuda/cuda_dnn.cc:329] Could not create cudnn handle: CUDNN_STATUS_INTERNAL_ERROR
2020-03-08 01:15:20.196446: E tensorflow/stream_executor/cuda/cuda_dnn.cc:329] Could not create cudnn handle: CUDNN_STATUS_INTERNAL_ERROR
Traceback (most recent call last):
  File "/home/modulabs-04/.local/lib/python3.6/site-packages/tensorflow_core/python/client/session.py", line 1367, in _do_call
    return fn(*args)
  File "/home/modulabs-04/.local/lib/python3.6/site-packages/tensorflow_core/python/client/session.py", line 1352, in _run_fn
    target_list, run_metadata)
  File "/home/modulabs-04/.local/lib/python3.6/site-packages/tensorflow_core/python/client/session.py", line 1445, in _call_tf_sessionrun
    run_metadata)
tensorflow.python.framework.errors_impl.UnknownError: Failed to get convolution algorithm. This is probably because cuDNN failed to initialize, so try looking to see if a warning log message was printed above.
  [[{{node squeezenet_v0/conv1/conv2d/Conv2D}}]]

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "train_squeezenet.py", line 182, in train(sq_net,lr_rate,max_iter,out_classes,batch_size,tr_data_files,tr_labels,cv_data_files,cv_labels,log_file)
  File "train_squeezenet.py", line 110, in train
    sess.run([sq_net.v0_opt,sq_net.v0_res_opt,sq_net.v1_opt],feed_dict={sq_net.inputs:batch_images,sq_net.labels:batch_labels,sq_net.lr_rate:lr_rate})
  File "/home/modulabs-04/.local/lib/python3.6/site-packages/tensorflow_core/python/client/session.py", line 960, in run
    run_metadata_ptr)
  File "/home/modulabs-04/.local/lib/python3.6/site-packages/tensorflow_core/python/client/session.py", line 1183, in _run
    feed_dict_tensor, options, run_metadata)
  File "/home/modulabs-04/.local/lib/python3.6/site-packages/tensorflow_core/python/client/session.py", line 1361, in _do_run
    run_metadata)
  File "/home/modulabs-04/.local/lib/python3.6/site-packages/tensorflow_core/python/client/session.py", line 1386, in _do_call
    raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.UnknownError: Failed to get convolution algorithm. This is probably because cuDNN failed to initialize, so try looking to see if a warning log message was printed above.
  [[node squeezenet_v0/conv1/conv2d/Conv2D (defined at /home/modulabs-04/AIcollege/ondevicemodel/SqueezeNet/squeezenet_model.py:38) ]]

Errors may have originated from an input operation. Input Source operations connected to node squeezenet_v0/conv1/conv2d/Conv2D: Placeholder (defined at /home/modulabs-04/AIcollege/ondevicemodel/SqueezeNet/squeezenet_model.py:51)

Original stack trace for 'squeezenet_v0/conv1/conv2d/Conv2D':
  File "train_squeezenet.py", line 180, in sq_net = SqueezeNet(input_shape,out_classes,lr_rate,is_train)
  File "/home/modulabs-04/AIcollege/ondevicemodel/SqueezeNet/squeezenet_model.py", line 53, in init
    self.loss_v0,self.loss_v0_res,self.loss_v1 = self.model_loss(self.inputs,self.labels,train)
  File "/home/modulabs-04/AIcollege/ondevicemodel/SqueezeNet/squeezenet_model.py", line 162, in model_loss
    logits_v0 = self.model_arc_v0(inputs,train)
  File "/home/modulabs-04/AIcollege/ondevicemodel/SqueezeNet/squeezenet_model.py", line 60, in model_arc_v0
    conv1 = general_conv(inputs,filters=96,kernel=7,stride=2,padding="SAME",name="conv1",relu=True,weight="Xavier")
  File "/home/modulabs-04/AIcollege/ondevicemodel/SqueezeNet/squeezenet_model.py", line 38, in general_conv
    conv = tf.layers.conv2d(inputs,filters,kernel,stride,padding,kernel_initializer=w_init)
  File "/home/modulabs-04/.local/lib/python3.6/site-packages/tensorflow_core/python/util/deprecation.py", line 324, in new_func
    return func(*args, kwargs)
  File "/home/modulabs-04/.local/lib/python3.6/site-packages/tensorflow_core/python/layers/convolutional.py", line 424, in conv2d
    return layer.apply(inputs)
  File "/home/modulabs-04/.local/lib/python3.6/site-packages/tensorflow_core/python/util/deprecation.py", line 324, in new_func
    return func(*args, *kwargs)
  File "/home/modulabs-04/.local/lib/python3.6/site-packages/tensorflow_core/python/keras/engine/base_layer.py", line 1672, in apply
    return self.call(inputs, args, kwargs)
  File "/home/modulabs-04/.local/lib/python3.6/site-packages/tensorflow_core/python/layers/base.py", line 547, in call
    outputs = super(Layer, self).call(inputs, *args, kwargs)
  File "/home/modulabs-04/.local/lib/python3.6/site-packages/tensorflow_core/python/keras/engine/base_layer.py", line 778, in call
    outputs = call_fn(cast_inputs, *args, *kwargs)
  File "/home/modulabs-04/.local/lib/python3.6/site-packages/tensorflow_core/python/autograph/impl/api.py", line 234, in wrapper
    return converted_call(f, args, kwargs, options=options)
  File "/home/modulabs-04/.local/lib/python3.6/site-packages/tensorflow_core/python/autograph/impl/api.py", line 459, in converted_call
    return _call_unconverted(f, args, kwargs, options)
  File "/home/modulabs-04/.local/lib/python3.6/site-packages/tensorflow_core/python/autograph/impl/api.py", line 330, in _call_unconverted
    return f(args, kwargs)
  File "/home/modulabs-04/.local/lib/python3.6/site-packages/tensorflow_core/python/keras/layers/convolutional.py", line 209, in call
    outputs = self._convolution_op(inputs, self.kernel)
  File "/home/modulabs-04/.local/lib/python3.6/site-packages/tensorflow_core/python/ops/nn_ops.py", line 1135, in call
    return self.conv_op(inp, filter)
  File "/home/modulabs-04/.local/lib/python3.6/site-packages/tensorflow_core/python/ops/nn_ops.py", line 640, in call
    return self.call(inp, filter)
  File "/home/modulabs-04/.local/lib/python3.6/site-packages/tensorflow_core/python/ops/nn_ops.py", line 239, in call
    name=self.name)
  File "/home/modulabs-04/.local/lib/python3.6/site-packages/tensorflow_core/python/ops/nn_ops.py", line 2011, in conv2d
    name=name)
  File "/home/modulabs-04/.local/lib/python3.6/site-packages/tensorflow_core/python/ops/gen_nn_ops.py", line 969, in conv2d
    data_format=data_format, dilations=dilations, name=name)
  File "/home/modulabs-04/.local/lib/python3.6/site-packages/tensorflow_core/python/framework/op_def_library.py", line 742, in _apply_op_helper
    attrs=attr_protos, op_def=op_def)
  File "/home/modulabs-04/.local/lib/python3.6/site-packages/tensorflow_core/python/framework/ops.py", line 3322, in _create_op_internal
    op_def=op_def)
  File "/home/modulabs-04/.local/lib/python3.6/site-packages/tensorflow_core/python/framework/ops.py", line 1756, in init
    self._traceback = tf_stack.extract_stack()
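A log this long is easier to read once it is reduced to two things: which shared libraries TensorFlow actually loaded, and which lines are errors. The sketch below does that on three lines copied from the log above. `summarize`, `LIB_RE`, and `ERR_RE` are hypothetical names chosen for this example, and the regular expressions assume the `date time: LEVEL file:line] message` layout seen in this log.

```python
import re

# Three lines copied verbatim from the log above.
LOG = """\
2020-03-08 01:15:17.052379: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.1
2020-03-08 01:15:17.055763: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7
2020-03-08 01:15:20.189719: E tensorflow/stream_executor/cuda/cuda_dnn.cc:329] Could not create cudnn handle: CUDNN_STATUS_INTERNAL_ERROR
"""

# Matches the library name in a "Successfully opened dynamic library" line.
LIB_RE = re.compile(r"Successfully opened dynamic library (\S+)")
# Matches error-level lines ("E") and captures the message after "file:line] ".
ERR_RE = re.compile(r"^\S+ \S+ E (\S+)\] (.*)$")


def summarize(log_text):
    """Return (loaded libraries, error messages), both in order of appearance."""
    libs, errors = [], []
    for line in log_text.splitlines():
        lib = LIB_RE.search(line)
        if lib and lib.group(1) not in libs:
            libs.append(lib.group(1))
        err = ERR_RE.match(line)
        if err:
            errors.append(err.group(2))
    return libs, errors


libs, errors = summarize(LOG)
print(libs)
print(errors)
```

On these lines the loaded runtime is libcudart.so.10.1, even though the environment list above says CUDA 10.2, and the only error is the cuDNN handle failure.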

hayleyshim commented 4 years ago

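Regarding the CUDA and other environment settings above: have you tried wiping the existing installation, setting the environment up again with the Lambda Stack I mentioned earlier, and training the model on top of that?

*Reference: Lambda Stack https://lambdalabs.com/lambda-stack-deep-learning-software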

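The things you mentioned (the TF version, cuDNN, and so on) all seem related to the environment setup, so I think the fastest route is to set everything up in one go with Lambda Stack and train the model on top of it.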

hayleyshim commented 4 years ago

Errors may have originated from an input operation. Input Source operations connected to node squeezenet_v0/conv1/conv2d/Conv2D:

Searching for errors similar to this part turns up https://github.com/tensorflow/tensorflow/issues/24650. Could it be a TF version problem?
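A workaround often reported for `Could not create cudnn handle: CUDNN_STATUS_INTERNAL_ERROR` is to let TensorFlow allocate GPU memory on demand instead of reserving nearly all of it at startup. The sketch below assumes that memory allocation is the cause here, which nothing in this thread confirms. `make_session` is a hypothetical helper; it targets the TF1-style `sess.run` usage visible in the traceback from train_squeezenet.py.

```python
import os

# Assumption: this runs before TensorFlow touches the GPU, so the
# environment variable is seen when the device is initialized.
os.environ["TF_FORCE_GPU_ALLOW_GROWTH"] = "true"


def make_session():
    """Build a TF1-style session that grows GPU memory on demand.

    Hypothetical helper: the training script would use the returned
    session in place of the one it currently creates.
    """
    # Imported lazily so that the environment variable above is set first.
    import tensorflow as tf

    config = tf.compat.v1.ConfigProto()
    config.gpu_options.allow_growth = True  # do not reserve all GPU memory
    return tf.compat.v1.Session(config=config)
```

The environment variable and the `allow_growth` option do the same job; setting both is redundant but harmless, and the variable alone is enough when the script cannot be edited.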

ddeokho commented 4 years ago

Screenshot from 2020-03-08 11-16-59

I also struggled quite a bit in the past when my installation didn't meet the version requirements. It looks like you need cuDNN 7.6 or later to use TF 2.1 or later.

https://www.tensorflow.org/install/gpu

Also, I don't see a cuDNN build that supports CUDA 10.2, so you'll probably need to lower the CUDA version as well. https://developer.nvidia.com/rdp/cudnn-archive
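The version comparison described above can be sketched as a small check. The `REQUIRED` table assumes the tested build configuration that the TensorFlow install pages list for 2.1.0 (CUDA 10.1, cuDNN 7.6); the `INSTALLED` values are the ones from the first post; `parse` and `check` are hypothetical helpers.

```python
# Assumed from the TensorFlow tested build configurations for 2.1.0.
REQUIRED = {"cuda": "10.1", "cudnn": "7.6"}

# Versions reported in the first post of this thread.
INSTALLED = {"cuda": "10.2", "cudnn": "7"}


def parse(version):
    """Turn '10.1' into (10, 1) so versions compare numerically."""
    return tuple(int(part) for part in version.split("."))


def check(installed, required=REQUIRED):
    """Return a list of human-readable mismatches (empty if none)."""
    problems = []
    # Prebuilt TensorFlow links against one specific CUDA major.minor.
    if parse(installed["cuda"])[:2] != parse(required["cuda"]):
        problems.append(
            f"CUDA {installed['cuda']} installed, {required['cuda']} expected"
        )
    have, need = parse(installed["cudnn"]), parse(required["cudnn"])
    if len(have) < len(need):
        # Only the major version is known, so the minimum cannot be verified.
        problems.append(
            f"cuDNN minor version unknown, need >= {required['cudnn']}"
        )
    elif have[:2] < need:
        problems.append(
            f"cuDNN {installed['cudnn']} installed, need >= {required['cudnn']}"
        )
    return problems


for problem in check(INSTALLED):
    print(problem)
```

With the posted versions this reports both a CUDA mismatch and an unverifiable cuDNN minor version, so finding the exact cuDNN release (for example from the installed package name) would be a reasonable first step.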

jwkanggist commented 4 years ago

As I mentioned in another issue, I recommend first connecting an already trained, publicly provided model and confirming that the whole pipeline runs end to end.

https://github.com/nnstreamer-preprocessor/nnstreamer/issues/5#issuecomment-596180448

Then I recommend building a pipeline that trains that same model, swapping it in, checking that it reaches the same score, and improving the model from there. @H-YURA
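The "same score" check in that workflow can be as simple as comparing evaluation metrics within a tolerance. The sketch below is only an illustration: `scores_match`, the metric name, the tolerance, and the numbers are all hypothetical and do not come from this project.

```python
def scores_match(reference, candidate, tolerance=0.01):
    """Return True if every reference metric is reproduced within tolerance.

    reference: metrics of the provided pretrained model, e.g. {"top1": 0.57}
    candidate: metrics of the retrained model, same keys
    """
    for name, expected in reference.items():
        if name not in candidate:
            return False  # a missing metric counts as a mismatch
        if abs(candidate[name] - expected) > tolerance:
            return False
    return True


# Placeholder numbers for illustration only.
pretrained = {"top1": 0.570}
retrained = {"top1": 0.565}
print(scores_match(pretrained, retrained))
```

Running the same evaluation script on both models keeps the comparison fair, since preprocessing differences can change scores more than the training run does.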

nnstreamer-preprocessor / nnstreamer

nnstreamer model environment setup and SqueezeNet training #8