Open: yura1h opened this issue 4 years ago
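I'm having trouble training the model because of the error below. I can't tell whether it is a TensorFlow version problem or a cuDNN problem. I've tried running it again with a different installation environment, but it still doesn't work, so I'd appreciate any advice.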
tf.keras.layers.Conv2D instead.
WARNING:tensorflow:From /home/modulabs-04/.local/lib/python3.6/site-packages/tensorflow_core/python/layers/convolutional.py:424: Layer.apply (from tensorflow.python.keras.engine.base_layer) is deprecated and will be removed in a future version.
Instructions for updating:
Please use layer.__call__ method instead.
WARNING:tensorflow:From /home/modulabs-04/AIcollege/ondevicemodel/SqueezeNet/squeezenet_model.py:61: max_pooling2d (from tensorflow.python.layers.pooling) is deprecated and will be removed in a future version.
Instructions for updating:
Use keras.layers.MaxPooling2D instead.
WARNING:tensorflow:From /home/modulabs-04/AIcollege/ondevicemodel/SqueezeNet/squeezenet_model.py:77: dropout (from tensorflow.python.layers.core) is deprecated and will be removed in a future version.
Instructions for updating:
Use keras.layers.dropout instead.
WARNING:tensorflow:From /home/modulabs-04/AIcollege/ondevicemodel/SqueezeNet/squeezenet_model.py:81: average_pooling2d (from tensorflow.python.layers.pooling) is deprecated and will be removed in a future version.
Instructions for updating:
Use keras.layers.AveragePooling2D instead.
strating training
2020-03-08 01:15:17.023983: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcuda.so.1
2020-03-08 01:15:17.051795: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-03-08 01:15:17.052315: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1555] Found device 0 with properties:
pciBusID: 0000:01:00.0 name: GeForce RTX 2070 computeCapability: 7.5
coreClock: 1.44GHz coreCount: 36 deviceMemorySize: 7.79GiB deviceMemoryBandwidth: 417.29GiB/s
2020-03-08 01:15:17.052379: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.1
2020-03-08 01:15:17.052451: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10
2020-03-08 01:15:17.053674: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcufft.so.10
2020-03-08 01:15:17.053939: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcurand.so.10
2020-03-08 01:15:17.055084: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusolver.so.10
2020-03-08 01:15:17.055739: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusparse.so.10
2020-03-08 01:15:17.055763: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7
2020-03-08 01:15:17.055847: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-03-08 01:15:17.056359: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-03-08 01:15:17.056786: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1697] Adding visible gpu devices: 0
2020-03-08 01:15:17.057120: I tensorflow/core/platform/cpu_feature_guard.cc:142] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
2020-03-08 01:15:17.081380: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 2599990000 Hz
2020-03-08 01:15:17.082468: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x70297c0 initialized for platform Host (this does not guarantee that XLA will be used). Devices:
2020-03-08 01:15:17.082513: I tensorflow/compiler/xla/service/service.cc:176] StreamExecutor device (0): Host, Default Version
2020-03-08 01:15:17.171972: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-03-08 01:15:17.172330: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x70bf700 initialized for platform CUDA (this does not guarantee that XLA will be used). Devices:
2020-03-08 01:15:17.172345: I tensorflow/compiler/xla/service/service.cc:176] StreamExecutor device (0): GeForce RTX 2070, Compute Capability 7.5
2020-03-08 01:15:17.172474: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-03-08 01:15:17.172733: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1555] Found device 0 with properties:
pciBusID: 0000:01:00.0 name: GeForce RTX 2070 computeCapability: 7.5
coreClock: 1.44GHz coreCount: 36 deviceMemorySize: 7.79GiB deviceMemoryBandwidth: 417.29GiB/s
2020-03-08 01:15:17.172754: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.1
2020-03-08 01:15:17.172761: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10
2020-03-08 01:15:17.172772: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcufft.so.10
2020-03-08 01:15:17.172781: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcurand.so.10
2020-03-08 01:15:17.172789: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusolver.so.10
2020-03-08 01:15:17.172796: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusparse.so.10
2020-03-08 01:15:17.172802: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7
2020-03-08 01:15:17.172831: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-03-08 01:15:17.173141: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-03-08 01:15:17.173383: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1697] Adding visible gpu devices: 0
2020-03-08 01:15:17.173404: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.1
2020-03-08 01:15:17.361479: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1096] Device interconnect StreamExecutor with strength 1 edge matrix:
2020-03-08 01:15:17.361527: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1102] 0
2020-03-08 01:15:17.361534: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] 0: N
2020-03-08 01:15:17.361769: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-03-08 01:15:17.362116: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-03-08 01:15:17.362429: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1241] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 7144 MB memory) -> physical GPU (device: 0, name: GeForce RTX 2070, pci bus id: 0000:01:00.0, compute capability: 7.5)
2020-03-08 01:15:19.493788: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7
2020-03-08 01:15:20.189719: E tensorflow/stream_executor/cuda/cuda_dnn.cc:329] Could not create cudnn handle: CUDNN_STATUS_INTERNAL_ERROR
2020-03-08 01:15:20.192070: E tensorflow/stream_executor/cuda/cuda_dnn.cc:329] Could not create cudnn handle: CUDNN_STATUS_INTERNAL_ERROR
2020-03-08 01:15:20.193265: E tensorflow/stream_executor/cuda/cuda_dnn.cc:329] Could not create cudnn handle: CUDNN_STATUS_INTERNAL_ERROR
2020-03-08 01:15:20.194263: E tensorflow/stream_executor/cuda/cuda_dnn.cc:329] Could not create cudnn handle: CUDNN_STATUS_INTERNAL_ERROR
2020-03-08 01:15:20.195279: E tensorflow/stream_executor/cuda/cuda_dnn.cc:329] Could not create cudnn handle: CUDNN_STATUS_INTERNAL_ERROR
2020-03-08 01:15:20.196446: E tensorflow/stream_executor/cuda/cuda_dnn.cc:329] Could not create cudnn handle: CUDNN_STATUS_INTERNAL_ERROR
Traceback (most recent call last):
File "/home/modulabs-04/.local/lib/python3.6/site-packages/tensorflow_core/python/client/session.py", line 1367, in _do_call
return fn(*args)
File "/home/modulabs-04/.local/lib/python3.6/site-packages/tensorflow_core/python/client/session.py", line 1352, in _run_fn
target_list, run_metadata)
File "/home/modulabs-04/.local/lib/python3.6/site-packages/tensorflow_core/python/client/session.py", line 1445, in _call_tf_sessionrun
run_metadata)
tensorflow.python.framework.errors_impl.UnknownError: Failed to get convolution algorithm. This is probably because cuDNN failed to initialize, so try looking to see if a warning log message was printed above.
[[{{node squeezenet_v0/conv1/conv2d/Conv2D}}]]
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "train_squeezenet.py", line 182, in
Errors may have originated from an input operation. Input Source operations connected to node squeezenet_v0/conv1/conv2d/Conv2D: Placeholder (defined at /home/modulabs-04/AIcollege/ondevicemodel/SqueezeNet/squeezenet_model.py:51)
Original stack trace for 'squeezenet_v0/conv1/conv2d/Conv2D':
File "train_squeezenet.py", line 180, in
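Regarding the CUDA and other environment setup shown above: did you wipe the existing installation, set the environment up again with Lambda Stack as I suggested earlier, and try training the model on top of that? *Reference: Lambda Stack https://lambdalabs.com/lambda-stack-deep-learning-software
The TF version problem, the cuDNN problem, and the other things you mentioned all seem to be related to environment setup, so I think installing everything in one go with Lambda Stack and training the model on top of that would be the fastest route.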
Errors may have originated from an input operation. Input Source operations connected to node squeezenet_v0/conv1/conv2d/Conv2D:
Searching for similar errors related to this part turns up https://github.com/tensorflow/tensorflow/issues/24650. Could it be a TF version problem?
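For reference, a workaround often reported for the "Could not create cudnn handle" / "Failed to get convolution algorithm" error is to let TensorFlow allocate GPU memory on demand instead of reserving almost all of it at startup. The sketch below assumes the training script uses a TF1-style session, as the traceback through `session.py` suggests. The helper name `make_session` is hypothetical, and this has not been confirmed to fix this particular setup.

```python
import os

# Ask TensorFlow to grow GPU memory on demand.
# This must be set before TensorFlow initializes the GPU, so it comes before the import.
os.environ["TF_FORCE_GPU_ALLOW_GROWTH"] = "true"


def make_session():
    """Hypothetical helper: build a TF1-style session with on-demand GPU memory."""
    import tensorflow as tf  # imported lazily so the env var above is already in place

    config = tf.compat.v1.ConfigProto()
    config.gpu_options.allow_growth = True  # same effect as the env var, set explicitly
    return tf.compat.v1.Session(config=config)
```

In the training script, the session would then be created with `sess = make_session()` rather than a bare `tf.Session()`. If the error persists with memory growth enabled, a version mismatch between TensorFlow, CUDA, and cuDNN becomes the more likely cause.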
I also struggled quite a bit with installation in the past because I didn't meet the version requirements. It looks like you need cuDNN 7.6 or later to use TF 2.1 or later.
https://www.tensorflow.org/install/gpu
Also, I don't see a cuDNN release that supports CUDA 10.2, so I think you need to lower the CUDA version as well. https://developer.nvidia.com/rdp/cudnn-archive
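To compare these requirements with what TensorFlow actually loaded, the `dso_loader` lines in the log above can be parsed for the library sonames. This is a minimal sketch; the function name `loaded_library_versions` is hypothetical, and the sample text is two lines copied from the posted log.

```python
import re

# Matches lines such as:
#   ... Successfully opened dynamic library libcudart.so.10.1
_DSO_LINE = re.compile(r"Successfully opened dynamic library (lib\w+)\.so\.([\d.]+)")


def loaded_library_versions(log_text):
    """Return {library name: soname version} for each library the log reports as loaded."""
    versions = {}
    for match in _DSO_LINE.finditer(log_text):
        name, version = match.groups()
        versions[name] = version
    return versions


sample_log = """\
2020-03-08 01:15:17.052379: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.1
2020-03-08 01:15:17.055763: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7
"""

print(loaded_library_versions(sample_log))
```

For the posted log this reports `libcudart` 10.1 and `libcudnn` 7, so the CUDA runtime in use is 10.1. The soname carries only the cuDNN major version, so it cannot show whether the installed cuDNN is 7.6 or older; that has to be checked against the installed cuDNN package itself.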
As I mentioned in another issue, I recommend first hooking up a pre-trained model that is already provided and checking that the whole pipeline runs end to end.
https://github.com/nnstreamer-preprocessor/nnstreamer/issues/5#issuecomment-596180448
After that, I recommend building a pipeline that trains that same model, swapping in the result, checking that it produces the same score, and then improving the model from there. @H-YURA
Goal: test the nnstreamer simple example (mobile_ssd_v2_coco.tflite) and train a squeezenet model
Deliverables: nnstreamer Ubuntu environment setup, resolution of build and dependency issues, a squeezenet model that can run on nnstreamer
Due date: by 03/08