peterlee0127 / tensorflow-nvJetson

TensorFlow for NVIDIA Jetson; also includes patches and scripts for building.
https://tfjetson.peterlee.app

Status of TensorFlow 2.0 for Jetson. #29

Closed peterlee0127 closed 5 years ago

peterlee0127 commented 5 years ago

Most of the other compile issues are resolved.

NVIDIA Jetson TX2
JetPack 3.3
python 3.5
tensorflow-2.0.0a0-cp35-cp35m-linux_aarch64.whl

Current issue: cuda_driver now adds a check on GPU memory pointers (the Jetson GPU has no dedicated VRAM; it shares memory with the CPU). I don't have any idea how to fix it yet... It seems to have been introduced in TensorFlow 1.13. This version uses CUDA 10. (See the note after the logs below.)

2019-03-08 19:32:56.481099: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1021]      0
2019-03-08 19:32:56.481123: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1034] 0:   N
2019-03-08 19:32:56.481220: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1149] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 2264 MB memory) -> physical GPU (device: 0, name: NVIDIA Tegra X2, pci bus id: 0000:00:00.0, compute capability: 6.2)
2019-03-08 19:33:01.647196: F tensorflow/stream_executor/cuda/cuda_driver.cc:1184] Check failed: PointerIsValid(gpu_dst) Destination pointer is not actually on GPU: 67666247680
[1]    27238 abort (core dumped)  python3 tensorflow-nvJetson/tf-test/test_tftrt.py
(.tensorflow) ➜  ~ python3 tensorflow-nvJetson/tf-test/gpu.py
Limited tf.compat.v2.summary API due to missing TensorBoard installation
2019-03-08 20:03:58.971054: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcuda.so.1
2019-03-08 20:03:59.022948: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:976] ARM64 does not support NUMA - returning NUMA node zero
2019-03-08 20:03:59.023976: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcuda.so.1
2019-03-08 20:03:59.024117: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1467] Found device 0 with properties:
name: NVIDIA Tegra X2 major: 6 minor: 2 memoryClockRate(GHz): 1.3005
pciBusID: 0000:00:00.0
totalMemory: 7.67GiB freeMemory: 2.66GiB
2019-03-08 20:03:59.024172: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1546] Adding visible gpu devices: 0
2019-03-08 20:03:59.024270: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcudart.so.9.0
2019-03-08 20:03:59.024721: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1015] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-03-08 20:03:59.024755: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1021]      0
2019-03-08 20:03:59.024783: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1034] 0:   N
2019-03-08 20:03:59.024932: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1149] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 2419 MB memory) -> physical GPU (device: 0, name: NVIDIA Tegra X2, pci bus id: 0000:00:00.0, compute capability: 6.2)
Device mapping:
/job:localhost/replica:0/task:0/device:GPU:0 -> device: 0, name: NVIDIA Tegra X2, pci bus id: 0000:00:00.0, compute capability: 6.2
2019-03-08 20:03:59.025870: I tensorflow/core/common_runtime/direct_session.cc:316] Device mapping:
/job:localhost/replica:0/task:0/device:GPU:0 -> device: 0, name: NVIDIA Tegra X2, pci bus id: 0000:00:00.0, compute capability: 6.2

MatMul: (MatMul): /job:localhost/replica:0/task:0/device:GPU:0
2019-03-08 20:03:59.028368: I tensorflow/core/common_runtime/placer.cc:61] MatMul: (MatMul)/job:localhost/replica:0/task:0/device:GPU:0
a: (Const): /job:localhost/replica:0/task:0/device:GPU:0
2019-03-08 20:03:59.028451: I tensorflow/core/common_runtime/placer.cc:61] a: (Const)/job:localhost/replica:0/task:0/device:GPU:0
b: (Const): /job:localhost/replica:0/task:0/device:GPU:0
2019-03-08 20:03:59.028492: I tensorflow/core/common_runtime/placer.cc:61] b: (Const)/job:localhost/replica:0/task:0/device:GPU:0
2019-03-08 20:04:02.965650: F tensorflow/stream_executor/cuda/cuda_driver.cc:1184] Check failed: PointerIsValid(gpu_dst) Destination pointer is not actually on GPU: 67660218368
[1]    27628 abort (core dumped)  python3 tensorflow-nvJetson/tf-test/gpu.py
Please use `rate` instead of `keep_prob`. Rate should be set to `rate = 1 - keep_prob`.
W0316 22:30:39.139987 547609665536 deprecation.py:506] From /home/nvidia/.tensorflow/lib/python3.5/site-packages/tensorflow/python/training/slot_creator.py:187: calling Zeros.__init__ (from tensorflow.python.ops.init_ops) with dtype is deprecated and will be removed in a future version.
Instructions for updating:
Call initializer instance with the dtype argument instead of passing it to the constructor
2019-03-16 22:31:37.686594: F tensorflow/core/kernels/random_op_gpu.cu.cc:64] Non-OK-status: CudaLaunchKernel(FillPhiloxRandomKernelLaunch<Distribution>, num_blocks, block_size, 0, d.stream(), gen, data, size, dist) status: Internal: unknown error
Fatal Python error: Aborted

Thread 0x0000007f49500200 (most recent call first):
  File "/usr/lib/python3.5/threading.py", line 293 in wait
  File "/usr/lib/python3.5/queue.py", line 164 in get
  File "/home/nvidia/.tensorflow/lib/python3.5/site-packages/tensorflow/python/summary/writer/event_file_writer.py", line 159 in run
  File "/usr/lib/python3.5/threading.py", line 914 in _bootstrap_inner
  File "/usr/lib/python3.5/threading.py", line 882 in _bootstrap

Thread 0x0000007f49d00200 (most recent call first):
  File "/usr/lib/python3.5/threading.py", line 293 in wait
  File "/usr/lib/python3.5/queue.py", line 164 in get
  File "/home/nvidia/.tensorflow/lib/python3.5/site-packages/tensorflow/python/summary/writer/event_file_writer.py", line 159 in run
  File "/usr/lib/python3.5/threading.py", line 914 in _bootstrap_inner
  File "/usr/lib/python3.5/threading.py", line 882 in _bootstrap

Thread 0x0000007f80146000 (most recent call first):
  File "/home/nvidia/.tensorflow/lib/python3.5/site-packages/tensorflow/python/client/session.py", line 1408 in _call_tf_sessionrun
  File "/home/nvidia/.tensorflow/lib/python3.5/site-packages/tensorflow/python/client/session.py", line 1320 in _run_fn
  File "/home/nvidia/.tensorflow/lib/python3.5/site-packages/tensorflow/python/client/session.py", line 1335 in _do_call
  File "/home/nvidia/.tensorflow/lib/python3.5/site-packages/tensorflow/python/client/session.py", line 1329 in _do_run
  File "/home/nvidia/.tensorflow/lib/python3.5/site-packages/tensorflow/python/client/session.py", line 1153 in _run
  File "/home/nvidia/.tensorflow/lib/python3.5/site-packages/tensorflow/python/client/session.py", line 930 in run
  File "/home/nvidia/.tensorflow/lib/python3.5/site-packages/tensorflow/python/framework/ops.py", line 5448 in _run_using_default_session
  File "/home/nvidia/.tensorflow/lib/python3.5/site-packages/tensorflow/python/framework/ops.py", line 2616 in run
  File "tensorflow/tensorflow/examples/tutorials/mnist/mnist_with_summaries.py", line 143 in train
  File "tensorflow/tensorflow/examples/tutorials/mnist/mnist_with_summaries.py", line 187 in main
  File "/home/nvidia/.tensorflow/lib/python3.5/site-packages/absl/app.py", line 251 in _run_main
  File "/home/nvidia/.tensorflow/lib/python3.5/site-packages/absl/app.py", line 300 in run
  File "/home/nvidia/.tensorflow/lib/python3.5/site-packages/tensorflow/python/platform/app.py", line 40 in run
  File "tensorflow/tensorflow/examples/tutorials/mnist/mnist_with_summaries.py", line 214 in <module>
[1]    3310 abort (core dumped)  python3 tensorflow/tensorflow/examples/tutorials/mnist/mnist_with_summaries.p
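Note: this does not address the PointerIsValid check itself, but since the TX2's GPU and CPU share the same physical memory, it can help to keep TensorFlow from grabbing most of it. A minimal sketch using the TF 2.0 v1-compat session config (the fraction value is only an example, not something verified on the TX2):

import tensorflow as tf

# Cap TensorFlow's GPU allocation and let it grow on demand, since the
# TX2 shares its ~8 GB of memory between CPU and GPU.
gpu_options = tf.compat.v1.GPUOptions(
    per_process_gpu_memory_fraction=0.5,  # example value, tune for your workload
    allow_growth=True)
config = tf.compat.v1.ConfigProto(gpu_options=gpu_options)
sess = tf.compat.v1.Session(config=config)
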
peterlee0127 commented 5 years ago

This issue is fixed. It was caused by a wrong build configuration.

uniphix000 commented 4 years ago

Sorry to comment on a closed issue, but I got this error: 'F ./tensorflow/core/kernels/random_op_gpu.h:227] Non-OK-status: GpuLaunchKernel(FillPhiloxRandomKernelLaunch, num_blocks, block_size, 0, d.stream(), gen, data, size, dist) status: Internal: invalid configuration argument'. You mentioned it was caused by a wrong config; could you give me more information on how to fix this? Thanks.

peterlee0127 commented 4 years ago

@uniphix000 Which versions of JetPack and TensorFlow are you using? Did you build from source or use my pre-built wheel?

uniphix000 commented 4 years ago

I'm not familiar with JetPack, but I'm using tensorflow-gpu==2.0.0 installed by pip.

uniphix000 commented 4 years ago

P.S. Python 3.5, CUDA 10.0.

peterlee0127 commented 4 years ago

For TensorFlow 2.0 and later, try adding --config=nonccl:

bazel build --local_resources 5048,6,1.0 --config=nonccl --config=opt --config=cuda //tensorflow/tools/pip_package:build_pip_package

I don't think installing with pip will work on the Jetson platform.
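If the build succeeds and the wheel installs, a quick sanity check like the following (assuming a TF 2.x wheel) should list the Tegra GPU and run a small kernel on it:

import tensorflow as tf

print(tf.__version__)
print("Built with CUDA:", tf.test.is_built_with_cuda())

gpus = tf.config.experimental.list_physical_devices('GPU')
print("Visible GPUs:", gpus)

if gpus:
    # Tiny matmul on the GPU to confirm kernels actually launch.
    with tf.device('/GPU:0'):
        a = tf.random.uniform((256, 256))
        b = tf.random.uniform((256, 256))
        print(tf.reduce_sum(tf.matmul(a, b)).numpy())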

uniphix000 commented 4 years ago

My fault; I'm not running on a Jetson platform but on an HPC server with several GPUs (RTX 2080 Ti). The OS is Red Hat.

peterlee0127 commented 4 years ago

@uniphix000 In that case I think you can install TensorFlow with pip directly. Maybe your CUDA/cuDNN environment has some issues; see https://www.tensorflow.org/install. This project is for running TensorFlow on the NVIDIA Jetson platform.
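I'm not sure it applies to your setup, but on RTX 20xx cards these kernel-launch/initialization failures are often worked around by enabling GPU memory growth before any op runs; a minimal sketch for TF 2.0:

import tensorflow as tf

# Allocate GPU memory on demand instead of reserving it all at start-up.
for gpu in tf.config.experimental.list_physical_devices('GPU'):
    tf.config.experimental.set_memory_growth(gpu, True)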

uniphix000 commented 4 years ago

Thank you for your advice.