Esaada opened this issue 6 years ago
HPARAMS should be one of these:
First of all, thanks, it worked. But it doesn't run on my GPU (and I'm sure I have CUDA installed). What am I doing wrong?!
Although I do get these lines in the log:
INFO:tensorflow:worker_gpu=1
INFO:tensorflow:sync=False
WARNING:tensorflow:Schedule=continuous_train_and_eval. Assuming that training is running on a single machine.
INFO:tensorflow:datashard_devices: ['gpu:0']
INFO:tensorflow:caching_devices: None
INFO:tensorflow:ps_devices: ['gpu:0']
INFO:tensorflow:Using config: {'_save_checkpoints_secs': None, '_keep_checkpoint_max': 20, '_task_type': None, '_cluster_spec': <tensorflow.python.training.server_lib.ClusterSpec object at 0x7f9f5dd6a750>, '_keep_checkpoint_every_n_hours': 10000, '_session_config': gpu_options { per_process_gpu_memory_fraction: 0.95
The reason I suspect the GPU isn't being used is that when I run nvidia-smi I see almost no GPU memory in use (289MiB / 122285MiB), and the CPU usage is very high.
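A quick way to check whether TensorFlow can see the GPU at all is to list the devices it has registered. This is only a minimal sketch, assuming TensorFlow 1.x (as in the logs above):

```python
# Minimal sketch (assuming TensorFlow 1.x): list the devices TensorFlow has
# registered. If only CPU devices show up, the installed build is CPU-only
# or CUDA is not visible to TensorFlow, regardless of what nvidia-smi says.
from tensorflow.python.client import device_lib

devices = device_lib.list_local_devices()
for d in devices:
    print(d.device_type, d.name)

print("GPU visible to TensorFlow:", any(d.device_type == "GPU" for d in devices))
```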
you can "pip uninstall tensorflow==1.10.0" ,then update your tensorflow-gpu to the newest one.
Thanks, that worked too; I know because I now have new and bigger problems. First I got this:
E tensorflow/stream_executor/cuda/cuda_driver.cc:806] failed to allocate 1.25G (1339080448 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY: out of memory
The run didn't crash, and then I got this: E tensorflow/stream_executor/cuda/cuda_blas.cc:464] failed to create cublas handle: CUBLAS_STATUS_NOT_INITIALIZED
It still didn't crash, and I got this at the end: E tensorflow/stream_executor/cuda/cuda_blas.cc:464] failed to create cublas handle: CUBLAS_STATUS_NOT_INITIALIZED
and this, right after: E tensorflow/stream_executor/cuda/cuda_dnn.cc:353] Could not create cudnn handle: CUDNN_STATUS_INTERNAL_ERROR
And then it crashed. I looked online and still can't find a helpful solution. I even decreased my batch size to 1.
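For what it's worth, CUBLAS_STATUS_NOT_INITIALIZED and CUDNN_STATUS_INTERNAL_ERROR at startup often just mean the cuBLAS/cuDNN handles could not get GPU memory. A common TensorFlow-level workaround is to let the session allocate GPU memory on demand instead of grabbing a large fraction up front; whether and how this can be passed through t2t-trainer itself I'm not sure, so this is only a generic TF 1.x sketch:

```python
# Generic TensorFlow 1.x sketch (not a t2t-trainer flag; wiring it into
# t2t-trainer is an open assumption): grow GPU memory on demand instead of
# reserving ~95% up front, which can leave no room for the cuBLAS/cuDNN
# handles and trigger the errors above.
import tensorflow as tf

config = tf.ConfigProto()
config.gpu_options.allow_growth = True
# Or cap the fraction of GPU memory TensorFlow may claim:
# config.gpu_options.per_process_gpu_memory_fraction = 0.6

with tf.Session(config=config) as sess:
    # Trivial op just to confirm the session comes up cleanly on the GPU.
    print(sess.run(tf.constant("session initialized")))
```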
I was suffering from the same issue for many weeks and finally figured it out.
I was running on Windows 10 and Windows Server, using PowerShell:
.\nvidia-smi -q -i 0 -d SUPPORTED_CLOCKS
and saw that the NVIDIA driver was using CUDA 10.2.
When I downgraded my NVIDIA driver to one using CUDA 10.1 or CUDA 10.0, it finally worked.
It seems there is an issue with CUDA 10.2 support.
Description
I'm trying to train a Transformer model. After following the instructions, I ran the training command, and training got stuck at the very beginning, in the "saving checkpoint" phase:
INFO:tensorflow:Saving checkpoints for 0 into ./t2t_train/librispeech/transformer-transformer_base_single_gpu/model.ckpt.
When I say stuck, I mean it has been in that phase for 24 hours.
Environment information