sniper0110 opened this issue 3 years ago
Check whether the GPU is detected by TensorFlow using the Python code in the link. If it is not detected, the problem may be caused by a conflict between the CUDA version and the TensorFlow version on Google Cloud. Waiting for your reply.
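For reference, a minimal version of that check (a sketch assuming TensorFlow 2.x; the script behind the link may differ) looks like this:

```python
import tensorflow as tf

# Confirm the installed TensorFlow build has CUDA support and can see a GPU.
print("Built with CUDA:", tf.test.is_built_with_cuda())
print("Visible GPUs:", tf.config.list_physical_devices("GPU"))
```

An empty GPU list together with a CUDA-enabled build usually points to a driver or CUDA version mismatch, which matches the suspicion above.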
How do I check this when I don't have access to the code used on AI Platform for the Object Detection API? I think it runs in a Docker image. If it were my code or my Docker image, I would know how to check, but because it isn't mine, I'm finding it difficult to verify anything.
I am facing a similar issue. The command I ran is the same as the one mentioned above:
gcloud ai-platform jobs submit training segmentation_maskrcnn_`date +%m_%d_%Y_%H_%M_%S` \
--runtime-version 2.1 \
--python-version 3.7 \
--job-dir=gs://${MODEL_DIR} \
--package-path ./object_detection \
--module-name object_detection.model_main_tf2 \
--region us-central1 \
--scale-tier CUSTOM \
--master-machine-type n1-highcpu-16 \
--master-accelerator count=2,type=nvidia-tesla-v100 \
-- \
--model_dir=gs://${MODEL_DIR} \
--pipeline_config_path=gs://${PIPELINE_CONFIG_PATH}
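One thing worth noting is that --package-path uploads your local copy of object_detection, so you can add a startup log to it yourself before submitting. As a rough sketch (the exact placement inside model_main_tf2.py is up to you), something like this near the top of the entry module would make GPU visibility show up in the job's Cloud Logging output:

```python
import logging

import tensorflow as tf

# Log what the AI Platform container actually exposes. This line lands in
# the job's logs, so GPU visibility can be checked without shell access.
gpus = tf.config.list_physical_devices("GPU")
logging.warning("TensorFlow %s sees %d GPU(s): %s", tf.__version__, len(gpus), gpus)
```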
Here is a screenshot -
The GPU utilization is 0 and the training is taking 5 hours instead of the usual 2 hours.
PS: Does this mean that I am being charged for the GPUs even when the utilization is 0?
Any updates?
I am experiencing the same issue. Has anyone found a solution?
Prerequisites
Please answer the following questions for yourself before submitting an issue.
1. The entire URL of the file you are using
https://github.com/tensorflow/models/tree/master/research/object_detection
2. Describe the bug
I am running training of some models (SSD for object detection and Mask R-CNN for segmentation) on AI Platform. The training works fine, but it is not using the GPUs on AI Platform even though I request a set of GPUs when I launch the job. It's not a quota problem: I checked the GPU quota available to me and I am requesting exactly that amount.
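As a side note, a minimal way to see whether ops silently fall back to the CPU (a sketch assuming TF 2.x, not specific to the Object Detection API) is to enable device placement logging:

```python
import tensorflow as tf

# With placement logging enabled, TensorFlow prints the device chosen for
# each op; a usable GPU should show up as .../device:GPU:0.
tf.debugging.set_log_device_placement(True)

a = tf.random.uniform((1024, 1024))
b = tf.random.uniform((1024, 1024))
c = tf.matmul(a, b)
print(c.device)
```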
3. Steps to reproduce
Prepare the dataset and config file, then run the training job using this command:
The job starts and finishes just fine, but it is not using the GPUs. When I look at the GPU usage for my training job, I see this:
As you can see, the usage is at 0% the whole time.
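For context, model_main_tf2 wires up multi-GPU training through a tf.distribute strategy (MirroredStrategy when not on TPU, as far as I can tell). Here is a sketch of the same pattern, useful for sanity-checking the replica count against the two requested V100s:

```python
import tensorflow as tf

# MirroredStrategy replicates the model across all visible GPUs.
# With 2 x V100 requested, num_replicas_in_sync should be 2;
# a value of 1 means training is running on a single device (likely the CPU).
strategy = tf.distribute.MirroredStrategy()
print("Replicas in sync:", strategy.num_replicas_in_sync)

with strategy.scope():
    model = tf.keras.Sequential([tf.keras.layers.Dense(10, input_shape=(4,))])
    model.compile(optimizer="adam", loss="mse")
```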
4. Expected behavior
I expected the training to run on the GPU.
5. Additional context
The full logs from my training job:
6. System information