tensorflow / models

Models and examples built with TensorFlow

Frozen pretrained Faster RCNN/RFCN networks from model zoo yielding different outputs on different GPUs and runs #2374

Closed · EpochalEngineer closed this issue 4 years ago

EpochalEngineer commented 7 years ago

System information

Describe the problem

Running the same frozen graph on different GPUs yields different results, and GPUs 1 and 2 are not deterministic. GPU selection is done by making the other devices invisible through the session config, so that TensorFlow runs only on GPU 0, only on GPU 1, and so forth. This uses the frozen pretrained networks from this repository's linked model zoo and the supplied object_detection_tutorial.ipynb with no modifications other than setting the CUDA visible_device_list in the session config. The SSD frozen models, however, give identical outputs on all 3 GPUs from what I have seen.
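For reference, this is roughly how the per-GPU selection is done in the test (a minimal sketch assuming a TF1-style frozen detection graph as in the tutorial; `run_detection`, `detection_graph`, and `image_np` are illustrative names, not the exact notebook code):

```python
import numpy as np
import tensorflow as tf

def run_detection(detection_graph, image_np, gpu_id):
    # Restrict this session to a single physical GPU via the session config,
    # rather than via the CUDA_VISIBLE_DEVICES environment variable.
    gpu_options = tf.GPUOptions(visible_device_list=str(gpu_id))
    config = tf.ConfigProto(gpu_options=gpu_options)
    with tf.Session(graph=detection_graph, config=config) as sess:
        image_tensor = detection_graph.get_tensor_by_name('image_tensor:0')
        scores = detection_graph.get_tensor_by_name('detection_scores:0')
        scores_out = sess.run(scores,
                              feed_dict={image_tensor: np.expand_dims(image_np, 0)})
        # Return the top 4 box scores for this image.
        return scores_out[0][:4]
```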

I have also run cuda_memtest on all 3 GPUs; the logs are attached.

UPDATE: I just tested on a second machine with 2 GPUs, and reproduced the issue. GPU 0 is deterministic, GPU 1 is not (and often produces bad results).

Source code / logs

I've attached a diff of the modified object_detection_tutorial.ipynb, which loops over the 3 GPUs 3 times each and prints the top box scores; these change from run to run. Also attached is a PDF of that notebook with the detections drawn on it. Text output:

Evaluating image 0

Running on GPU 0, top 4 box scores:
Iter 1: [ 0.99978215 0.99857557 0.95300484 0.91580492]
Iter 2: [ 0.99978215 0.99857557 0.95300484 0.91580492]
Iter 3: [ 0.99978215 0.99857557 0.95300484 0.91580492]

Running on GPU 1, top 4 box scores:
Iter 1: [ 0.68702352 0.16781448 0.13143283 0.12993629]
Iter 2: [ 0.18502565 0.16854601 0.08074528 0.07859289]
Iter 3: [ 0.18502565 0.16854601 0.05546702 0.05111229]

Running on GPU 2, top 4 box scores:
Iter 1: [ 0.68702352 0.16781448 0.13143283 0.12993629]
Iter 2: [ 0.18941374 0.18502565 0.16854601 0.16230994]
Iter 3: [ 0.18502565 0.16854601 0.05546702 0.05482833]

Evaluating image 1

Running on GPU 0, top 4 box scores:
Iter 1: [ 0.99755412 0.99750346 0.99380219 0.99067008]
Iter 2: [ 0.99755412 0.99750346 0.99380219 0.99067008]
Iter 3: [ 0.99755412 0.99750346 0.99380219 0.99067008]

Running on GPU 1, top 4 box scores:
Iter 1: [ 0.96881998 0.96441168 0.96164131 0.96006596]
Iter 2: [ 0.9377929 0.91686022 0.80374646 0.79758978]
Iter 3: [ 0.90396696 0.89217037 0.85456908 0.85334581]

Running on GPU 2, top 4 box scores:
Iter 1: [ 0.9377929 0.91686022 0.80374646 0.79758978]
Iter 2: [ 0.9377929 0.91686022 0.80374646 0.79758978]
Iter 3: [ 0.9377929 0.91686022 0.80374646 0.79758978]

object_detection_tutorial.diff.txt

gpu_output_differences.pdf

Updated with longer run: cuda_memtest.log.txt

EpochalEngineer commented 7 years ago

Updated with a simplified test using the model zoo, and with a second-machine test that reproduced these issues.

EpochalEngineer commented 7 years ago

@aselle Was there supposed to be a response added with the removal of that tag?

aselle commented 7 years ago

@nealwu, could you take a look?

nealwu commented 7 years ago

Looks like this is an object detection question. Looping in @derekjchow @jch1

EpochalEngineer commented 7 years ago

I noticed a difference between using the CUDA_VISIBLE_DEVICES environment variable and setting the config parameter. We're no longer able to reproduce this behavior with the environment variable, only with the config parameter. In addition, when using the config parameter, a small ~180 MB task appears on GPU 0 even when the config is set to use GPUs 1,2, which seems to correlate with these issues.
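For comparison, this is the environment-variable approach that no longer reproduces the problem for us (a minimal sketch; the device string "1" is illustrative):

```python
import os

# Must be set before TensorFlow initializes CUDA (i.e. before the first
# session or GPU op is created), otherwise it has no effect.
os.environ["CUDA_VISIBLE_DEVICES"] = "1"

import tensorflow as tf

# With the environment variable, physical GPU 1 is the only device the
# process can see (it appears as /gpu:0), so no visible_device_list is
# needed in the session config.
sess = tf.Session()
```

With the config-parameter approach, by contrast, all GPUs remain visible to the process and TensorFlow merely restricts placement, which may explain the small allocation that still shows up on GPU 0.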

tensorflowbutler commented 4 years ago

Hi there, we are checking to see if you still need help on this, as this seems to be a considerably old issue. Please update this issue with the latest information, a code snippet to reproduce your issue, and the error you are seeing. If we don't hear from you in the next 7 days, this issue will be closed automatically. If you no longer need help on this issue, please consider closing it.