tensorflow / tfx

TFX is an end-to-end platform for deploying production ML pipelines
https://tensorflow.github.io/tfx/
Apache License 2.0

ResourceExhaustedError When Running Evaluator #1350

Closed HugoPu closed 4 years ago

HugoPu commented 4 years ago

I created a BERT TFX pipeline project based on this tutorial. It worked normally until the Evaluator step.

Environment:

OS: Ubuntu 16.04
TensorFlow version: 1.15
Graphics card: RTX 2080 Ti

The Evaluator code is as follows:

evaluator = Evaluator(
    examples=example_gen.outputs['examples'],
    model=trainer.outputs['model'],
    feature_slicing_spec=evaluator_pb2.FeatureSlicingSpec(specs=[
        evaluator_pb2.SingleSlicingSpec()
    ]))
context.run(evaluator)

The error log is as follows:

ResourceExhaustedError: 2 root error(s) found.
  (0) Resource exhausted: OOM when allocating tensor with shape[1000,12,256,256] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
     [[{{node bert/encoder/layer_0/attention/self/Softmax}}]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

     [[mean/broadcast_weights/assert_broadcastable/is_valid_shape/has_valid_nonscalar_shape/has_invalid_dims/concat/_477]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

  (1) Resource exhausted: OOM when allocating tensor with shape[1000,12,256,256] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
     [[{{node bert/encoder/layer_0/attention/self/Softmax}}]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

Does this mean the GPU did not have enough RAM? But I set both train_batch_size and eval_batch_size to 1, and set session_config.gpu_options.per_process_gpu_memory_fraction = 0.5, and it still didn't work. I also found that the shape of the tensor is [1000, 12, 256, 256], which is strange: where do the 1000 and 12 come from?
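For what it's worth, the shape looks like [batch, heads, seq_len, seq_len]: 12 matches the number of attention heads in BERT-base, and 1000 would be a batch size. A back-of-envelope check (a sketch, assuming float32 at 4 bytes per element) shows why a single layer's attention map at that shape already exhausts the GPU:

```python
# Hypothetical back-of-envelope: size of one attention-softmax tensor of
# shape [batch, heads, seq_len, seq_len] in float32 (4 bytes per element).
batch, heads, seq_len = 1000, 12, 256
bytes_per_float = 4
tensor_bytes = batch * heads * seq_len * seq_len * bytes_per_float
print(f"{tensor_bytes / 2**30:.2f} GiB")  # about 2.93 GiB for a single layer
```

That is roughly 2.9 GiB per layer, and BERT-base has 12 such layers, so a batch of 1000 cannot fit on an 11 GB RTX 2080 Ti regardless of the memory fraction setting.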

Did I make a mistake? Could you give me some advice?

numerology commented 4 years ago

Hi @HugoPu , can you confirm the TFX version you're using (especially the TFMA)? Thanks

HugoPu commented 4 years ago

> Hi @HugoPu , can you confirm the TFX version you're using (especially the TFMA)? Thanks

Hi @numerology, the version info is as follows:

tensorboard                1.15.0
tensorflow                 1.15.2
tensorflow-data-validation 0.15.0
tensorflow-estimator       1.15.1
tensorflow-metadata        0.15.2
tensorflow-model-analysis  0.15.4
tensorflow-serving-api     1.15.0
tensorflow-text            1.15.0
tensorflow-transform       0.15.0
tfx                        0.15.0

HugoPu commented 4 years ago

It is the same issue as the one mentioned in #236, and 1000 is the batch_size. Rewriting the executors of the Evaluator and ModelValidator can fix this issue.
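Until the executors are rewritten, a common workaround (a sketch only, not the executor fix described in #236) is to hide the GPU from the process running the Evaluator, so TFMA evaluates on CPU instead of fighting the trainer for GPU memory:

```python
import os

# Workaround sketch: hide all GPUs from TensorFlow before the Evaluator runs,
# so TFMA falls back to CPU and avoids the GPU OOM (at the cost of speed).
# This must be set before TensorFlow initializes its GPU devices.
os.environ["CUDA_VISIBLE_DEVICES"] = "-1"
```

Setting the variable in the shell that launches the pipeline works just as well; the key point is that it takes effect before TensorFlow enumerates devices.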

andyliangdong commented 4 years ago

I am running into this problem too.