tensorflow / models

Models and examples built with TensorFlow

TF 2.0 object detection - all memory is getting used up (>240 GB) #8119

Closed lamdawr closed 3 years ago

lamdawr commented 4 years ago

System information

- Have I written custom code (as opposed to using example directory): Used https://github.com/tensorflow/models/tree/master/official/vision/detection, changed the strategy to "mirrored" to run on GPU/CPU, reduced the batch size to 1 to start with the minimum, and pointed the config at my custom tfrecords. All other code is unchanged.
- OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Ubuntu 18.04.2 LTS (bionic)
- TensorFlow backend (yes / no): yes
- TensorFlow version: 2.0
- Keras version: 2.3.1
- Python version: 3.7.3
- CUDA/cuDNN version: N/A
- GPU model and memory: N/A

Describe the current behavior: Training eats up all the server memory (>240 GB). Loading a model once results in continually increasing memory usage.

Describe the expected behavior: Calling model train should not result in any permanent increase in memory usage.

Code to reproduce the issue: the same code as in https://github.com/tensorflow/models/tree/master/official/vision/detection, except that I run it locally with the strategy set to "mirrored".
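
(For reference, a minimal sketch of how the memory growth could be logged from inside the training process; psutil, the 60-second interval, and the helper name are illustrative choices rather than code from the detection repo.)

```python
# Hypothetical helper to log resident memory while training runs; requires
# `pip install psutil`. The interval and print format are arbitrary.
import os
import threading
import time

import psutil


def log_rss_every(seconds=60):
    process = psutil.Process(os.getpid())

    def _loop():
        while True:
            rss_gb = process.memory_info().rss / 1e9
            print(f"[mem-monitor] resident memory: {rss_gb:.1f} GB")
            time.sleep(seconds)

    threading.Thread(target=_loop, daemon=True).start()


# Call once before training starts, e.g. at the top of main():
# log_rss_every(60)
```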

yeqingli commented 4 years ago

If running in training mode, maybe reduce the size of the input shuffle buffer.

https://github.com/tensorflow/models/blob/67f6015a23741e3934b6641a1e3687aa1e73bf23/official/vision/detection/dataloader/input_reader.py#L99

Set it to something like 64 or 32 to see if memory usage improves.
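
(For context, the shuffle buffer keeps that many fully decoded examples in host RAM, so its size directly bounds part of the input pipeline's memory footprint. Below is a minimal, generic tf.data sketch of where that buffer_size sits; the file pattern and parse function are placeholders, not the actual input_reader.py code.)

```python
import tensorflow as tf

# Hypothetical stand-ins; the real pipeline lives in
# official/vision/detection/dataloader/input_reader.py.
file_pattern = "/data/train.tfrecords-*"

def parse_fn(serialized_example):
    # Placeholder parser; the real one decodes images, boxes and labels.
    return serialized_example

dataset = tf.data.Dataset.list_files(file_pattern, shuffle=True)
dataset = dataset.interleave(
    tf.data.TFRecordDataset,
    cycle_length=8,
    num_parallel_calls=tf.data.experimental.AUTOTUNE)
# This buffer_size is the value the suggestion refers to: every buffered
# element is a whole serialized example held in host RAM, so 64 instead of
# several thousand keeps this stage's memory use small (at the cost of
# weaker shuffling).
dataset = dataset.shuffle(buffer_size=64)
dataset = dataset.map(parse_fn,
                      num_parallel_calls=tf.data.experimental.AUTOTUNE)
dataset = dataset.batch(1)
dataset = dataset.prefetch(tf.data.experimental.AUTOTUNE)
```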

lamdawr commented 4 years ago

Hi,

Thanks for the reply.

I tried that already but it didn't work.

Thanks Lakshmi

yeqingli commented 4 years ago

Do you see any changes in memory usage?

lamdawr commented 4 years ago

No change in memory usage. It kept climbing to 250 GB, and then I force-stopped the run.

yeqingli commented 4 years ago

Could you provide the detailed command line and config that you used to reproduce the issue? Did you run on GPU or CPU?

lamdawr commented 4 years ago

retinanet_config.py:

RESNET50_FROZEN_VAR_PREFIX = r'(resnet\d+/)conv2d(|_([1-9]|10))\/'
RESNET_FROZEN_VAR_PREFIX = r'(resnet\d+)\/(conv2d(|_([1-9]|10))|batch_normalization(|_([1-9]|10)))\/'

# pylint: disable=line-too-long
RETINANET_CFG = {
'type': 'retinanet',
'model_dir': '',
'use_tpu': False,
'strategy_type': 'mirrored',
'train': {
    'batch_size': 1,
    'iterations_per_loop': 500,
    'total_steps': 22500,
    'optimizer': {
        'type': 'momentum',
        'momentum': 0.9,
        'nesterov': True,  # False is better for TPU v3-128.
    },
    'learning_rate': {
        'type': 'step',
        'warmup_learning_rate': 0.0067,
        'warmup_steps': 500,
        'init_learning_rate': 0.08,
        'learning_rate_levels': [0.008, 0.0008],
        'learning_rate_steps': [15000, 20000],
    },
    'checkpoint': {
        'path': '',
        'prefix': '',
    },
    'frozen_variable_prefix': RESNET50_FROZEN_VAR_PREFIX,
    'train_file_pattern': '',
    # TODO(b/142174042): Support transpose_input option.
    'transpose_input': False,
    'l2_weight_decay': 0.0001,
    'input_sharding': True,
},
'eval': {
    'batch_size': 1,
    'min_eval_interval': 180,
    'eval_timeout': None,
    'eval_samples': 1750,
    'type': 'box',
    'val_json_file': '',
    'eval_file_pattern': '',
    'input_sharding': True,
},
'predict': {
    'predict_batch_size': 1,
},
'architecture': {
    'parser': 'retinanet_parser',
    'backbone': 'resnet',
    'multilevel_features': 'fpn',
    'use_bfloat16': False,
},
'anchor': {
    'min_level': 3,
    'max_level': 7,
    'num_scales': 3,
    'aspect_ratios': [1.0, 2.0, 0.5],
    'anchor_size': 4.0,
},
'retinanet_parser': {
    'use_bfloat16': False,
    'output_size': [640, 640],
    'num_channels': 3,
    'match_threshold': 0.5,
    'unmatched_threshold': 0.5,
    'aug_rand_hflip': True,
    'aug_scale_min': 1.0,
    'aug_scale_max': 1.0,
    'use_autoaugment': False,
    'autoaugment_policy_name': 'v0',
    'skip_crowd_during_training': True,
    'max_num_instances': 100,
},
'resnet': {
    'resnet_depth': 50,
    'dropblock': {
        'dropblock_keep_prob': None,
        'dropblock_size': None,
    },
    'batch_norm': {
        'batch_norm_momentum': 0.997,
        'batch_norm_epsilon': 1e-4,
        'batch_norm_trainable': True,
    },
},
'fpn': {
    'min_level': 3,
    'max_level': 7,
    'fpn_feat_dims': 256,
    'use_separable_conv': False,
    'use_batch_norm': True,
    'batch_norm': {
        'batch_norm_momentum': 0.997,
        'batch_norm_epsilon': 1e-4,
        'batch_norm_trainable': True,
    },
},
'nasfpn': {
    'min_level': 3,
    'max_level': 7,
    'fpn_feat_dims': 256,
    'num_repeats': 5,
    'use_separable_conv': False,
    'dropblock': {
        'dropblock_keep_prob': None,
        'dropblock_size': None,
    },
    'batch_norm': {
        'batch_norm_momentum': 0.997,
        'batch_norm_epsilon': 1e-4,
        'batch_norm_trainable': True,
    },
},
'retinanet_head': {
    'min_level': 3,
    'max_level': 7,
    # Note that `num_classes` is the total number of classes including
    # one background classes whose index is 0.
    'num_classes': 200,
    'anchors_per_location': 9,
    'retinanet_head_num_convs': 4,
    'retinanet_head_num_filters': 256,
    'use_separable_conv': False,
    'batch_norm': {
        'batch_norm_momentum': 0.997,
        'batch_norm_epsilon': 1e-4,
        'batch_norm_trainable': True,
    },
},
'retinanet_loss': {
    'num_classes': 200,
    'focal_loss_alpha': 0.25,
    'focal_loss_gamma': 1.5,
    'huber_loss_delta': 0.1,
    'box_loss_weight': 50,
},
'postprocess': {
    'use_batched_nms': False,
    'min_level': 3,
    'max_level': 7,
    'max_total_size': 100,
    'nms_iou_threshold': 0.5,
    'score_threshold': 0.05,
    'pre_nms_num_boxes': 5000,
},
'enable_summary': True,
}

RETINANET_RESTRICTIONS = [
    'architecture.use_bfloat16 == retinanet_parser.use_bfloat16',
    'anchor.min_level == retinanet_head.min_level',
    'anchor.max_level == retinanet_head.max_level',
    'anchor.min_level == postprocess.min_level',
    'anchor.max_level == postprocess.max_level',
    'retinanet_head.num_classes == retinanet_loss.num_classes',
]

my_retinanet.yaml

type: 'retinanet'
train:
  train_file_pattern: "/IMG2/object_detection/tf_records/107/train.tfrecords-?????-of-00010"
eval:
  eval_file_pattern: "/IMG2/object_detection/tf_records/107/val.tfrecords-?????-of-00010"

start_model_train.sh

#!/bin/bash

cd /home/user1/tf_2/models
export PYTHONPATH=$PYTHONPATH:`pwd`:`pwd`/slim

MODEL_DIR="/IMG2/test/tf_2_test/tf_models/107/245"
python official/vision/detection/main.py \
  --strategy_type="mirrored" \
  --model_dir="${MODEL_DIR?}" \
  --mode=train \
  --config_file="official/vision/detection/my_retinanet.yaml"

lamdawr commented 4 years ago

Hello all, I was wondering if anybody has found a solution to this problem?

lamdawr commented 4 years ago

Hi all,

Updating to TF 2.1 has helped with the memory issue, yay!

But I am still not able to train. Any help is appreciated:

I0211 20:13:12.788090 139968991819584 distributed_executor.py:411] Training started
Traceback (most recent call last):
  File "official/vision/detection/main.py", line 250, in <module>
    app.run(main)
  File "/opt/conda/lib/python3.7/site-packages/absl/app.py", line 299, in run
    _run_main(main, args)
  File "/opt/conda/lib/python3.7/site-packages/absl/app.py", line 250, in _run_main
    sys.exit(main(argv))
  File "official/vision/detection/main.py", line 245, in main
    run()
  File "official/vision/detection/main.py", line 239, in run
    callbacks=callbacks)
  File "official/vision/detection/main.py", line 136, in run_executor
    save_config=True)
  File "/home/user1/tf_2/models/official/modeling/training/distributed_executor.py", line 427, in train
    raise ValueError('total loss is NaN.')
ValueError: total loss is NaN.
2020-02-11 20:25:58.280294: W tensorflow/core/kernels/data/cache_dataset_ops.cc:822] The calling iterator did not fully read the dataset being cached. In order to avoid unexpected truncation of the dataset, the partially cached contents of the dataset will be discarded. This can happen if you have an input pipeline similar to dataset.cache().take(k).repeat(). You should use dataset.take(k).cache().repeat() instead.
2020-02-11 20:25:58.680960: W tensorflow/core/kernels/data/generator_dataset_op.cc:103] Error occurred when finalizing GeneratorDataset iterator: Cancelled: Operation was cancelled
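
(As an aside, the cache warning at the end of that log is about the ordering of tf.data transformations. A small, self-contained illustration of the two orderings it mentions, unrelated to the detection code itself:)

```python
import tensorflow as tf

ds = tf.data.Dataset.range(100)

# Ordering the warning flags: take() truncates the stream before the cache is
# ever filled, so the partial cache is discarded and rebuilt on every repeat.
bad = ds.cache().take(10).repeat(3)

# Suggested ordering: truncate first, then cache the complete small dataset,
# then repeat it.
good = ds.take(10).cache().repeat(3)

for value in good.take(5):
    print(int(value))
```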

haimat commented 4 years ago

How can you make this Object Detection API work with TF 2? I thought it was for TF 1.x only.

yeqingli commented 3 years ago

For the NaN issue, could you try some training tricks to stabilize the training process, like using gradient_clip_norm or a smaller learning rate? Debugging NaN issues is usually difficult. It would be easier if you could trace back to the original tensor whose numerics exploded.
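
(A minimal, generic sketch of the gradient-clipping idea in a custom TF 2 train step; model, loss_fn, and the clip value of 10.0 are placeholders, and this is not the executor this repo actually uses.)

```python
import tensorflow as tf

optimizer = tf.keras.optimizers.SGD(
    learning_rate=0.01, momentum=0.9, nesterov=True)


@tf.function
def train_step(model, images, labels, loss_fn):
    with tf.GradientTape() as tape:
        loss = loss_fn(labels, model(images, training=True))
    grads = tape.gradient(loss, model.trainable_variables)
    # Clip by global norm so a single bad batch cannot produce a huge update
    # that pushes the loss to NaN; 10.0 is an arbitrary illustrative value.
    grads, _ = tf.clip_by_global_norm(grads, 10.0)
    optimizer.apply_gradients(zip(grads, model.trainable_variables))
    return loss
```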

Closing this issue since it is no longer about memory. We can use another issue to track the NaN problem.