If running in training mode, maybe reduce the size of the input shuffling buffer:
https://github.com/tensorflow/models/blob/67f6015a23741e3934b6641a1e3687aa1e73bf23/official/vision/detection/dataloader/input_reader.py#L99
Make it something like 64 or 32 to see if memory usage improves.
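For reference, here is a minimal sketch of what a smaller shuffle buffer looks like in a tf.data pipeline. This is illustrative only; the file pattern and surrounding calls are placeholders, not the exact code in input_reader.py. The point is that shuffle() keeps buffer_size decoded records in host RAM, so lowering it to 64 or 32 directly shrinks that part of the footprint.

import tensorflow as tf

SHUFFLE_BUFFER_SIZE = 64  # try 64 or 32 instead of a large default

# Hypothetical input pipeline; only the shuffle buffer size matters here.
dataset = tf.data.Dataset.list_files("/path/to/train.tfrecords-*", shuffle=True)
dataset = dataset.interleave(
    tf.data.TFRecordDataset,
    num_parallel_calls=tf.data.experimental.AUTOTUNE)
dataset = dataset.shuffle(SHUFFLE_BUFFER_SIZE)
dataset = dataset.batch(1)
dataset = dataset.prefetch(tf.data.experimental.AUTOTUNE)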
Hi,
Thanks for the reply.
I tried that already but it didn't work.
Thanks Lakshmi
Do you see any changes in memory usage?
No change in memory usage. It kept climbing to 250 GB and then I force-stopped the run.
Could you provide the detailed command line and config that you used to reproduce the issue? Did you run on GPU or CPU?
retinanet_config:
RESNET50_FROZEN_VAR_PREFIX = r'(resnet\d+/)conv2d(|_([1-9]|10))\/'
RESNET_FROZEN_VAR_PREFIX = r'(resnet\d+)\/(conv2d(|_([1-9]|10))|batch_normalization(|_([1-9]|10)))\/'
RETINANET_CFG = {
'type': 'retinanet',
'model_dir': '',
'use_tpu': False,
'strategy_type': 'mirrored',
'train': {
'batch_size': 1,
'iterations_per_loop': 500,
'total_steps': 22500,
'optimizer': {
'type': 'momentum',
'momentum': 0.9,
'nesterov': True,  # False is better for TPU v3-128.
},
'learning_rate': {
'type': 'step',
'warmup_learning_rate': 0.0067,
'warmup_steps': 500,
'init_learning_rate': 0.08,
'learning_rate_levels': [0.008, 0.0008],
'learning_rate_steps': [15000, 20000],
},
'checkpoint': {
'path': '',
'prefix': '',
},
'frozen_variable_prefix': RESNET50_FROZEN_VAR_PREFIX,
'train_file_pattern': '',
'transpose_input': False,
'l2_weight_decay': 0.0001,
'input_sharding': True,
},
'eval': {
'batch_size': 1,
'min_eval_interval': 180,
'eval_timeout': None,
'eval_samples': 1750,
'type': 'box',
'val_json_file': '',
'eval_file_pattern': '',
'input_sharding': True,
},
'predict': {
'predict_batch_size': 1,
},
'architecture': {
'parser': 'retinanet_parser',
'backbone': 'resnet',
'multilevel_features': 'fpn',
'use_bfloat16': False,
},
'anchor': {
'min_level': 3,
'max_level': 7,
'num_scales': 3,
'aspect_ratios': [1.0, 2.0, 0.5],
'anchor_size': 4.0,
},
'retinanet_parser': {
'use_bfloat16': False,
'output_size': [640, 640],
'num_channels': 3,
'match_threshold': 0.5,
'unmatched_threshold': 0.5,
'aug_rand_hflip': True,
'aug_scale_min': 1.0,
'aug_scale_max': 1.0,
'use_autoaugment': False,
'autoaugment_policy_name': 'v0',
'skip_crowd_during_training': True,
'max_num_instances': 100,
},
'resnet': {
'resnet_depth': 50,
'dropblock': {
'dropblock_keep_prob': None,
'dropblock_size': None,
},
'batch_norm': {
'batch_norm_momentum': 0.997,
'batch_norm_epsilon': 1e-4,
'batch_norm_trainable': True,
},
},
'fpn': {
'min_level': 3,
'max_level': 7,
'fpn_feat_dims': 256,
'use_separable_conv': False,
'use_batch_norm': True,
'batch_norm': {
'batch_norm_momentum': 0.997,
'batch_norm_epsilon': 1e-4,
'batch_norm_trainable': True,
},
},
'nasfpn': {
'min_level': 3,
'max_level': 7,
'fpn_feat_dims': 256,
'num_repeats': 5,
'use_separable_conv': False,
'dropblock': {
'dropblock_keep_prob': None,
'dropblock_size': None,
},
'batch_norm': {
'batch_norm_momentum': 0.997,
'batch_norm_epsilon': 1e-4,
'batch_norm_trainable': True,
},
},
'retinanet_head': {
'min_level': 3,
'max_level': 7,
# Note that `num_classes` is the total number of classes including
# one background classes whose index is 0.
'num_classes': 200,
'anchors_per_location': 9,
'retinanet_head_num_convs': 4,
'retinanet_head_num_filters': 256,
'use_separable_conv': False,
'batch_norm': {
'batch_norm_momentum': 0.997,
'batch_norm_epsilon': 1e-4,
'batch_norm_trainable': True,
},
},
'retinanet_loss': {
'num_classes': 200,
'focal_loss_alpha': 0.25,
'focal_loss_gamma': 1.5,
'huber_loss_delta': 0.1,
'box_loss_weight': 50,
},
'postprocess': {
'use_batched_nms': False,
'min_level': 3,
'max_level': 7,
'max_total_size': 100,
'nms_iou_threshold': 0.5,
'score_threshold': 0.05,
'pre_nms_num_boxes': 5000,
},
'enable_summary': True,
}
RETINANET_RESTRICTIONS = [
    'architecture.use_bfloat16 == retinanet_parser.use_bfloat16',
    'anchor.min_level == retinanet_head.min_level',
    'anchor.max_level == retinanet_head.max_level',
    'anchor.min_level == postprocess.min_level',
    'anchor.max_level == postprocess.max_level',
    'retinanet_head.num_classes == retinanet_loss.num_classes',
]
my_retinanet.yaml
type: 'retinanet'
train:
  train_file_pattern: "/IMG2/object_detection/tf_records/107/train.tfrecords-?????-of-00010"
eval:
  eval_file_pattern: "/IMG2/object_detection/tf_records/107/val.tfrecords-?????-of-00010"
start_model_train.sh
cd /home/user1/tf_2/models
export PYTHONPATH=$PYTHONPATH:`pwd`:`pwd`/slim
MODEL_DIR="/IMG2/test/tf_2_test/tf_models/107/245"
python official/vision/detection/main.py \
  --strategy_type="mirrored" \
  --model_dir="${MODEL_DIR?}" \
  --mode=train \
  --config_file="official/vision/detection/my_retinanet.yaml"
Hello all, I was wondering if anybody has found a solution to this problem?
Hi all,
Updating to TF 2.1 has helped with the memory, yay!
But I am still not able to train. Any help is appreciated:
I0211 20:13:12.788090 139968991819584 distributed_executor.py:411] Training started
Traceback (most recent call last):
File "official/vision/detection/main.py", line 250, in dataset.cache().take(k).repeat()
. You should use dataset.take(k).cache().repeat()
instead.
2020-02-11 20:25:58.680960: W tensorflow/core/kernels/data/generator_dataset_op.cc:103] Error occurred when finalizing GeneratorDataset iterator: Cancelled: Operation was cancelled
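For what it's worth, the ordering that warning complains about can be reproduced with a toy tf.data pipeline; this is an illustrative sketch, not the actual pipeline in main.py:

import tensorflow as tf

ds = tf.data.Dataset.range(10)

# Problematic ordering: cache() is placed before take(k), so the cache is never
# fully written and TensorFlow discards the partial cache with a warning.
bad = ds.cache().take(5).repeat(2)

# Recommended ordering: truncate with take(k) first, cache the truncated
# dataset, then repeat. The cache completes on the first pass and is reused.
good = ds.take(5).cache().repeat(2)

print(list(good.as_numpy_iterator()))  # [0, 1, 2, 3, 4, 0, 1, 2, 3, 4]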
How can you make this Object Detection API work with TF2? I thought it was for TF 1.x only.
For the NaN issue, could you try some training tricks to stabilize the training process, like using gradient_clip_norm or a smaller learning rate? Debugging a NaN issue is usually difficult. It would be easier if you could trace back to the original tensor whose numerics exploded.
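As an illustration of the gradient-clipping suggestion, here is a minimal sketch using the clipnorm argument of a Keras optimizer; the learning rate and clip value below are placeholder numbers, not values taken from the detection config:

import tensorflow as tf

# Hypothetical values for illustration; tune them for the actual model.
optimizer = tf.keras.optimizers.SGD(
    learning_rate=0.01,  # try something lower than the config's 0.08
    momentum=0.9,
    nesterov=True,
    clipnorm=10.0)       # clip each gradient tensor's norm before applying it

# A global-norm variant inside a custom training step would look like:
#   grads = tape.gradient(loss, model.trainable_variables)
#   grads, _ = tf.clip_by_global_norm(grads, 10.0)
#   optimizer.apply_gradients(zip(grads, model.trainable_variables))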
Closing this issue since it is no longer about memory. We can use the other issue to track the NaN problem.
System information
- Have I written custom code (as opposed to using example directory): used https://github.com/tensorflow/models/tree/master/official/vision/detection; changed strategy to "mirrored" to run on GPU/CPU, reduced batch size to 1 to start with the minimum, and changed the location to my custom tfrecords. All other code unchanged.
- OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Ubuntu 18.04.2 LTS (bionic)
- TensorFlow backend (yes / no): yes
- TensorFlow version: 2.0
- Keras version: 2.3.1
- Python version: 3.7.3
- CUDA/cuDNN version: N/A
- GPU model and memory: N/A

Describe the current behavior: training eats up all the server memory (>240 GB). Loading a model once results in continually increasing memory usage.
Describe the expected behavior: Calling model train should not result in any permanent increase in memory usage.
Code to reproduce the issue: the same code as available at https://github.com/tensorflow/models/tree/master/official/vision/detection, except run locally with the strategy set to "mirrored".