Closed: vade closed this issue 5 years ago
Hello,
we've tested this project against TensorFlow 1.9.0. It may run fine against higher versions of TensorFlow, but we haven't tested that ourselves.
thank you!
works well for me on tf1.12!
@rhossei2 @anacej with the tf-gpu-1.12 image (FROM tensorflow/tensorflow:1.12.0-gpu-py3) as the base in dockerfile.gpu, SageMaker training completes but the artifact upload step fails with exit code -6. Any idea why?
@rishabhindoria in my experience, artifact upload failures are due to insufficient disk space on your SageMaker training instance. Try adding a few more gigabytes of storage from the configuration page of your training job.
@rhossei2 I have 30 GB of space for the job and have tried larger volumes too, but I get the same error. Have you tried running this same repo on SageMaker?
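One way to test the disk-space theory is to log the free space from inside the training container before the upload step runs. The snippet below is a minimal sketch using only the Python standard library. The helper name `free_gib` is hypothetical, and which path to check is an assumption: `/opt/ml/model` is the model directory that appears in this thread's logs.

```python
import shutil


def free_gib(path):
    """Return the free space, in GiB, on the filesystem that holds `path`."""
    usage = shutil.disk_usage(path)
    return usage.free / 1024 ** 3


# "." keeps the example runnable anywhere. Inside the training container,
# a path such as /opt/ml/model (taken from the logs) would be the one to check.
print("Free space: {:.2f} GiB".format(free_gib(".")))
```

If the reported value is already well above the size of the model artifacts, disk space is less likely to be the cause.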
Failure reason
AlgorithmError: Exception during training: Return Code: -11, CMD: ['/usr/bin/python3', '/opt/ml/code/tensorflow-models/research/object_detection/model_main.py', '--model_dir', '/opt/ml/model', '--pipeline_config_path', '/opt/ml/input/data/config/configuration.config', '--num_train_steps', '120']
Traceback (most recent call last):
  File "/opt/ml/code/train", line 102, in <module>
    commandline_util.run_python_script(training_script, default_params)
  File "/opt/ml/code/utils/commandline_util.py", line 34, in run_python_script
    run(script_cmd)
  File "/opt/ml/code/utils/commandline_util.py", line 27, in run
    raise Exception(error_msg)
Exception: Return Code: -11, CMD: ['/usr/bin/python3', '/opt/ml/code/tensorflow-models/research/object_detection/model_main.py', '--model_dir', '/opt/ml/model', '--pipeline_config_path', '/opt/ml/input/data/config/configuration.config', '--num_train_steps', '120']
@rishabhindoria we use this repo internally and it works fine for us. To help, I need more info from you:
@rhossei2 sure, it's ml.p2.xlarge. I have also attached the full logs. It seems training never even starts, and I have no idea what's going on.
Using Tensorflow version: 1.12.0
Loaded training parameters: {'num_steps': '120'}
Setting number of steps to 120
Setting quantization to False
Setting image shape to 1,300,300,3
Setting inference type to FLOAT
Extracted checkpoint files: ['model.ckpt.data-00000-of-00001', 'checkpoint', '.ipynb_checkpoints', 'model.ckpt.meta', 'model.ckpt.index', 'configuration.config']
Starting the training...
WARNING:tensorflow:Forced number of epochs for all eval validations to be 1.
WARNING:tensorflow:Expected number of evaluation epochs is 1, but instead encountered `eval_on_train_input_config.num_epochs` = 0. Overwriting `num_epochs` to 1.
WARNING:tensorflow:Estimator's model_fn (<function create_model_fn.<locals>.model_fn at 0x7faa68e13158>) includes params argument, but params are not passed to Estimator.
WARNING:tensorflow:num_readers has been reduced to 1 to match input file shards.
WARNING:tensorflow:From /opt/ml/code/tensorflow-models/research/object_detection/builders/dataset_builder.py:80: parallel_interleave (from tensorflow.contrib.data.python.ops.interleave_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use `tf.data.experimental.parallel_interleave(...)`.
WARNING:tensorflow:From /usr/local/lib/python3.5/dist-packages/tensorflow/python/ops/sparse_ops.py:1165: sparse_to_dense (from tensorflow.python.ops.sparse_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Create a `tf.sparse.SparseTensor` and use `tf.sparse.to_dense` instead.
WARNING:tensorflow:From /opt/ml/code/tensorflow-models/research/object_detection/builders/dataset_builder.py:152: batch_and_drop_remainder (from tensorflow.contrib.data.python.ops.batching) is deprecated and will be removed in a future version.
Instructions for updating:
Use `tf.data.Dataset.batch(..., drop_remainder=True)`.
WARNING:tensorflow:From /opt/ml/code/tensorflow-models/research/object_detection/predictors/heads/box_head.py:93: calling reduce_mean (from tensorflow.python.ops.math_ops) with keep_dims is deprecated and will be removed in a future version.
Instructions for updating:
keep_dims is deprecated, use keepdims instead
WARNING:tensorflow:From /opt/ml/code/tensorflow-models/research/object_detection/meta_architectures/faster_rcnn_meta_arch.py:2298: get_or_create_global_step (from tensorflow.contrib.framework.python.ops.variables) is deprecated and will be removed in a future version.
Instructions for updating:
Please switch to tf.train.get_or_create_global_step
WARNING:tensorflow:From /opt/ml/code/tensorflow-models/research/object_detection/core/losses.py:345: softmax_cross_entropy_with_logits (from tensorflow.python.ops.nn_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Future major versions of TensorFlow will allow gradients to flow
into the labels input on backprop by default.
See `tf.nn.softmax_cross_entropy_with_logits_v2`.
2019-05-27 07:53:47.314165: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
2019-05-27 07:53:47.504619: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:964] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2019-05-27 07:53:47.505063: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1432] Found device 0 with properties:
name: Tesla K80 major: 3 minor: 7 memoryClockRate(GHz): 0.8755
pciBusID: 0000:00:1e.0
totalMemory: 11.17GiB freeMemory: 11.10GiB
2019-05-27 07:53:47.505098: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1511] Adding visible gpu devices: 0
2019-05-27 07:53:49.556909: I tensorflow/core/common_runtime/gpu/gpu_device.cc:982] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-05-27 07:53:49.556955: I tensorflow/core/common_runtime/gpu/gpu_device.cc:988] 0
2019-05-27 07:53:49.556963: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 0: N
2019-05-27 07:53:49.557259: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 10759 MB memory) -> physical GPU (device: 0, name: Tesla K80, pci bus id: 0000:00:1e.0, compute capability: 3.7)
Exception during training: Return Code: -11, CMD: ['/usr/bin/python3', '/opt/ml/code/tensorflow-models/research/object_detection/model_main.py', '--model_dir', '/opt/ml/model', '--pipeline_config_path', '/opt/ml/input/data/config/configuration.config', '--num_train_steps', '120']
Traceback (most recent call last):
  File "/opt/ml/code/train", line 102, in <module>
    commandline_util.run_python_script(training_script, default_params)
  File "/opt/ml/code/utils/commandline_util.py", line 34, in run_python_script
    run(script_cmd)
  File "/opt/ml/code/utils/commandline_util.py", line 27, in run
    raise Exception(error_msg)
Exception: Return Code: -11, CMD: ['/usr/bin/python3', '/opt/ml/code/tensorflow-models/research/object_detection/model_main.py', '--model_dir', '/opt/ml/model', '--pipeline_config_path', '/opt/ml/input/data/config/configuration.config', '--num_train_steps', '120']
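A note on reading "Return Code: -11": if the wrapper in commandline_util.py reports a subprocess-style return code (an assumption, since its source is not shown here), a negative value means the child process was terminated by the signal with that number. The sketch below translates such codes with the standard `signal` module. The helper name `describe_return_code` is hypothetical.

```python
import signal


def describe_return_code(code):
    """Describe a subprocess-style return code in readable form."""
    if code >= 0:
        return "exited with status {}".format(code)
    try:
        # Negative codes mean "terminated by signal number -code".
        name = signal.Signals(-code).name
    except ValueError:
        name = "unknown signal"
    return "killed by signal {} ({})".format(-code, name)


print(describe_return_code(-11))
print(describe_return_code(-6))
```

On Linux, signal 11 is SIGSEGV (a segmentation fault) and signal 6 is SIGABRT (an abort). Under that reading, the process crashed in native code rather than raising a Python exception, which would explain why the log shows no Python-level error from model_main.py itself.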
@rishabhindoria Do you mind posting your configuration.config file along with your training job's "channel" setting values (train, validation, etc.)?
@rhossei2 sure
from sagemaker.estimator import Estimator
from sagemaker import get_execution_role
from sagemaker.session import s3_input
role = get_execution_role()
channels=dict()
channels["train"]=s3_input("s3://***/train.record")
channels["validation"]=s3_input("s3://***/val.record")
channels["label"]=s3_input("s3://***/label_map.pbtxt")
channels["config"]=s3_input("s3://***/faster_rcnn_inception_resnet_v2_atrous_coco_2018_01_28/configuration.config")
channels["checkpoint"]=s3_input("s3://***/faster_rcnn_inception_resnet_v2_atrous_coco_2018_01_28")
ecr_image = '***'
hyperparameters = {'num_steps': 120}
instance_type = 'ml.p2.xlarge'
estimator = Estimator(role=role,
                      train_instance_count=1,
                      train_instance_type=instance_type,
                      image_name=ecr_image,
                      hyperparameters=hyperparameters,
                      output_path="s3://***/output")
estimator.fit(channels)
############################################################################
# Faster R-CNN with Inception Resnet v2, Atrous version;
# Configured for MSCOCO Dataset.
# Users should configure the fine_tune_checkpoint field in the train config as
# well as the label_map_path and input_path fields in the train_input_reader and
# eval_input_reader. Search for "PATH_TO_BE_CONFIGURED" to find the fields that
# should be configured.
model {
  faster_rcnn {
    num_classes: 10
    image_resizer {
      keep_aspect_ratio_resizer {
        min_dimension: 500
        max_dimension: 500
      }
    }
    feature_extractor {
      type: 'faster_rcnn_inception_resnet_v2'
      first_stage_features_stride: 8
    }
    first_stage_anchor_generator {
      grid_anchor_generator {
        scales: [0.25, 0.5, 1.0, 2.0]
        aspect_ratios: [0.5, 1.0, 2.0]
        height_stride: 8
        width_stride: 8
      }
    }
    first_stage_atrous_rate: 2
    first_stage_box_predictor_conv_hyperparams {
      op: CONV
      regularizer {
        l2_regularizer {
          weight: 0.0
        }
      }
      initializer {
        truncated_normal_initializer {
          stddev: 0.01
        }
      }
    }
    first_stage_nms_score_threshold: 0.0
    first_stage_nms_iou_threshold: 0.7
    first_stage_max_proposals: 300
    first_stage_localization_loss_weight: 2.0
    first_stage_objectness_loss_weight: 1.0
    initial_crop_size: 17
    maxpool_kernel_size: 1
    maxpool_stride: 1
    second_stage_box_predictor {
      mask_rcnn_box_predictor {
        use_dropout: false
        dropout_keep_probability: 1.0
        fc_hyperparams {
          op: FC
          regularizer {
            l2_regularizer {
              weight: 0.0
            }
          }
          initializer {
            variance_scaling_initializer {
              factor: 1.0
              uniform: true
              mode: FAN_AVG
            }
          }
        }
      }
    }
    second_stage_post_processing {
      batch_non_max_suppression {
        score_threshold: 0.0
        iou_threshold: 0.4
        max_detections_per_class: 100
        max_total_detections: 100
      }
      score_converter: SOFTMAX
    }
    second_stage_localization_loss_weight: 2.0
    second_stage_classification_loss_weight: 1.0
  }
}
train_config: {
  batch_size: 2
  optimizer {
    momentum_optimizer: {
      learning_rate: {
        manual_step_learning_rate {
          initial_learning_rate: 0.01
          schedule {
            step: 900000
            learning_rate: .00003
          }
          schedule {
            step: 1200000
            learning_rate: .000003
          }
        }
      }
      momentum_optimizer_value: 0.9
    }
    use_moving_average: false
  }
  gradient_clipping_by_norm: 10.0
  fine_tune_checkpoint: "/opt/ml/input/data/checkpoint/model.ckpt"
  from_detection_checkpoint: true
  # Note: The below line limits the training process to 36000 steps, which we
  # empirically found to be sufficient to train the dataset. This
  # effectively bypasses the learning rate schedule (the learning rate will
  # never decay). Remove the below line to train indefinitely.
  num_steps: 36000
  data_augmentation_options {
    random_horizontal_flip {
    }
  }
}
train_input_reader: {
  tf_record_input_reader {
    input_path: "/opt/ml/input/data/train/train.record"
  }
  label_map_path: "/opt/ml/input/data/label/label_map.pbtxt"
}
eval_config: {
  num_examples: 20
  # Note: The below line limits the evaluation process to 10 evaluations.
  # Remove the below line to evaluate indefinitely.
  max_evals: 10
  metrics_set: "coco_detection_metrics"
  include_metrics_per_category: true
}
eval_input_reader: {
  tf_record_input_reader {
    input_path: "/opt/ml/input/data/validation/val.record"
  }
  label_map_path: "/opt/ml/input/data/label/label_map.pbtxt"
  shuffle: false
  num_readers: 1
}
Thanks for the info!
Looking at what you provided, I feel like the problem might lie in the checkpoint files. It looks like you replaced the checkpoint's pipeline.config with your own configuration.config? You'll need to keep those checkpoint files as they are when you download them and provide your own configuration.config file in a separate location. Basically, when you download the checkpoint files, don't modify or delete them.
@rhossei2 I got this error this time after moving the file: tensorflow/core/common_runtime/bfc_allocator.cc:458] Check failed: c->in_use() && (c->bin_num == kInvalidBinNum)
:(
@rishabhindoria you moved configuration.config? If so, that's good, but what I was also pointing out is that you are missing the pipeline.config that comes with your checkpoint. Each checkpoint folder you download from TensorFlow's GitHub comes with a pipeline.config, which you shouldn't delete or modify. I noticed this by looking through your log file: Extracted checkpoint files: ['model.ckpt.data-00000-of-00001', 'checkpoint', '.ipynb_checkpoints', 'model.ckpt.meta', 'model.ckpt.index', 'configuration.config'] - notice how there is no pipeline.config. I have a feeling that's preventing the checkpoint from being restored.
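The comparison described above can be automated with a small check over the "Extracted checkpoint files" list. This is only a sketch: the expected set below is an assumption assembled from the file names mentioned in this thread, and the helper name `missing_checkpoint_files` is hypothetical. Whether pipeline.config is actually required for restoring the checkpoint is the open question being discussed here.

```python
# Assumed contents of a downloaded checkpoint folder, based on this thread.
EXPECTED_CHECKPOINT_FILES = {
    "checkpoint",
    "model.ckpt.data-00000-of-00001",
    "model.ckpt.index",
    "model.ckpt.meta",
    "pipeline.config",
}


def missing_checkpoint_files(extracted):
    """Return the expected checkpoint files absent from `extracted`, sorted."""
    return sorted(EXPECTED_CHECKPOINT_FILES - set(extracted))


# File list copied from the training log posted earlier in the thread.
extracted = [
    "model.ckpt.data-00000-of-00001",
    "checkpoint",
    ".ipynb_checkpoints",
    "model.ckpt.meta",
    "model.ckpt.index",
    "configuration.config",
]
print(missing_checkpoint_files(extracted))  # ['pipeline.config']
```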
@rhossei2 actually, even moving it is not required. Yes, I know the checkpoints on TensorFlow's GitHub come with a pipeline.config, and in our case pipeline.config and configuration.config are one and the same thing. The code looks at the absolute paths in whichever file you provide, e.g. /opt/ml/input/data/checkpoint/model.ckpt, /opt/ml/input/data/train/train.record and /opt/ml/input/data/label/label_map.pbtxt. I just can't figure out how it is working for you.
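Since the config refers to absolute paths, one quick sanity check is to list every /opt/ml path it mentions and see which ones match nothing on disk inside the container. The following is a rough sketch using only the standard library. The helper names and the regular expression are hypothetical, and a plain regex is a stand-in here, not how the Object Detection API parses its config.

```python
import glob
import re

# Matches double-quoted absolute paths under /opt/ml, the prefix used in the thread.
PATH_PATTERN = re.compile(r'"(/opt/ml/[^"]+)"')


def referenced_paths(config_text):
    """Return the distinct /opt/ml paths quoted in the config, in order."""
    seen = set()
    paths = []
    for path in PATH_PATTERN.findall(config_text):
        if path not in seen:
            seen.add(path)
            paths.append(path)
    return paths


def missing_paths(config_text):
    """Return referenced paths that match nothing on disk.

    A trailing wildcard is appended so that a checkpoint prefix such as
    model.ckpt is satisfied by model.ckpt.index, model.ckpt.meta, and so on.
    """
    return [
        path
        for path in referenced_paths(config_text)
        if not glob.glob(glob.escape(path) + "*")
    ]


# Lines taken from the configuration.config posted earlier in the thread.
sample = '''
fine_tune_checkpoint: "/opt/ml/input/data/checkpoint/model.ckpt"
input_path: "/opt/ml/input/data/train/train.record"
label_map_path: "/opt/ml/input/data/label/label_map.pbtxt"
label_map_path: "/opt/ml/input/data/label/label_map.pbtxt"
'''
print(referenced_paths(sample))
```

Running `missing_paths` on the real config from inside the container, just before training starts, would show whether any of the channels landed somewhere other than where the config expects.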
the actual training starts from this cmd
tfrecord_config_path = os.path.join(input_path, 'config/configuration.config')
default_params = ['--model_dir', str(model_path),
                  '--pipeline_config_path', str(tfrecord_config_path),
                  '--num_train_steps', str(num_steps_hyperparam)]
print('Starting the training...')
so pipeline.config is not even in the picture here
You are not even using anything from the extracted files, and that's not required either (tfrecord_pretrained_checkpoint_path = os.path.join(input_path, 'checkpoint/')), so tfrecord_config_path = os.path.join(input_path, 'config/configuration.config') is all that matters. It will download the checkpoint folder automatically into the /opt/ml/checkpoint folder, which your TF code can access directly during training from /opt/ml/input/data/checkpoint/model.ckpt, loading the required 'model.ckpt.data-00000-of-00001', 'checkpoint', 'model.ckpt.meta', 'model.ckpt.index' from /opt/ml/checkpoint.
@rishabhindoria you can try another model architecture, like one of the SSDs, to see if your issue is isolated to faster_rcnn_inception_resnet_v2. We've tested against most of the SSDs and faster_rcnn_resnet50_coco without a problem.
@rhossei2 not yet. I trained the same model with tf 1.13.1-gpu-py3 after people suggested using it for this error: tensorflow/core/common_runtime/bfc_allocator.cc:458] Check failed: c->in_use() && (c->bin_num == kInvalidBinNum). Training completed successfully, but creating the endpoint failed because SageMaker does not have CUDA drivers for CUDA 10.0!!
@rhossei2 raising a support ticket with them now
@rishabhindoria I was actually upgrading our Docker images to tf 1.13.1-gpu a few days back, and I too noticed that SageMaker does not support CUDA 10, which made me sad. They claim they're in the process of adding support for it, but I'm not sure when.
@rhossei2 just checked, they fixed it. tf 1.13.1-gpu works now, so you might want to close this?
@rishabhindoria thanks for the update! closing
Hello
Firstly, thanks for putting this work out in the wild! I'm curious to get an environment set up to run a multi-label classification training session, and from experience I know there are sometimes particularities around versions.
Was this built / run against a particular version of TensorFlow? Any other requirements?
Thank you!