isobar-us / multilabel-image-classification-tensorflow

MIT License

Tensorflow requirements? #2

Closed vade closed 5 years ago

vade commented 5 years ago

Hello

Firstly, thanks for putting this work out in the wild! I'm curious to get an environment set up to run a multi-label classification training session, and from experience I know there are sometimes particularities around versions.

Was this built / run against a particular version of Tensorflow? Any other requirements?

Thank you!

rhossei2 commented 5 years ago

Hello,

We've tested this project against Tensorflow 1.9.0. It may run fine against newer versions of Tensorflow, but we haven't tested that ourselves.
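(Not from the repo, just a minimal sanity check you could run in your environment to confirm the installed TensorFlow matches the tested version:)

import tensorflow as tf

# Minimal sanity check: the repo was tested against TF 1.9.0, so fail loudly
# if the installed version differs.
print('Using TensorFlow', tf.__version__)
assert tf.__version__.startswith('1.9.'), 'Tested against 1.9.0; found ' + tf.__version__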

vade commented 5 years ago

thank you!

acg93-pixel commented 5 years ago

works well for me on tf1.12!

rishabhindoria commented 5 years ago

@rhossei2 @anacej With tf-gpu 1.12 (FROM tensorflow/tensorflow:1.12.0-gpu-py3) as the base in dockerfile.gpu, SageMaker training completes but the artifact-uploading step fails with exit code -6. Any idea?

rhossei2 commented 5 years ago

@rishabhindoria In my experience, artifact upload failures are due to insufficient disk space on your SageMaker training instance. Try adding a few additional GB of space from the configuration page of your training job.
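(For reference, a minimal sketch of how the extra space could also be requested from the v1 SageMaker Python SDK, the same style of Estimator used further down this thread; train_volume_size is in GB, and the image/output values here are placeholders, not from the repo:)

from sagemaker import get_execution_role
from sagemaker.estimator import Estimator

role = get_execution_role()

# Sketch only: same kind of Estimator as in the snippet later in this thread,
# but with a larger EBS volume attached to the training instance
# (train_volume_size defaults to 30 GB in the v1 SDK).
estimator = Estimator(role=role,
                      train_instance_count=1,
                      train_instance_type='ml.p2.xlarge',
                      train_volume_size=100,               # extra scratch space in GB
                      image_name='<your ECR image>',       # placeholder
                      output_path='s3://<bucket>/output')  # placeholder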

rishabhindoria commented 5 years ago

@rhossei2 I have 30 GB of space for the job and have tried larger sizes too, but I get the same error. Have you tried running this same repo on SageMaker?

Failure reason
AlgorithmError: Exception during training: Return Code: -11, CMD: ['/usr/bin/python3', '/opt/ml/code/tensorflow-models/research/object_detection/model_main.py', '--model_dir', '/opt/ml/model', '--pipeline_config_path', '/opt/ml/input/data/config/configuration.config', '--num_train_steps', '120']
Traceback (most recent call last):
File "/opt/ml/code/train", line 102, in <module>
commandline_util.run_python_script(training_script, default_params)
File "/opt/ml/code/utils/commandline_util.py", line 34, in run_python_script
run(script_cmd)
File "/opt/ml/code/utils/commandline_util.py", line 27, in run
raise Exception(error_msg)
Exception: Return Code: -11, CMD: ['/usr/bin/python3', '/opt/ml/code/tensorflow-models/research/object_detection/model_main.py', '--model_dir', '/opt/ml/model', '--pipeline_config_path', '/opt/ml/input/data/config/configuration.config', '--num_train_steps', '120']

rhossei2 commented 5 years ago

@rishabhindoria We use this repo internally and it works fine for us. To help, I need more info from you:

  1. What errors do you see in Cloudwatch logs for your training job?
  2. What type of training instance do you use? (ml.p2.xlarge, etc.)

rishabhindoria commented 5 years ago

@rhossei2 Sure, it's ml.p2.xlarge. I've also attached the full logs; it seems training hasn't even started, and I have no idea what's going on.

Using Tensorflow version: 1.12.0
Loaded training parameters: {'num_steps': '120'}
Setting number of steps to 120
Setting quantization to False
Setting image shape to 1,300,300,3
Setting inference type to FLOAT
Extracted checkpoint files: ['model.ckpt.data-00000-of-00001', 'checkpoint', '.ipynb_checkpoints', 'model.ckpt.meta', 'model.ckpt.index', 'configuration.config']
Starting the training...
WARNING:tensorflow:Forced number of epochs for all eval validations to be 1.
WARNING:tensorflow:Expected number of evaluation epochs is 1, but instead encountered `eval_on_train_input_config.num_epochs` = 0. Overwriting `num_epochs` to 1.
WARNING:tensorflow:Estimator's model_fn (<function create_model_fn.<locals>.model_fn at 0x7faa68e13158>) includes params argument, but params are not passed to Estimator.
WARNING:tensorflow:num_readers has been reduced to 1 to match input file shards.
WARNING:tensorflow:From /opt/ml/code/tensorflow-models/research/object_detection/builders/dataset_builder.py:80: parallel_interleave (from tensorflow.contrib.data.python.ops.interleave_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use `tf.data.experimental.parallel_interleave(...)`.
WARNING:tensorflow:From /usr/local/lib/python3.5/dist-packages/tensorflow/python/ops/sparse_ops.py:1165: sparse_to_dense (from tensorflow.python.ops.sparse_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Create a `tf.sparse.SparseTensor` and use `tf.sparse.to_dense` instead.
WARNING:tensorflow:From /opt/ml/code/tensorflow-models/research/object_detection/builders/dataset_builder.py:152: batch_and_drop_remainder (from tensorflow.contrib.data.python.ops.batching) is deprecated and will be removed in a future version.
Instructions for updating:
Use `tf.data.Dataset.batch(..., drop_remainder=True)`.
WARNING:tensorflow:From /opt/ml/code/tensorflow-models/research/object_detection/predictors/heads/box_head.py:93: calling reduce_mean (from tensorflow.python.ops.math_ops) with keep_dims is deprecated and will be removed in a future version.
Instructions for updating:
keep_dims is deprecated, use keepdims instead
WARNING:tensorflow:From /opt/ml/code/tensorflow-models/research/object_detection/meta_architectures/faster_rcnn_meta_arch.py:2298: get_or_create_global_step (from tensorflow.contrib.framework.python.ops.variables) is deprecated and will be removed in a future version.
Instructions for updating:
Please switch to tf.train.get_or_create_global_step
WARNING:tensorflow:From /opt/ml/code/tensorflow-models/research/object_detection/core/losses.py:345: softmax_cross_entropy_with_logits (from tensorflow.python.ops.nn_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Future major versions of TensorFlow will allow gradients to flow
into the labels input on backprop by default.
See `tf.nn.softmax_cross_entropy_with_logits_v2`.
2019-05-27 07:53:47.314165: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
2019-05-27 07:53:47.504619: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:964] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2019-05-27 07:53:47.505063: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1432] Found device 0 with properties:
name: Tesla K80 major: 3 minor: 7 memoryClockRate(GHz): 0.8755
pciBusID: 0000:00:1e.0
totalMemory: 11.17GiB freeMemory: 11.10GiB
2019-05-27 07:53:47.505098: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1511] Adding visible gpu devices: 0
2019-05-27 07:53:49.556909: I tensorflow/core/common_runtime/gpu/gpu_device.cc:982] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-05-27 07:53:49.556955: I tensorflow/core/common_runtime/gpu/gpu_device.cc:988] 0
2019-05-27 07:53:49.556963: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 0: N
2019-05-27 07:53:49.557259: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 10759 MB memory) -> physical GPU (device: 0, name: Tesla K80, pci bus id: 0000:00:1e.0, compute capability: 3.7)
Exception during training: Return Code: -11, CMD: ['/usr/bin/python3', '/opt/ml/code/tensorflow-models/research/object_detection/model_main.py', '--model_dir', '/opt/ml/model', '--pipeline_config_path', '/opt/ml/input/data/config/configuration.config', '--num_train_steps', '120']
Traceback (most recent call last):
File "/opt/ml/code/train", line 102, in <module>
commandline_util.run_python_script(training_script, default_params)
File "/opt/ml/code/utils/commandline_util.py", line 34, in run_python_script
run(script_cmd)
File "/opt/ml/code/utils/commandline_util.py", line 27, in run
raise Exception(error_msg)
Exception: Return Code: -11, CMD: ['/usr/bin/python3', '/opt/ml/code/tensorflow-models/research/object_detection/model_main.py', '--model_dir', '/opt/ml/model', '--pipeline_config_path', '/opt/ml/input/data/config/configuration.config', '--num_train_steps', '120']

rhossei2 commented 5 years ago

@rishabhindoria Do you mind posting your configuration.config file along with your training job's "channel" setting values (train, validation, etc.)?

rishabhindoria commented 5 years ago

@rhossei2 sure

from sagemaker.estimator import Estimator
from sagemaker import get_execution_role
from sagemaker.session import s3_input
role = get_execution_role()

channels=dict()
channels["train"]=s3_input("s3://***/train.record")
channels["validation"]=s3_input("s3://***/val.record")
channels["label"]=s3_input("s3://***/label_map.pbtxt")
channels["config"]=s3_input("s3://***/faster_rcnn_inception_resnet_v2_atrous_coco_2018_01_28/configuration.config")
channels["checkpoint"]=s3_input("s3://***/faster_rcnn_inception_resnet_v2_atrous_coco_2018_01_28")

ecr_image = '***'

hyperparameters = {'num_steps': 120}

instance_type = 'ml.p2.xlarge'

estimator = Estimator(role=role,
                      train_instance_count=1,
                      train_instance_type=instance_type,
                      image_name=ecr_image,
                      hyperparameters=hyperparameters,
                      output_path="s3://***/output")

estimator.fit(channels)

############################################################################

# Faster R-CNN with Inception Resnet v2, Atrous version;
# Configured for MSCOCO Dataset.
# Users should configure the fine_tune_checkpoint field in the train config as
# well as the label_map_path and input_path fields in the train_input_reader and
# eval_input_reader. Search for "PATH_TO_BE_CONFIGURED" to find the fields that
# should be configured.

model {
  faster_rcnn {
    num_classes: 10
    image_resizer {
      keep_aspect_ratio_resizer {
        min_dimension: 500
        max_dimension: 500
      }
    }
    feature_extractor {
      type: 'faster_rcnn_inception_resnet_v2'
      first_stage_features_stride: 8
    }
    first_stage_anchor_generator {
      grid_anchor_generator {
        scales: [0.25, 0.5, 1.0, 2.0]
        aspect_ratios: [0.5, 1.0, 2.0]
        height_stride: 8
        width_stride: 8
      }
    }
    first_stage_atrous_rate: 2
    first_stage_box_predictor_conv_hyperparams {
      op: CONV
      regularizer {
        l2_regularizer {
          weight: 0.0
        }
      }
      initializer {
        truncated_normal_initializer {
          stddev: 0.01
        }
      }
    }
    first_stage_nms_score_threshold: 0.0
    first_stage_nms_iou_threshold: 0.7
    first_stage_max_proposals: 300
    first_stage_localization_loss_weight: 2.0
    first_stage_objectness_loss_weight: 1.0
    initial_crop_size: 17
    maxpool_kernel_size: 1
    maxpool_stride: 1
    second_stage_box_predictor {
      mask_rcnn_box_predictor {
        use_dropout: false
        dropout_keep_probability: 1.0
        fc_hyperparams {
          op: FC
          regularizer {
            l2_regularizer {
              weight: 0.0
            }
          }
          initializer {
            variance_scaling_initializer {
              factor: 1.0
              uniform: true
              mode: FAN_AVG
            }
          }
        }
      }
    }
    second_stage_post_processing {
      batch_non_max_suppression {
        score_threshold: 0.0
        iou_threshold: 0.4
        max_detections_per_class: 100
        max_total_detections: 100
      }
      score_converter: SOFTMAX
    }
    second_stage_localization_loss_weight: 2.0
    second_stage_classification_loss_weight: 1.0
  }
}

train_config: {
  batch_size: 2
  optimizer {
    momentum_optimizer: {
      learning_rate: {
        manual_step_learning_rate {
          initial_learning_rate: 0.01
          schedule {
            step: 900000
            learning_rate: .00003
          }
          schedule {
            step: 1200000
            learning_rate: .000003
          }
        }
      }
      momentum_optimizer_value: 0.9
    }
    use_moving_average: false
  }
  gradient_clipping_by_norm: 10.0
  fine_tune_checkpoint: "/opt/ml/input/data/checkpoint/model.ckpt"
  from_detection_checkpoint: true
  # Note: The below line limits the training process to 500 steps, which we
  # empirically found to be sufficient enough to train the dataset. This
  # effectively bypasses the learning rate schedule (the learning rate will
  # never decay). Remove the below line to train indefinitely.
  num_steps: 36000
  data_augmentation_options {
    random_horizontal_flip {
    }
  }
}

train_input_reader: {
  tf_record_input_reader {
    input_path: "/opt/ml/input/data/train/train.record"
  }
  label_map_path: "/opt/ml/input/data/label/label_map.pbtxt"
}

eval_config: {
  num_examples: 20
  # Note: The below line limits the evaluation process to 10 evaluations.
  # Remove the below line to evaluate indefinitely.
  max_evals: 10
  metrics_set: "coco_detection_metrics"
  include_metrics_per_category: true
}

eval_input_reader: {
  tf_record_input_reader {
    input_path: "/opt/ml/input/data/validation/val.record"
  }
  label_map_path: "/opt/ml/input/data/label/label_map.pbtxt"
  shuffle: false
  num_readers: 1
}

rhossei2 commented 5 years ago

Thanks for the info!

Looking at what you provided, I feel the problem might lie in the checkpoint files. It looks like you replaced the checkpoint's pipeline.config with your own configuration.config? You'll need to keep the checkpoint files as they are when you download them and provide your own configuration.config file in a separate location. Basically, when you download the checkpoint files, don't modify or delete them.
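(A rough sketch of the layout that seems to be suggested here; bucket names and prefixes below are placeholders, not from the repo. The checkpoint prefix keeps the files exactly as downloaded, including pipeline.config, and your edited configuration.config lives under a separate prefix:)

from sagemaker.session import s3_input

# Placeholders only: the checkpoint channel points at the untouched download
# (model.ckpt.*, checkpoint, pipeline.config), the config channel points at
# your own edited configuration.config stored elsewhere.
channels = dict()
channels["checkpoint"] = s3_input("s3://<bucket>/faster_rcnn_inception_resnet_v2_atrous_coco_2018_01_28/")
channels["config"] = s3_input("s3://<bucket>/my-config/configuration.config")
# plus the train, validation, and label channels as in the earlier snippet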

rishabhindoria commented 5 years ago

@rhossei2 Got this error this time after moving the file: tensorflow/core/common_runtime/bfc_allocator.cc:458] Check failed: c->in_use() && (c->bin_num == kInvalidBinNum) :(

rhossei2 commented 5 years ago

@rishabhindoria You moved configuration.config? If so, that's good, but what I was also suggesting is that you seem to be missing the pipeline.config that comes with your checkpoint. Each checkpoint folder you download from Tensorflow's Github comes with a pipeline.config which you shouldn't delete or modify. I noticed that by looking through your log file: Extracted checkpoint files: ['model.ckpt.data-00000-of-00001', 'checkpoint', '.ipynb_checkpoints', 'model.ckpt.meta', 'model.ckpt.index', 'configuration.config'] - notice how there is no pipeline.config. I have a feeling that's preventing the checkpoint from being restored.

rishabhindoria commented 5 years ago

@rhossei2 Actually, even moving it isn't required. Yeah, I know the checkpoints from Tensorflow's Github come with a pipeline.config; in our case pipeline.config and configuration.config are one and the same thing. The code looks at absolute paths in whichever file you provide, e.g. /opt/ml/input/data/checkpoint/model.ckpt, /opt/ml/input/data/train/train.record and /opt/ml/input/data/label/label_map.pbtxt. I just can't figure out how it is working for you.

The actual training starts from this cmd:

tfrecord_config_path = os.path.join(input_path, 'config/configuration.config')
default_params = ['--model_dir', str(model_path),
                          '--pipeline_config_path', str(tfrecord_config_path),
                          '--num_train_steps', str(num_steps_hyperparam)]
print('Starting the training...')

so pipeline.config is not even in the picture here

You're not even using anything from the extracted files, and it isn't required either: tfrecord_pretrained_checkpoint_path = os.path.join(input_path, 'checkpoint/')

So tfrecord_config_path = os.path.join(input_path, 'config/configuration.config') is all that matters. SageMaker downloads the checkpoint folder automatically into the checkpoint channel, so your TF code can load /opt/ml/input/data/checkpoint/model.ckpt directly during training, picking up the required 'model.ckpt.data-00000-of-00001', 'checkpoint', 'model.ckpt.meta' and 'model.ckpt.index' from that folder.
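(For context, a rough reconstruction, not the repo's actual code, of what the entrypoint snippet above plus the traceback suggest is happening; names like input_path and num_steps_hyperparam are taken from the snippet, and the real commandline_util may differ:)

import os
import subprocess

# Reconstruction only, based on the snippet above and the traceback.
input_path = '/opt/ml/input/data'
model_path = '/opt/ml/model'
num_steps_hyperparam = 120

tfrecord_config_path = os.path.join(input_path, 'config/configuration.config')
cmd = ['/usr/bin/python3',
       '/opt/ml/code/tensorflow-models/research/object_detection/model_main.py',
       '--model_dir', str(model_path),
       '--pipeline_config_path', str(tfrecord_config_path),
       '--num_train_steps', str(num_steps_hyperparam)]

print('Starting the training...')
return_code = subprocess.call(cmd)

# A negative return code means the child process was killed by that signal:
# -11 is SIGSEGV (a native crash inside TensorFlow), -6 is SIGABRT.
if return_code != 0:
    raise Exception('Return Code: {}, CMD: {}'.format(return_code, cmd))

In other words, the -11 in the failure reason points at a native segfault in the child process rather than an ordinary Python error, which is why nothing useful shows up in the Python-level logs.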

rhossei2 commented 5 years ago

@rishabhindoria You can try another model architecture, like one of the SSDs, to see if your issue is isolated to faster_rcnn_inception_resnet_v2. We've tested against most of the SSDs and faster_rcnn_resnet50_coco without a problem.

rishabhindoria commented 5 years ago

@rhossei2 Not yet. I trained the same model with tf 1.13.1-gpu-py3 after people suggested using it for the error tensorflow/core/common_runtime/bfc_allocator.cc:458] Check failed: c->in_use() && (c->bin_num == kInvalidBinNum), and training completed successfully, but creating the endpoint failed because SageMaker does not have the drivers for CUDA 10.0!!

rishabhindoria commented 5 years ago

@rhossei2 Raising a support ticket with them now.

rhossei2 commented 5 years ago

@rishabhindoria I was actually upgrading our docker images to tf 1.13.1-gpu a few days back and I too noticed SageMaker doesn't support CUDA 10 yet, which made me sad. They claim they're in the process of adding support for it, but I'm not sure when.

rishabhindoria commented 5 years ago

@rhossei2 Just checked, they fixed it; tf 1.13.1-gpu works now, so you might wanna close this?

rhossei2 commented 5 years ago

@rishabhindoria Thanks for the update! Closing.