tensorflow / tensor2tensor

Library of deep learning models and datasets designed to make deep learning more accessible and accelerate ML research.
Apache License 2.0

*bug* Hyperparameter tuning does not yield best parameters #825

Open hadyan-tvlk opened 6 years ago

hadyan-tvlk commented 6 years ago

Hi guys,

I'm trying to run hyperparameter tuning from this example: https://github.com/GoogleCloudPlatform/training-data-analyst/blob/master/courses/machine_learning/deepdive/09_sequence/poetry.ipynb

From the example, we're expected to get the best parameters from a specific trial:

{
  "trialId": "37",
  "hyperparameters": {
    "hp_num_hidden_layers": "4",
    "hp_learning_rate": "0.026711152525921437",
    "hp_hidden_size": "512",
    "hp_attention_dropout": "0.60589466163419292"
  },
  "finalMetric": {
    "trainingStep": "8000",
    "objectiveValue": 0.0276162791997
  }
}

But after running it with the latest version of T2T, it doesn't show anything like the above JSON, just a score for each trial. Is there anything I'm missing? Thanks in advance.

rsepassi commented 6 years ago

Can you provide the command you used to launch, and what the JSON looks like for the various runs in the ML Engine dashboard for that job?

hadyan-tvlk commented 6 years ago

Hi @rsepassi,

Thanks for the response. I'm using exactly the command provided in the tutorial:

DATADIR=gs://${BUCKET}/poetry/data
OUTDIR=gs://${BUCKET}/poetry/model_hparam
JOBNAME=poetry_$(date -u +%y%m%d_%H%M%S)
echo $OUTDIR $REGION $JOBNAME
gsutil -m rm -rf $OUTDIR
echo "Y" | t2t-trainer \
  --data_dir=gs://${BUCKET}/poetry/subset \
  --t2t_usr_dir=./poetry/trainer \
  --problem=$PROBLEM \
  --model=transformer \
  --hparams_set=transformer_poetry \
  --output_dir=$OUTDIR \
  --hparams_range=transformer_poetry_range \
  --autotune_objective='metrics-poetry_line_problem/accuracy_per_sequence' \
  --autotune_maximize \
  --autotune_max_trials=4 \
  --autotune_parallel_trials=4 \
  --train_steps=7500 --cloud_mlengine --worker_gpu=4
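
As a quick sanity check before digging into the tuning output (a sketch, not part of the original tutorial; it assumes TensorBoard is installed and can read the GCS bucket), you can confirm that a tag matching --autotune_objective actually appears among the eval summaries written under OUTDIR:

# Point TensorBoard at the job's output directory; the tag passed to
# --autotune_objective must appear among the eval summaries for the
# tuning service to be able to score trials.
tensorboard --logdir=$OUTDIR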

The JSON logs look fine for each trial, nothing special. Here are the last lines of the logs, to show that no best-score params are yielded at the end:

[image: 1] https://user-images.githubusercontent.com/34705256/40459548-3712be1e-5f2c-11e8-9ad6-e377b507b6ec.png

I have already run it several times and still get nothing.

rsepassi commented 6 years ago

Ah, it won't be in each job's logs but rather in the entry in the ML Engine dashboard for the whole hyperparameter tuning job.
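
The same per-trial summary can also be pulled from the command line (a sketch, assuming the Cloud SDK of that era): gcloud ml-engine jobs describe returns the full job resource, whose trainingOutput section is what the dashboard entry shows.

# Fetch the job resource as JSON; for a tuning job, trainingOutput holds
# the trials list, and each trial gains a finalMetric entry once its
# objective metric has been recorded.
gcloud ml-engine jobs describe $JOBNAME --format=json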

hadyan-tvlk commented 6 years ago

Ah, I see. Do you mean this, @rsepassi?

Parameter Input

{
  "scaleTier": "CUSTOM",
  "masterType": "complex_model_m_p100",
  "packageUris": [
    "gs://test_t2t/poetry/model_hparam/tensor2tensor_tmp.tar.gz",
    "gs://test_t2t/poetry/model_hparam/t2t_usr_container.tar.gz"
  ],
  "pythonModule": "tensor2tensor.bin.t2t_trainer",
  "args": [
    "--eval_steps=100",
    "--cloud_tpu=False",
    "--hparams_range=transformer_poetry_range",
    "--decode_hparams=",
    "--sync=False",
    "--eval_run_autoregressive=False",
    "--eval_use_test_set=False",
    "--only_use_ae_for_policy=False",
    "--worker_id=0",
    "--eval_early_stopping_metric_minimize=True",
    "--worker_replicas=1",
    "--worker_gpu_memory_fraction=0.95",
    "--train_steps=2000",
    "--cloud_tpu_name=test-tpu",
    "--locally_shard_to_cpu=False",
    "--iterations_per_loop=100",
    "--registry_help=False",
    "--worker_gpu=4",
    "--keep_checkpoint_max=20",
    "--save_checkpoints_secs=0",
    "--gpu_order=",
    "--master=",
    "--generate_data=False",
    "--intra_op_parallelism_threads=0",
    "--enable_graph_rewriter=False",
    "--eval_early_stopping_metric=loss",
    "--output_dir=gs://test_t2t/poetry/model_hparam",
    "--profile=False",
    "--ps_job=/job:ps",
    "--tmp_dir=/tmp/t2t_datagen",
    "--schedule=continuous_train_and_eval",
    "--inter_op_parallelism_threads=0",
    "--hparams=",
    "--use_tpu=False",
    "--eval_early_stopping_metric_delta=0.1",
    "--ps_gpu=0",
    "--tfdbg=False",
    "--local_eval_frequency=1000",
    "--data_dir=gs://test_t2t/poetry/subset",
    "--ps_replicas=0",
    "--export_saved_model=False",
    "--problem=poetry_line_problem",
    "--log_device_placement=False",
    "--hparams_set=transformer_poetry",
    "--dbgprofile=False",
    "--timit_paths=",
    "--cloud_skip_confirmation=False",
    "--cloud_delete_on_done=False",
    "--tpu_num_shards=8",
    "--cloud_vm_name=test-vm",
    "--parsing_path=",
    "--worker_job=/job:localhost",
    "--model=transformer",
    "--keep_checkpoint_every_n_hours=10000",
    "--t2t_usr_dir",
    "t2t_usr_dir_internal"
  ],
  "hyperparameters": {
    "goal": "MAXIMIZE",
    "params": [
      {
        "parameterName": "hp_hidden_size",
        "type": "DISCRETE",
        "discreteValues": [
          128,
          256,
          512
        ]
      },
      {
        "parameterName": "hp_learning_rate",
        "minValue": 0.05,
        "maxValue": 0.25,
        "type": "DOUBLE",
        "scaleType": "UNIT_LOG_SCALE"
      },
      {
        "parameterName": "hp_attention_dropout",
        "minValue": 0.4,
        "maxValue": 0.7,
        "type": "DOUBLE"
      },
      {
        "parameterName": "hp_num_hidden_layers",
        "minValue": 2,
        "maxValue": 4,
        "type": "INTEGER"
      }
    ],
    "maxTrials": 4,
    "maxParallelTrials": 4,
    "hyperparameterMetricTag": "metrics-poetry_line_problem/accuracy_per_sequence"
  },
  "region": "asia-east1",
  "runtimeVersion": "1.8",
  "jobDir": "gs://test_t2t/poetry/model_hparam",
  "pythonVersion": "2.7"
}

Parameter Output

{
  "completedTrialCount": "4",
  "trials": [
    {
      "trialId": "1",
      "hyperparameters": {
        "hp_hidden_size": "128",
        "hp_learning_rate": "0.10203632059457049",
        "hp_num_hidden_layers": "4",
        "hp_attention_dropout": "0.52901200589059827"
      }
    },
    {
      "trialId": "2",
      "hyperparameters": {
        "hp_attention_dropout": "0.64617604866780931",
        "hp_hidden_size": "256",
        "hp_learning_rate": "0.18905077512294322",
        "hp_num_hidden_layers": "4"
      }
    },
    {
      "trialId": "3",
      "hyperparameters": {
        "hp_attention_dropout": "0.58885243185235137",
        "hp_hidden_size": "128",
        "hp_learning_rate": "0.10596887917921334",
        "hp_num_hidden_layers": "4"
      }
    },
    {
      "trialId": "4",
      "hyperparameters": {
        "hp_attention_dropout": "0.59207490095311122",
        "hp_hidden_size": "128",
        "hp_learning_rate": "0.06655300061633318",
        "hp_num_hidden_layers": "4"
      }
    }
  ],
  "consumedMLUnits": 25.32,
  "isHyperparameterTuningJob": true
}
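
Note that every trial above lacks a finalMetric field, which is exactly the symptom being reported. A quick way to confirm this across trials (a sketch, assuming jq is installed and that the dashboard's "Parameter Output" mirrors the job's trainingOutput field):

# Print each trial's id next to its finalMetric; a null finalMetric means
# ML Engine never received the objective metric for that trial.
gcloud ml-engine jobs describe $JOBNAME --format=json \
  | jq '.trainingOutput.trials[] | {trialId, finalMetric}'
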
rsepassi commented 6 years ago

Yes. Do you see the metric you specified with the autotune objective flag somewhere on that page?

hadyan-tvlk commented 6 years ago

@rsepassi Yes, I can see it in the JSON training input specification:

"maxTrials": 4,
    "maxParallelTrials": 4,
    "hyperparameterMetricTag": "metrics-poetry_line_problem/accuracy_per_sequence"
hadyan-tvlk commented 6 years ago

Sorry to ping you again, @rsepassi. I'm still unable to solve this issue. Any ideas?

orimosenzonkami commented 5 years ago

I have the same problem...