SageMaker training with imdb sample data throws the error:
ValueError: Target size (torch.Size([4, 2])) must be the same as input size (torch.Size([8, 2]))
Run log:
2020-03-28 01:35:52 Starting - Starting the training job...
2020-03-28 01:35:54 Starting - Launching requested ML instances......
2020-03-28 01:36:55 Starting - Preparing the instances for training......
2020-03-28 01:38:17 Downloading - Downloading input data...
2020-03-28 01:38:29 Training - Downloading the training image...............
2020-03-28 01:41:09 Training - Training image download completed. Training in progress.Starting the training.
/opt/ml/input/data/training/config/training_config.json
{'run_text': 'imdb_reviews', 'finetuned_model': None, 'do_lower_case': 'True', 'train_file': 'train_sample.csv', 'val_file': 'val_sample.csv', 'label_file': 'labels.csv', 'text_col': 'text', 'label_col': '["label"]', 'multi_label': 'False', 'grad_accumulation_steps': '1', 'fp16_opt_level': 'O1', 'fp16': 'False', 'model_type': 'roberta', 'model_name': 'roberta-base', 'logging_steps': '300'}
{'train_batch_size': '8', 'warmup_steps': '500', 'lr': '0.005', 'max_seq_length': '512', 'optimizer_type': 'lamb', 'lr_schedule': 'warmup_cosine', 'epochs': '6'}
03/28/2020 01:41:12 - INFO - root - model path used /opt/ml/code/pretrained_models/roberta-base
03/28/2020 01:41:12 - INFO - root - finetuned model not available - loading standard pretrained model
03/28/2020 01:41:12 - INFO - transformers.tokenization_utils - Model name '/opt/ml/code/pretrained_models/roberta-base' not found in model shortcut name list (roberta-base, roberta-large, roberta-large-mnli, distilroberta-base, roberta-base-openai-detector, roberta-large-openai-detector). Assuming '/opt/ml/code/pretrained_models/roberta-base' is a path, a model identifier, or url to a directory containing tokenizer files.
03/28/2020 01:41:12 - INFO - transformers.tokenization_utils - Didn't find file /opt/ml/code/pretrained_models/roberta-base/added_tokens.json. We won't load it.
03/28/2020 01:41:12 - INFO - transformers.tokenization_utils - Didn't find file /opt/ml/code/pretrained_models/roberta-base/special_tokens_map.json. We won't load it.
03/28/2020 01:41:12 - INFO - transformers.tokenization_utils - Didn't find file /opt/ml/code/pretrained_models/roberta-base/tokenizer_config.json. We won't load it.
03/28/2020 01:41:12 - INFO - transformers.tokenization_utils - loading file /opt/ml/code/pretrained_models/roberta-base/vocab.json
03/28/2020 01:41:12 - INFO - transformers.tokenization_utils - loading file /opt/ml/code/pretrained_models/roberta-base/merges.txt
03/28/2020 01:41:12 - INFO - transformers.tokenization_utils - loading file None
03/28/2020 01:41:12 - INFO - transformers.tokenization_utils - loading file None
03/28/2020 01:41:12 - INFO - transformers.tokenization_utils - loading file None
03/28/2020 01:41:12 - INFO - root - Number of GPUs: 4
03/28/2020 01:41:12 - INFO - root - label columns: ['label']
03/28/2020 01:41:12 - INFO - root - Writing example 0 of 100
03/28/2020 01:41:12 - INFO - root - Saving features into cached file /opt/ml/input/data/training/cache/cached_roberta_train_multi_label_512_train_sample.csv
03/28/2020 01:41:13 - INFO - root - Writing example 0 of 50
03/28/2020 01:41:13 - INFO - root - Saving features into cached file /opt/ml/input/data/training/cache/cached_roberta_dev_multi_label_512_val_sample.csv
03/28/2020 01:41:13 - INFO - root - databunch labels: 2
03/28/2020 01:41:13 - INFO - root - multilabel: True, multilabel type: <class 'bool'>
03/28/2020 01:41:13 - INFO - transformers.configuration_utils - loading configuration file /opt/ml/code/pretrained_models/roberta-base/config.json
03/28/2020 01:41:13 - INFO - transformers.configuration_utils - Model config RobertaConfig {
"_num_labels": 2,
"architectures": [
"RobertaForMaskedLM"
],
"attention_probs_dropout_prob": 0.1,
"bos_token_id": 0,
"do_sample": false,
"early_stopping": false,
"eos_token_id": 2,
"finetuning_task": null,
"hidden_act": "gelu",
"hidden_dropout_prob": 0.1,
"hidden_size": 768,
"id2label": {
"0": "LABEL_0",
"1": "LABEL_1"
},
"initializer_range": 0.02,
"intermediate_size": 3072,
"is_decoder": false,
"is_encoder_decoder": false,
"label2id": {
"LABEL_0": 0,
"LABEL_1": 1
},
"layer_norm_eps": 1e-05,
"length_penalty": 1.0,
"max_length": 20,
"max_position_embeddings": 514,
"min_length": 0,
"model_type": "roberta",
"no_repeat_ngram_size": 0,
"num_attention_heads": 12,
"num_beams": 1,
"num_hidden_layers": 12,
"num_return_sequences": 1,
"output_attentions": false,
"output_hidden_states": false,
"output_past": true,
"pad_token_id": 1,
"pruned_heads": {},
"repetition_penalty": 1.0,
"temperature": 1.0,
"top_k": 50,
"top_p": 1.0,
"torchscript": false,
"type_vocab_size": 1,
"use_bfloat16": false,
"vocab_size": 50265
}
/opt/ml/code/pretrained_models/roberta-base
<class 'str'>
03/28/2020 01:41:13 - INFO - transformers.modeling_utils - loading weights file /opt/ml/code/pretrained_models/roberta-base/pytorch_model.bin
03/28/2020 01:41:19 - INFO - transformers.modeling_utils - Weights of RobertaForMultiLabelSequenceClassification not initialized from pretrained model: ['classifier.dense.weight', 'classifier.dense.bias', 'classifier.out_proj.weight', 'classifier.out_proj.bias']
03/28/2020 01:41:19 - INFO - transformers.modeling_utils - Weights from pretrained model not used in RobertaForMultiLabelSequenceClassification: ['lm_head.bias', 'lm_head.dense.weight', 'lm_head.dense.bias', 'lm_head.layer_norm.weight', 'lm_head.layer_norm.bias', 'lm_head.decoder.weight']
Selected optimization level O1: Insert automatic casts around Pytorch functions and Tensor methods.
Defaults for this optimization level are:
enabled : True
opt_level : O1
cast_model_type : None
patch_torch_functions : True
keep_batchnorm_fp32 : None
master_weights : None
loss_scale : dynamic
Processing user overrides (additional kwargs that are not None)...
After processing overrides, optimization options are:
enabled : True
opt_level : O1
cast_model_type : None
patch_torch_functions : True
keep_batchnorm_fp32 : None
master_weights : None
loss_scale : dynamic
03/28/2020 01:41:24 - INFO - root - Running training
03/28/2020 01:41:24 - INFO - root - Num examples = 100
03/28/2020 01:41:24 - INFO - root - Num Epochs = 6
03/28/2020 01:41:24 - INFO - root - Total train batch size (w. parallel, distributed & accumulation) = 32
03/28/2020 01:41:24 - INFO - root - Gradient Accumulation steps = 1
03/28/2020 01:41:24 - INFO - root - Total optimization steps = 24
Exception during training: Caught ValueError in replica 0 on device 0.
Original Traceback (most recent call last):
File "/opt/conda/lib/python3.6/site-packages/torch/nn/parallel/parallel_apply.py", line 60, in _worker
output = module(*input, **kwargs)
File "/opt/conda/lib/python3.6/site-packages/torch/nn/modules/module.py", line 532, in __call__
result = self.forward(*input, **kwargs)
File "/opt/conda/lib/python3.6/site-packages/fast_bert/modeling.py", line 127, in forward
logits.view(-1, self.num_labels), labels.view(-1, self.num_labels)
File "/opt/conda/lib/python3.6/site-packages/torch/nn/modules/module.py", line 532, in call
result = self.forward(input, kwargs)
File "/opt/conda/lib/python3.6/site-packages/torch/nn/modules/loss.py", line 601, in forward
reduction=self.reduction)
File "/opt/conda/lib/python3.6/site-packages/apex-0.1-py3.6-linux-x86_64.egg/apex/amp/wrap.py", line 28, in wrapper
return orig_fn(*new_args, **kwargs)
File "/opt/conda/lib/python3.6/site-packages/torch/nn/functional.py", line 2124, in binary_cross_entropy_with_logits
raise ValueError("Target size ({}) must be the same as input size ({})".format(target.size(), input.size()))
ValueError: Target size (torch.Size([4, 2])) must be the same as input size (torch.Size([8, 2]))
Traceback (most recent call last):
File "/opt/ml/code/train", line 220, in train
optimizer_type=hyperparameters["optimizer_type"],
File "/opt/conda/lib/python3.6/site-packages/fast_bert/learner_cls.py", line 368, in fit
outputs = self.model(**inputs)
File "/opt/conda/lib/python3.6/site-packages/torch/nn/modules/module.py", line 532, in __call__
result = self.forward(*input, **kwargs)
File "/opt/conda/lib/python3.6/site-packages/torch/nn/parallel/data_parallel.py", line 152, in forward
outputs = self.parallel_apply(replicas, inputs, kwargs)
File "/opt/conda/lib/python3.6/site-packages/torch/nn/parallel/data_parallel.py", line 162, in parallel_apply
return parallel_apply(replicas, inputs, kwargs, self.device_ids[:len(replicas)])
File "/opt/conda/lib/python3.6/site-packages/torch/nn/parallel/parallel_apply.py", line 85, in parallel_apply
output.reraise()
File "/opt/conda/lib/python3.6/site-packages/torch/_utils.py", line 394, in reraise
raise self.exc_type(msg)
ValueError: Caught ValueError in replica 0 on device 0.
Original Traceback (most recent call last):
File "/opt/conda/lib/python3.6/site-packages/torch/nn/parallel/parallel_apply.py", line 60, in _worker
output = module(*input, **kwargs)
File "/opt/conda/lib/python3.6/site-packages/torch/nn/modules/module.py", line 532, in __call__
result = self.forward(*input, **kwargs)
File "/opt/conda/lib/python3.6/site-packages/fast_bert/modeling.py", line 127, in forward
logits.view(-1, self.num_labels), labels.view(-1, self.num_labels)
File "/opt/conda/lib/python3.6/site-packages/torch/nn/modules/module.py", line 532, in call
result = self.forward(*input, *kwargs)
File "/opt/conda/lib/python3.6/site-packages/torch/nn/modules/loss.py", line 601, in forward
reduction=self.reduction)
File "/opt/conda/lib/python3.6/site-packages/apex-0.1-py3.6-linux-x86_64.egg/apex/amp/wrap.py", line 28, in wrapper
return orig_fn(*new_args, **kwargs)
File "/opt/conda/lib/python3.6/site-packages/torch/nn/functional.py", line 2124, in binary_cross_entropy_with_logits
raise ValueError("Target size ({}) must be the same as input size ({})".format(target.size(), input.size()))
ValueError: Target size (torch.Size([4, 2])) must be the same as input size (torch.Size([8, 2]))
2020-03-28 01:41:45 Uploading - Uploading generated training model
2020-03-28 01:41:45 Failed - Training job failed
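
For context, the ValueError at the bottom of the traceback comes from torch.nn.functional.binary_cross_entropy_with_logits, which requires the logits and target tensors to have identical shapes. A minimal sketch that reproduces the same message outside SageMaker (shapes chosen to mirror the log, not taken from fast-bert):

import torch
import torch.nn.functional as F

logits = torch.randn(8, 2)                    # model output: batch of 8, 2 labels
labels = torch.randint(0, 2, (4, 2)).float()  # targets: only 4 rows

# Raises the same error as the job:
# ValueError: Target size (torch.Size([4, 2])) must be the same as input size (torch.Size([8, 2]))
F.binary_cross_entropy_with_logits(logits, labels)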
For reference, the hyperparameters, training config, and estimator used to launch the job:
hyperparameters = {
    "epochs": 6, "lr": 5e-3, "max_seq_length": 512, "train_batch_size": 8,
    "lr_schedule": "warmup_cosine", "warmup_steps": 500, "optimizer_type": "lamb",
}

training_config = {
    "run_text": "imdb_reviews", "finetuned_model": None, "do_lower_case": "True",
    "train_file": "train_sample.csv", "val_file": "val_sample.csv", "label_file": "labels.csv",
    "text_col": "text", "label_col": '["label"]', "multi_label": "False",
    "grad_accumulation_steps": "1", "fp16_opt_level": "O1", "fp16": "False",
    "model_type": "roberta", "model_name": "roberta-base", "logging_steps": "300",
}

estimator = sagemaker.estimator.Estimator(
    image,
    role,
    train_instance_count=1,
    train_instance_type="ml.p3.8xlarge",
    output_path=bucket_path_output,
    base_job_name="imdb-reviews",
    hyperparameters=hyperparameters,
    sagemaker_session=session,
)
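
The job itself is launched with estimator.fit(). The sketch below shows one way the inputs could be staged, assuming (from the container paths in the log) that the 'training' channel holds the CSVs plus a config/training_config.json; the bucket name and S3 prefixes are placeholders, not taken from the original notebook:

import json

# Write the training config locally; the container reads it back from
# /opt/ml/input/data/training/config/training_config.json (per the log above).
with open("training_config.json", "w") as f:
    json.dump(training_config, f)

# Placeholder bucket/prefix; session is the sagemaker.Session passed to the estimator.
# The train/val/label CSVs are assumed to be uploaded under the same prefix.
session.upload_data("training_config.json", bucket=bucket, key_prefix="imdb/train/config")
bucket_path_train = "s3://{}/imdb/train".format(bucket)

# Launching the job; this is the call that fails in the notebook once the training job errors out.
estimator.fit({"training": bucket_path_train})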
The notebook cell that called estimator.fit() then ends with:
UnexpectedStatusException Traceback (most recent call last)