utterworks / fast-bert

Super easy library for BERT based NLP models
Apache License 2.0

ValueError: Target size (torch.Size([4, 2])) must be the same as input size (torch.Size([8, 2])) #189

Open katyov opened 4 years ago

katyov commented 4 years ago

SageMaker training with the imdb sample data throws the error: `ValueError: Target size (torch.Size([4, 2])) must be the same as input size (torch.Size([8, 2]))`

```python
hyperparameters = {
    "epochs": 6,
    "lr": 5e-3,
    "max_seq_length": 512,
    "train_batch_size": 8,
    "lr_schedule": "warmup_cosine",
    "warmup_steps": 500,
    "optimizer_type": "lamb",
}

training_config = {
    "run_text": "imdb_reviews",
    "finetuned_model": None,
    "do_lower_case": "True",
    "train_file": "train_sample.csv",
    "val_file": "val_sample.csv",
    "label_file": "labels.csv",
    "text_col": "text",
    "label_col": '["label"]',
    "multi_label": "False",
    "grad_accumulation_steps": "1",
    "fp16_opt_level": "O1",
    "fp16": "False",
    "model_type": "roberta",
    "model_name": "roberta-base",
    "logging_steps": "300",
}

estimator = sagemaker.estimator.Estimator(
    image,
    role,
    train_instance_count=1,
    train_instance_type="ml.p3.8xlarge",
    output_path=bucket_path_output,
    base_job_name="imdb-reviews",
    hyperparameters=hyperparameters,
    sagemaker_session=session,
)
```
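
For completeness, the estimator call above uses a few names (`image`, `role`, `bucket_path_output`, `session`, and later `s3_input`) that are defined elsewhere in the notebook. A minimal sketch of that setup, with placeholder ECR/S3 values that are assumptions rather than values from the original job:

```python
import sagemaker
from sagemaker import get_execution_role

session = sagemaker.Session()
role = get_execution_role()  # IAM role the training job runs under

# Placeholder ECR image URI for the custom fast-bert training container.
image = "<account-id>.dkr.ecr.<region>.amazonaws.com/fast-bert:latest"

bucket_path_output = "s3://<bucket>/imdb-reviews/output"
s3_input = "s3://<bucket>/imdb-reviews/input"  # later passed to estimator.fit(s3_input)
```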

Run log:

```
2020-03-28 01:35:52 Starting - Starting the training job...
2020-03-28 01:35:54 Starting - Launching requested ML instances......
2020-03-28 01:36:55 Starting - Preparing the instances for training......
2020-03-28 01:38:17 Downloading - Downloading input data...
2020-03-28 01:38:29 Training - Downloading the training image...............
2020-03-28 01:41:09 Training - Training image download completed. Training in progress.
Starting the training.
/opt/ml/input/data/training/config/training_config.json
{'run_text': 'imdb_reviews', 'finetuned_model': None, 'do_lower_case': 'True', 'train_file': 'train_sample.csv', 'val_file': 'val_sample.csv', 'label_file': 'labels.csv', 'text_col': 'text', 'label_col': '["label"]', 'multi_label': 'False', 'grad_accumulation_steps': '1', 'fp16_opt_level': 'O1', 'fp16': 'False', 'model_type': 'roberta', 'model_name': 'roberta-base', 'logging_steps': '300'}
{'train_batch_size': '8', 'warmup_steps': '500', 'lr': '0.005', 'max_seq_length': '512', 'optimizer_type': 'lamb', 'lr_schedule': 'warmup_cosine', 'epochs': '6'}
03/28/2020 01:41:12 - INFO - root - model path used /opt/ml/code/pretrained_models/roberta-base
03/28/2020 01:41:12 - INFO - root - finetuned model not available - loading standard pretrained model
03/28/2020 01:41:12 - INFO - transformers.tokenization_utils - Model name '/opt/ml/code/pretrained_models/roberta-base' not found in model shortcut name list (roberta-base, roberta-large, roberta-large-mnli, distilroberta-base, roberta-base-openai-detector, roberta-large-openai-detector). Assuming '/opt/ml/code/pretrained_models/roberta-base' is a path, a model identifier, or url to a directory containing tokenizer files.
03/28/2020 01:41:12 - INFO - transformers.tokenization_utils - Didn't find file /opt/ml/code/pretrained_models/roberta-base/added_tokens.json. We won't load it.
03/28/2020 01:41:12 - INFO - transformers.tokenization_utils - Didn't find file /opt/ml/code/pretrained_models/roberta-base/special_tokens_map.json. We won't load it.
03/28/2020 01:41:12 - INFO - transformers.tokenization_utils - Didn't find file /opt/ml/code/pretrained_models/roberta-base/tokenizer_config.json. We won't load it.
03/28/2020 01:41:12 - INFO - transformers.tokenization_utils - loading file /opt/ml/code/pretrained_models/roberta-base/vocab.json
03/28/2020 01:41:12 - INFO - transformers.tokenization_utils - loading file /opt/ml/code/pretrained_models/roberta-base/merges.txt
03/28/2020 01:41:12 - INFO - transformers.tokenization_utils - loading file None
03/28/2020 01:41:12 - INFO - transformers.tokenization_utils - loading file None
03/28/2020 01:41:12 - INFO - transformers.tokenization_utils - loading file None
03/28/2020 01:41:12 - INFO - root - Number of GPUs: 4
03/28/2020 01:41:12 - INFO - root - label columns: ['label']
03/28/2020 01:41:12 - INFO - root - Writing example 0 of 100
03/28/2020 01:41:12 - INFO - root - Saving features into cached file /opt/ml/input/data/training/cache/cached_roberta_train_multi_label_512_train_sample.csv
03/28/2020 01:41:13 - INFO - root - Writing example 0 of 50
03/28/2020 01:41:13 - INFO - root - Saving features into cached file /opt/ml/input/data/training/cache/cached_roberta_dev_multi_label_512_val_sample.csv
03/28/2020 01:41:13 - INFO - root - databunch labels: 2
03/28/2020 01:41:13 - INFO - root - multilabel: True, multilabel type: <class 'bool'>
03/28/2020 01:41:13 - INFO - transformers.configuration_utils - loading configuration file /opt/ml/code/pretrained_models/roberta-base/config.json
03/28/2020 01:41:13 - INFO - transformers.configuration_utils - Model config RobertaConfig { "_num_labels": 2, "architectures": [ "RobertaForMaskedLM" ], "attention_probs_dropout_prob": 0.1, "bos_token_id": 0, "do_sample": false, "early_stopping": false, "eos_token_id": 2, "finetuning_task": null, "hidden_act": "gelu", "hidden_dropout_prob": 0.1, "hidden_size": 768, "id2label": { "0": "LABEL_0", "1": "LABEL_1" }, "initializer_range": 0.02, "intermediate_size": 3072, "is_decoder": false, "is_encoder_decoder": false, "label2id": { "LABEL_0": 0, "LABEL_1": 1 }, "layer_norm_eps": 1e-05, "length_penalty": 1.0, "max_length": 20, "max_position_embeddings": 514, "min_length": 0, "model_type": "roberta", "no_repeat_ngram_size": 0, "num_attention_heads": 12, "num_beams": 1, "num_hidden_layers": 12, "num_return_sequences": 1, "output_attentions": false, "output_hidden_states": false, "output_past": true, "pad_token_id": 1, "pruned_heads": {}, "repetition_penalty": 1.0, "temperature": 1.0, "top_k": 50, "top_p": 1.0, "torchscript": false, "type_vocab_size": 1, "use_bfloat16": false, "vocab_size": 50265 }
```
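
One detail worth flagging in the log above: the databunch reports `multilabel: True` even though `training_config` sets `"multi_label": "False"`. All of the config values arrive as strings, and if the training script converts the flag with a plain `bool()` cast (an assumption; the parsing code is not shown in this thread), any non-empty string, including `"False"`, becomes `True`:

```python
# Any non-empty string is truthy in Python, so a naive bool() cast
# turns the string "False" into the boolean True.
multi_label = bool("False")
print(multi_label)  # True

# A safer parse for string-valued config flags:
multi_label = "False".strip().lower() in ("true", "1", "yes")
print(multi_label)  # False
```

The cached-features file names (`cached_roberta_train_multi_label_512_...`) point the same way.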

```
/opt/ml/code/pretrained_models/roberta-base <class 'str'>
03/28/2020 01:41:13 - INFO - transformers.modeling_utils - loading weights file /opt/ml/code/pretrained_models/roberta-base/pytorch_model.bin
03/28/2020 01:41:19 - INFO - transformers.modeling_utils - Weights of RobertaForMultiLabelSequenceClassification not initialized from pretrained model: ['classifier.dense.weight', 'classifier.dense.bias', 'classifier.out_proj.weight', 'classifier.out_proj.bias']
03/28/2020 01:41:19 - INFO - transformers.modeling_utils - Weights from pretrained model not used in RobertaForMultiLabelSequenceClassification: ['lm_head.bias', 'lm_head.dense.weight', 'lm_head.dense.bias', 'lm_head.layer_norm.weight', 'lm_head.layer_norm.bias', 'lm_head.decoder.weight']
Selected optimization level O1: Insert automatic casts around Pytorch functions and Tensor methods.
```

```
Defaults for this optimization level are:
enabled                : True
opt_level              : O1
cast_model_type        : None
patch_torch_functions  : True
keep_batchnorm_fp32    : None
master_weights         : None
loss_scale             : dynamic
Processing user overrides (additional kwargs that are not None)...
After processing overrides, optimization options are:
enabled                : True
opt_level              : O1
cast_model_type        : None
patch_torch_functions  : True
keep_batchnorm_fp32    : None
master_weights         : None
loss_scale             : dynamic
03/28/2020 01:41:24 - INFO - root - Running training
03/28/2020 01:41:24 - INFO - root - Num examples = 100
03/28/2020 01:41:24 - INFO - root - Num Epochs = 6
03/28/2020 01:41:24 - INFO - root - Total train batch size (w. parallel, distributed & accumulation) = 32
03/28/2020 01:41:24 - INFO - root - Gradient Accumulation steps = 1
03/28/2020 01:41:24 - INFO - root - Total optimization steps = 24
Exception during training: Caught ValueError in replica 0 on device 0.
Original Traceback (most recent call last):
  File "/opt/conda/lib/python3.6/site-packages/torch/nn/parallel/parallel_apply.py", line 60, in _worker
    output = module(*input, **kwargs)
  File "/opt/conda/lib/python3.6/site-packages/torch/nn/modules/module.py", line 532, in __call__
    result = self.forward(*input, **kwargs)
  File "/opt/conda/lib/python3.6/site-packages/fast_bert/modeling.py", line 127, in forward
    logits.view(-1, self.num_labels), labels.view(-1, self.num_labels)
  File "/opt/conda/lib/python3.6/site-packages/torch/nn/modules/module.py", line 532, in __call__
    result = self.forward(*input, **kwargs)
  File "/opt/conda/lib/python3.6/site-packages/torch/nn/modules/loss.py", line 601, in forward
    reduction=self.reduction)
  File "/opt/conda/lib/python3.6/site-packages/apex-0.1-py3.6-linux-x86_64.egg/apex/amp/wrap.py", line 28, in wrapper
    return orig_fn(*new_args, **kwargs)
  File "/opt/conda/lib/python3.6/site-packages/torch/nn/functional.py", line 2124, in binary_cross_entropy_with_logits
    raise ValueError("Target size ({}) must be the same as input size ({})".format(target.size(), input.size()))
ValueError: Target size (torch.Size([4, 2])) must be the same as input size (torch.Size([8, 2]))
```

```
Traceback (most recent call last):
  File "/opt/ml/code/train", line 220, in train
    optimizer_type=hyperparameters["optimizer_type"],
  File "/opt/conda/lib/python3.6/site-packages/fast_bert/learner_cls.py", line 368, in fit
    outputs = self.model(**inputs)
  File "/opt/conda/lib/python3.6/site-packages/torch/nn/modules/module.py", line 532, in __call__
    result = self.forward(*input, **kwargs)
  File "/opt/conda/lib/python3.6/site-packages/torch/nn/parallel/data_parallel.py", line 152, in forward
    outputs = self.parallel_apply(replicas, inputs, kwargs)
  File "/opt/conda/lib/python3.6/site-packages/torch/nn/parallel/data_parallel.py", line 162, in parallel_apply
    return parallel_apply(replicas, inputs, kwargs, self.device_ids[:len(replicas)])
  File "/opt/conda/lib/python3.6/site-packages/torch/nn/parallel/parallel_apply.py", line 85, in parallel_apply
    output.reraise()
  File "/opt/conda/lib/python3.6/site-packages/torch/_utils.py", line 394, in reraise
    raise self.exc_type(msg)
ValueError: Caught ValueError in replica 0 on device 0.
Original Traceback (most recent call last):
  File "/opt/conda/lib/python3.6/site-packages/torch/nn/parallel/parallel_apply.py", line 60, in _worker
    output = module(*input, **kwargs)
  File "/opt/conda/lib/python3.6/site-packages/torch/nn/modules/module.py", line 532, in __call__
    result = self.forward(*input, **kwargs)
  File "/opt/conda/lib/python3.6/site-packages/fast_bert/modeling.py", line 127, in forward
    logits.view(-1, self.num_labels), labels.view(-1, self.num_labels)
  File "/opt/conda/lib/python3.6/site-packages/torch/nn/modules/module.py", line 532, in __call__
    result = self.forward(*input, **kwargs)
  File "/opt/conda/lib/python3.6/site-packages/torch/nn/modules/loss.py", line 601, in forward
    reduction=self.reduction)
  File "/opt/conda/lib/python3.6/site-packages/apex-0.1-py3.6-linux-x86_64.egg/apex/amp/wrap.py", line 28, in wrapper
    return orig_fn(*new_args, **kwargs)
  File "/opt/conda/lib/python3.6/site-packages/torch/nn/functional.py", line 2124, in binary_cross_entropy_with_logits
    raise ValueError("Target size ({}) must be the same as input size ({})".format(target.size(), input.size()))
ValueError: Target size (torch.Size([4, 2])) must be the same as input size (torch.Size([8, 2]))
```
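
Putting the two logs together, the shapes are consistent with single-label integer targets being routed down the multi-label code path: per replica the batch is 8, the logits come out as `[8, 2]`, and a label tensor of shape `[8]` reshaped with `labels.view(-1, num_labels)` collapses to `[4, 2]`. A minimal sketch reproducing the error under that assumption (this mimics, rather than quotes, the `view` call at `fast_bert/modeling.py` line 127):

```python
import torch
import torch.nn.functional as F

num_labels = 2
logits = torch.randn(8, num_labels)          # per-replica logits: [8, 2]
labels = torch.randint(0, num_labels, (8,))  # single-label class ids: [8]

# On the multi-label path the integer labels get viewed as [-1, num_labels],
# collapsing [8] into [4, 2] while the logits stay [8, 2]...
targets = labels.view(-1, num_labels).float()  # torch.Size([4, 2])

# ...so BCE-with-logits raises exactly the reported error:
# ValueError: Target size (torch.Size([4, 2])) must be the same as
# input size (torch.Size([8, 2]))
F.binary_cross_entropy_with_logits(logits, targets)
```

If that reading is right, a plausible fix is to make sure `multi_label` reaches the learner as an actual boolean `False`, so training uses the single-label cross-entropy path instead of BCE-with-logits.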

```
2020-03-28 01:41:45 Uploading - Uploading generated training model
2020-03-28 01:41:45 Failed - Training job failed
```

```
UnexpectedStatusException                 Traceback (most recent call last)
in ()
----> 1 estimator.fit(s3_input)

~/anaconda3/envs/python3/lib/python3.6/site-packages/sagemaker/estimator.py in fit(self, inputs, wait, logs, job_name, experiment_config)
    468         self.jobs.append(self.latest_training_job)
    469         if wait:
--> 470             self.latest_training_job.wait(logs=logs)
    471
    472     def _compilation_job_name(self):

~/anaconda3/envs/python3/lib/python3.6/site-packages/sagemaker/estimator.py in wait(self, logs)
   1061         # If logs are requested, call logs_for_jobs.
   1062         if logs != "None":
-> 1063             self.sagemaker_session.logs_for_job(self.job_name, wait=True, log_type=logs)
   1064         else:
   1065             self.sagemaker_session.wait_for_job(self.job_name)

~/anaconda3/envs/python3/lib/python3.6/site-packages/sagemaker/session.py in logs_for_job(self, job_name, wait, poll, log_type)
   3021
   3022         if wait:
-> 3023             self._check_job_status(job_name, description, "TrainingJobStatus")
   3024             if dot:
   3025                 print()

~/anaconda3/envs/python3/lib/python3.6/site-packages/sagemaker/session.py in _check_job_status(self, job, desc, status_key_name)
   2615                 ),
   2616                 allowed_statuses=["Completed", "Stopped"],
-> 2617                 actual_status=status,
   2618             )
   2619

UnexpectedStatusException: Error for Training job imdb-reviews-2020-03-28-01-35-52-697: Failed. Reason: AlgorithmError: Exception during training: Caught ValueError in replica 0 on device 0. Original Traceback (most recent call last): File "/opt/conda/lib/python3.6/site-packages/torch/nn/parallel/parallel_apply.py", line 60, in _worker output = module(*input, **kwargs) File "/opt/conda/lib/python3.6/site-packages/torch/nn/modules/module.py", line 532, in __call__ result = self.forward(*input, **kwargs) File "/opt/conda/lib/python3.6/site-packages/fast_bert/modeling.py", line 127, in forward logits.view(-1, self.num_labels), labels.view(-1, self.num_labels) File "/opt/conda/lib/python3.6/site-packages/torch/nn/modules/module.py", line 532, in __call__ result = self.forward(*input, **kwargs) File "/opt/conda/lib/python3.6/site-packages/torch/nn/modules/loss.py", line 601, in forward reduction=self.reduction) File "/opt/conda/lib/python3.6/site-packages/apex-0.1-py3.6-linux-x86_64.egg/apex/amp/wrap.py", line 28, in wrapper return orig_fn(*new_args, **kwargs) File "/opt/conda/l
```
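
(The `FailureReason` shown above is truncated by SageMaker. If anyone needs the reason string without re-running the notebook, it can also be fetched with boto3, though the complete stack trace only lives in the job's CloudWatch logs; the job name below is taken from the error message.)

```python
import boto3

# Look up the failed training job by the name from the error message.
sm = boto3.client("sagemaker")
desc = sm.describe_training_job(
    TrainingJobName="imdb-reviews-2020-03-28-01-35-52-697"
)
print(desc["TrainingJobStatus"])  # 'Failed'
print(desc["FailureReason"])      # AlgorithmError: Exception during training: ...
```
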
ssalawu commented 4 years ago

Did you manage to resolve this issue? I have a similar problem.

veilupt commented 4 years ago

@kaushaltrivedi I have the same problem as well.

RaedShabbir commented 3 years ago

Same problem here. What was the issue?