utterworks / fast-bert

Super easy library for BERT based NLP models
Apache License 2.0

SageMaker training fails with error: UnexpectedStatusException: Error for Training job: Failed. Reason: AlgorithmError: Exception during training: 'PosixPath' object has no attribute 'decode' #175

Open pssnew2pro opened 4 years ago

pssnew2pro commented 4 years ago

I tried following this tutorial: https://medium.com/@kaushaltrivedi/train-and-deploy-mighty-transformer-nlp-models-using-fastbert-and-aws-sagemaker-cc4303c51cf3

The training fails; please see below for the full stack trace:

2020-02-11 01:09:23 Starting - Starting the training job...
2020-02-11 01:09:25 Starting - Launching requested ML instances......
2020-02-11 01:10:27 Starting - Preparing the instances for training......
2020-02-11 01:11:41 Downloading - Downloading input data...
2020-02-11 01:12:12 Training - Downloading the training image...............
2020-02-11 01:14:53 Training - Training image download completed. Training in progress...
Starting the training.
/opt/ml/input/data/training/config/training_config.json
{'run_text': 'ALL_CMNT_CLEAN', 'finetuned_model': None, 'do_lower_case': 'True', 'train_file': 'train.csv', 'val_file': 'test.csv', 'label_file': 'labels.csv', 'text_col': 'ALL_CMNT_CLEAN', 'label_col': 'RPR_ACTN_DS', 'multi_label': 'True', 'grad_accumulation_steps': '1', 'fp16_opt_level': 'O1', 'fp16': 'True', 'model_type': 'roberta', 'model_name': 'roberta-base', 'logging_steps': '300'}
{'train_batch_size': '16', 'warmup_steps': '1000', 'lr': '8e-05', 'max_seq_length': '512', 'optimizer_type': 'adamw', 'lr_schedule': 'warmup_cosine', 'epochs': '10'}
02/11/2020 01:14:56 - INFO - root - model path used /opt/ml/code/pretrained_models/roberta-base
02/11/2020 01:14:56 - INFO - root - finetuned model not available - loading standard pretrained model
02/11/2020 01:14:56 - INFO - transformers.tokenization_utils - Model name '/opt/ml/code/pretrained_models/roberta-base' not found in model shortcut name list (roberta-base, roberta-large, roberta-large-mnli, distilroberta-base, roberta-base-openai-detector, roberta-large-openai-detector). Assuming '/opt/ml/code/pretrained_models/roberta-base' is a path, a model identifier, or url to a directory containing tokenizer files.
Exception during training: 'PosixPath' object has no attribute 'decode'
Traceback (most recent call last):
  File "/opt/ml/code/train", line 138, in train
    PRETRAINED_PATH, do_lower_case=bool(training_config["do_lower_case"])
  File "/opt/conda/lib/python3.7/site-packages/transformers/tokenization_utils.py", line 309, in from_pretrained
    return cls._from_pretrained(*inputs, **kwargs)
  File "/opt/conda/lib/python3.7/site-packages/transformers/tokenization_utils.py", line 339, in _from_pretrained
    if os.path.isfile(pretrained_model_name_or_path) or is_remote_url(pretrained_model_name_or_path):
  File "/opt/conda/lib/python3.7/site-packages/transformers/file_utils.py", line 143, in is_remote_url
    parsed = urlparse(url_or_filename)
  File "/opt/conda/lib/python3.7/urllib/parse.py", line 367, in urlparse
    url, scheme, _coerce_result = _coerce_args(url, scheme)
  File "/opt/conda/lib/python3.7/urllib/parse.py", line 123, in _coerce_args
    return _decode_args(args) + (_encode_result,)
  File "/opt/conda/lib/python3.7/urllib/parse.py", line 107, in _decode_args
    return tuple(x.decode(encoding, errors) if x else '' for x in args)
  File "/opt/conda/lib/python3.7/urllib/parse.py", line 107, in <genexpr>
    return tuple(x.decode(encoding, errors) if x else '' for x in args)
AttributeError: 'PosixPath' object has no attribute 'decode'

2020-02-11 01:15:05 Uploading - Uploading generated training model
2020-02-11 01:15:05 Failed - Training job failed


UnexpectedStatusException Traceback (most recent call last)

in ()
----> 1 estimator.fit(s3_input)

~/anaconda3/envs/python3/lib/python3.6/site-packages/sagemaker/estimator.py in fit(self, inputs, wait, logs, job_name, experiment_config)
    462         self.jobs.append(self.latest_training_job)
    463         if wait:
--> 464             self.latest_training_job.wait(logs=logs)
    465
    466     def _compilation_job_name(self):

~/anaconda3/envs/python3/lib/python3.6/site-packages/sagemaker/estimator.py in wait(self, logs)
   1060         # If logs are requested, call logs_for_jobs.
   1061         if logs != "None":
-> 1062             self.sagemaker_session.logs_for_job(self.job_name, wait=True, log_type=logs)
   1063         else:
   1064             self.sagemaker_session.wait_for_job(self.job_name)

~/anaconda3/envs/python3/lib/python3.6/site-packages/sagemaker/session.py in logs_for_job(self, job_name, wait, poll, log_type)
   3001
   3002         if wait:
-> 3003             self._check_job_status(job_name, description, "TrainingJobStatus")
   3004             if dot:
   3005                 print()

~/anaconda3/envs/python3/lib/python3.6/site-packages/sagemaker/session.py in _check_job_status(self, job, desc, status_key_name)
   2595             ),
   2596             allowed_statuses=["Completed", "Stopped"],
-> 2597             actual_status=status,
   2598         )
   2599

UnexpectedStatusException: Error for Training job nls-ntf-2020-02-11-01-09-23-620: Failed. Reason: AlgorithmError: Exception during training: 'PosixPath' object has no attribute 'decode'
Traceback (most recent call last):
  File "/opt/ml/code/train", line 138, in train
    PRETRAINED_PATH, do_lower_case=bool(training_config["do_lower_case"])
  File "/opt/conda/lib/python3.7/site-packages/transformers/tokenization_utils.py", line 309, in from_pretrained
    return cls._from_pretrained(*inputs, **kwargs)
  File "/opt/conda/lib/python3.7/site-packages/transformers/tokenization_utils.py", line 339, in _from_pretrained
    if os.path.isfile(pretrained_model_name_or_path) or is_remote_url(pretrained_model_name_or_path):
  File "/opt/conda/lib/python3.7/site-packages/transformers/file_utils.py", line 143, in is_remote_url
    parsed = urlparse(url_or_filename)
  File "/opt/conda/lib/python3.7/urllib/parse.py", line 367, in urlparse
    url, scheme, _coerce_result = _coerce_args(url, scheme)
  File "/opt/conda/lib/python3.7/urllib/parse.py", line 123, in _coerce_args
    return _decode_args(args) + (_encode_result
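
For reference, the failure can be reproduced outside SageMaker with just the standard library. This is a minimal sketch, assuming the container's Python 3.7: urlparse only accepts str or bytes, so a pathlib path falls into the bytes-decoding branch of _coerce_args and ends up calling .decode() on the PosixPath.

from pathlib import Path
from urllib.parse import urlparse

p = Path("/opt/ml/code/pretrained_models/roberta-base")
urlparse(p)       # AttributeError: 'PosixPath' object has no attribute 'decode'
urlparse(str(p))  # fine once the path is converted to a plain string
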
splevine commented 4 years ago

I am getting a similar error message:

02/13/2020 16:16:53 - INFO - root - model path used /opt/ml/code/pretrained_models/roberta-base
02/13/2020 16:16:53 - INFO - root - finetuned model not available - loading standard pretrained model
02/13/2020 16:16:53 - INFO - transformers.tokenization_utils - Model name '/opt/ml/code/pretrained_models/roberta-base' not found in model shortcut name list (roberta-base, roberta-large, roberta-large-mnli, distilroberta-base, roberta-base-openai-detector, roberta-large-openai-detector). Assuming '/opt/ml/code/pretrained_models/roberta-base' is a path, a model identifier, or url to a directory containing tokenizer files.
Exception during training: 'PosixPath' object has no attribute 'decode'
Traceback (most recent call last):
  File "./train", line 138, in train
    PRETRAINED_PATH, do_lower_case=bool(training_config["do_lower_case"])
  File "/opt/conda/lib/python3.7/site-packages/transformers/tokenization_utils.py", line 309, in from_pretrained
    return cls._from_pretrained(*inputs, **kwargs)
  File "/opt/conda/lib/python3.7/site-packages/transformers/tokenization_utils.py", line 339, in _from_pretrained
    if os.path.isfile(pretrained_model_name_or_path) or is_remote_url(pretrained_model_name_or_path):
  File "/opt/conda/lib/python3.7/site-packages/transformers/file_utils.py", line 143, in is_remote_url
    parsed = urlparse(url_or_filename)
  File "/opt/conda/lib/python3.7/urllib/parse.py", line 367, in urlparse
    url, scheme, _coerce_result = _coerce_args(url, scheme)
  File "/opt/conda/lib/python3.7/urllib/parse.py", line 123, in _coerce_args
    return _decode_args(args) + (_encode_result,)
  File "/opt/conda/lib/python3.7/urllib/parse.py", line 107, in _decode_args
    return tuple(x.decode(encoding, errors) if x else '' for x in args)
  File "/opt/conda/lib/python3.7/urllib/parse.py", line 107, in <genexpr>
    return tuple(x.decode(encoding, errors) if x else '' for x in args)
AttributeError: 'PosixPath' object has no attribute 'decode'

Any help would be appreciated.

Jarryd-rk commented 4 years ago

I managed to run the example by changing line 114 of the train file in container/bert from:

PRETRAINED_PATH = Path(pretrained_model_path) / training_config["model_name"]

to:

PRETRAINED_PATH = pretrained_model_path + '/' + str(training_config["model_name"])

It seems urlparse can't handle being passed a PosixPath, but it can deal with a plain string.
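
A variant that keeps pathlib for joining but hands a plain string onward should also work (an untested sketch, same idea of converting before the value ever reaches urlparse):

from pathlib import Path

# Build the path with pathlib, then convert to str so transformers /
# urllib.parse.urlparse only ever see a string.
PRETRAINED_PATH = str(Path(pretrained_model_path) / training_config["model_name"])
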