huggingface / setfit

Efficient few-shot learning with Sentence Transformers
https://hf.co/docs/setfit
Apache License 2.0

Runtime error during training #477

Closed by abedini-arteriaai 10 months ago

abedini-arteriaai commented 10 months ago

Hi Team,

I've been using SetFit with a variety of datasets and base models, and this has been a persistent issue. It happens during trainer.train().

The error is RuntimeError: [enforce fail at inline_container.cc:471] . PytorchStreamWriter failed writing file data/158: file write failed. I initially thought it could be a memory or storage issue and have done the following to mitigate it:

1. trained with only 1 epoch, batch size 1, max_steps 1, and eval_max_steps 1 (see the sketch after this list)
2. reduced the dataset to only 100 samples
3. checked !df -Th, which shows plenty of free disk space
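For reference, a minimal sketch of the reduced configuration I tried. The base model and data below are placeholders; the real runs used several models and a dataset reduced to ~100 samples:

```python
from datasets import Dataset
from setfit import SetFitModel, Trainer, TrainingArguments

# Placeholder data; the real dataset was reduced to ~100 samples.
train_dataset = Dataset.from_dict(
    {"text": ["good", "bad", "great", "awful"], "label": [1, 0, 1, 0]}
)

# Placeholder base model; the error occurred with several models.
model = SetFitModel.from_pretrained("sentence-transformers/paraphrase-mpnet-base-v2")

args = TrainingArguments(
    num_epochs=1,      # a single epoch
    batch_size=1,      # batch size 1
    max_steps=1,       # a single optimization step
    eval_max_steps=1,  # a single evaluation step
)

trainer = Trainer(model=model, args=args, train_dataset=train_dataset)
trainer.train()  # raises the RuntimeError below
```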

Is this a familiar error?

---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
File /databricks/python/lib/python3.10/site-packages/torch/serialization.py:441, in save(obj, f, pickle_module, pickle_protocol, _use_new_zipfile_serialization)
    440 with _open_zipfile_writer(f) as opened_zipfile:
--> 441     _save(obj, opened_zipfile, pickle_module, pickle_protocol)
    442     return

File /databricks/python/lib/python3.10/site-packages/torch/serialization.py:668, in _save(obj, zip_file, pickle_module, pickle_protocol)
    667 num_bytes = storage.nbytes()
--> 668 zip_file.write_record(name, storage.data_ptr(), num_bytes)

RuntimeError: [enforce fail at inline_container.cc:471] . PytorchStreamWriter failed writing file data/158: file write failed

During handling of the above exception, another exception occurred:

RuntimeError                              Traceback (most recent call last)
File <command-3835036500502814>, line 1
----> 1 trainer.train()
      3 metrics = trainer.evaluate()
      5 # Change

File /local_disk0/.ephemeral_nfs/envs/pythonEnv-4c4a8968-e6fe-4ab2-9fac-b5ec2d26e6b7/lib/python3.10/site-packages/setfit/trainer.py:410, in Trainer.train(self, args, trial, **kwargs)
    405 train_parameters = self.dataset_to_parameters(self.train_dataset)
    406 full_parameters = (
    407     train_parameters + self.dataset_to_parameters(self.eval_dataset) if self.eval_dataset else train_parameters
    408 )
--> 410 self.train_embeddings(*full_parameters, args=args)
    411 self.train_classifier(*train_parameters, args=args)

File /local_disk0/.ephemeral_nfs/envs/pythonEnv-4c4a8968-e6fe-4ab2-9fac-b5ec2d26e6b7/lib/python3.10/site-packages/setfit/trainer.py:462, in Trainer.train_embeddings(self, x_train, y_train, x_eval, y_eval, args)
    459 logger.info(f"  Total optimization steps = {total_train_steps}")
    461 warmup_steps = math.ceil(total_train_steps * args.warmup_proportion)
--> 462 self._train_sentence_transformer(
    463     self.model.model_body,
    464     train_dataloader=train_dataloader,
    465     eval_dataloader=eval_dataloader,
    466     args=args,
    467     loss_func=loss_func,
    468     warmup_steps=warmup_steps,
    469 )

File /local_disk0/.ephemeral_nfs/envs/pythonEnv-4c4a8968-e6fe-4ab2-9fac-b5ec2d26e6b7/lib/python3.10/site-packages/setfit/trainer.py:655, in Trainer._train_sentence_transformer(self, model_body, train_dataloader, eval_dataloader, args, loss_func, warmup_steps)
    652 self.state.epoch = epoch + (step + 1) / steps_per_epoch
    653 self.control = self.callback_handler.on_step_end(args, self.state, self.control)
--> 655 self.maybe_log_eval_save(model_body, eval_dataloader, args, scheduler_obj, loss_func, loss_value)
    657 if self.control.should_epoch_stop or self.control.should_training_stop:
    658     break

File /local_disk0/.ephemeral_nfs/envs/pythonEnv-4c4a8968-e6fe-4ab2-9fac-b5ec2d26e6b7/lib/python3.10/site-packages/setfit/trainer.py:715, in Trainer.maybe_log_eval_save(self, model_body, eval_dataloader, args, scheduler_obj, loss_func, loss_value)
    712     loss_func.train()
    714 if self.control.should_save:
--> 715     checkpoint_dir = self._checkpoint(self.args.output_dir, args.save_total_limit, self.state.global_step)
    716     self.control = self.callback_handler.on_save(self.args, self.state, self.control)
    718     if eval_loss is not None and (self.state.best_metric is None or eval_loss < self.state.best_metric):

File /local_disk0/.ephemeral_nfs/envs/pythonEnv-4c4a8968-e6fe-4ab2-9fac-b5ec2d26e6b7/lib/python3.10/site-packages/setfit/trainer.py:770, in Trainer._checkpoint(self, checkpoint_path, checkpoint_save_total_limit, step)
    767         shutil.rmtree(old_checkpoints[0]["path"])
    769 checkpoint_file_path = str(Path(checkpoint_path) / f"step_{step}")
--> 770 self.model.save_pretrained(checkpoint_file_path)
    771 return checkpoint_file_path

File /databricks/python/lib/python3.10/site-packages/huggingface_hub/utils/_deprecation.py:31, in _deprecate_positional_args.<locals>._inner_deprecate_positional_args.<locals>.inner_f(*args, **kwargs)
     29 extra_args = len(args) - len(all_args)
     30 if extra_args <= 0:
---> 31     return f(*args, **kwargs)
     32 # extra_args > 0
     33 args_msg = [
     34     f"{name}='{arg}'" if isinstance(arg, str) else f"{name}={arg}"
     35     for name, arg in zip(kwonly_args[:extra_args], args[-extra_args:])
     36 ]

File /databricks/python/lib/python3.10/site-packages/huggingface_hub/hub_mixin.py:64, in ModelHubMixin.save_pretrained(self, save_directory, config, repo_id, push_to_hub, **kwargs)
     61 save_directory.mkdir(parents=True, exist_ok=True)
     63 # saving model weights/files
---> 64 self._save_pretrained(save_directory)
     66 # saving config
     67 if isinstance(config, dict):

File /local_disk0/.ephemeral_nfs/envs/pythonEnv-4c4a8968-e6fe-4ab2-9fac-b5ec2d26e6b7/lib/python3.10/site-packages/setfit/modeling.py:695, in SetFitModel._save_pretrained(self, save_directory)
    685     json.dump(
    686         {
    687             attr_name: getattr(self, attr_name)
   (...)
    692         indent=2,
    693     )
    694 # Save the body
--> 695 self.model_body.save(path=save_directory, create_model_card=False)
    696 # Save the README
    697 self.create_model_card(path=save_directory, model_name=save_directory)

File /databricks/python/lib/python3.10/site-packages/sentence_transformers/SentenceTransformer.py:375, in SentenceTransformer.save(self, path, model_name, create_model_card, train_datasets)
    372         model_path = os.path.join(path, str(idx)+"_"+type(module).__name__)
    374     os.makedirs(model_path, exist_ok=True)
--> 375     module.save(model_path)
    376     modules_config.append({'idx': idx, 'name': name, 'path': os.path.basename(model_path), 'type': type(module).__module__})
    378 with open(os.path.join(path, 'modules.json'), 'w') as fOut:

File /databricks/python/lib/python3.10/site-packages/sentence_transformers/models/Transformer.py:121, in Transformer.save(self, output_path)
    120 def save(self, output_path: str):
--> 121     self.auto_model.save_pretrained(output_path)
    122     self.tokenizer.save_pretrained(output_path)
    124     with open(os.path.join(output_path, 'sentence_bert_config.json'), 'w') as fOut:

File /databricks/python/lib/python3.10/site-packages/transformers/modeling_utils.py:1847, in PreTrainedModel.save_pretrained(self, save_directory, is_main_process, state_dict, save_function, push_to_hub, max_shard_size, safe_serialization, variant, **kwargs)
   1845         safe_save_file(shard, os.path.join(save_directory, shard_file), metadata={"format": "pt"})
   1846     else:
-> 1847         save_function(shard, os.path.join(save_directory, shard_file))
   1849 if index is None:
   1850     path_to_weights = os.path.join(save_directory, _add_variant(WEIGHTS_NAME, variant))

File /databricks/python/lib/python3.10/site-packages/torch/serialization.py:440, in save(obj, f, pickle_module, pickle_protocol, _use_new_zipfile_serialization)
    437 _check_save_filelike(f)
    439 if _use_new_zipfile_serialization:
--> 440     with _open_zipfile_writer(f) as opened_zipfile:
    441         _save(obj, opened_zipfile, pickle_module, pickle_protocol)
    442         return

File /databricks/python/lib/python3.10/site-packages/torch/serialization.py:291, in _open_zipfile_writer_file.__exit__(self, *args)
    290 def __exit__(self, *args) -> None:
--> 291     self.file_like.write_end_of_file()

RuntimeError: [enforce fail at inline_container.cc:337] . unexpected pos 201721536 vs 201721424
tomaarsen commented 10 months ago

Hello!

Hmm, I haven't seen this error before. It seems to occur when checkpointing during training. By default, this happens every 500 steps (see docs here).
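
If you want to keep checkpointing but make it less frequent (or bound how many checkpoints stay on disk), you can tweak the save arguments; a sketch, assuming setfit's TrainingArguments:

```python
from setfit import TrainingArguments

args = TrainingArguments(
    output_dir="checkpoints",  # where step_<N> checkpoints are written
    save_strategy="steps",     # the default: checkpoint during training
    save_steps=2000,           # e.g. every 2000 steps instead of 500
    save_total_limit=1,        # keep only the most recent checkpoint
)
```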

Other users have experienced similar issues (https://stackoverflow.com/questions/64206070/pytorch-runtimeerror-enforce-fail-at-inline-container-cc209-file-not-fou), and they attributed it to either insufficient disk space at the save location or a corrupted, partially written checkpoint file.

Alternatively, you can set save_strategy="no" to prevent the model from doing any checkpointing during training. That should help, although you will likely still want to save the final model, and that save might still crash if something is corrupted.
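
Concretely, that could look like this (the model, data, and paths are illustrative):

```python
from datasets import Dataset
from setfit import SetFitModel, Trainer, TrainingArguments

train_dataset = Dataset.from_dict(
    {"text": ["good", "bad"], "label": [1, 0]}  # placeholder data
)
model = SetFitModel.from_pretrained("sentence-transformers/paraphrase-mpnet-base-v2")

# No intermediate checkpoints, so nothing is written to disk during training.
args = TrainingArguments(save_strategy="no")

trainer = Trainer(model=model, args=args, train_dataset=train_dataset)
trainer.train()

# Saving the final model is a separate write and can still fail
# if the disk is full or the file gets corrupted mid-write.
model.save_pretrained("final_model")
```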

abedini-arteriaai commented 10 months ago

Thank you for the tip, changing save_strategy solved the issue. It does seem to be disk-space related.
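
For anyone who hits this later: a quick way to confirm free space on the filesystem backing the training output directory, using only the standard library:

```python
import shutil

# Free space (in GiB) on the filesystem holding the current directory;
# point this at your checkpoint/output directory if it lives elsewhere.
free_gib = shutil.disk_usage(".").free / 2**30
print(f"{free_gib:.1f} GiB free")
```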