princeton-nlp / LLM-Shearing

[ICLR 2024] Sheared LLaMA: Accelerating Language Model Pre-training via Structured Pruning
https://arxiv.org/abs/2310.06694
MIT License

Error running CheckpointSaver.close(). Skipping CheckpointSaver.post_close() #48

Closed rzr002 closed 7 months ago

rzr002 commented 7 months ago

[batch=336/5000]:
Train time/batch: 335
Train time/sample: 85760
Train time/batch_in_epoch: 335
Train time/sample_in_epoch: 85760
Train time/token: 351272960
Train time/token_in_epoch: 351272960
Train metrics/train/academic_en_weight: 0.0733
Train metrics/train/book_en_weight: 0.2521
Train metrics/train/code_weight: 0.0122
Train metrics/train/qa_en_weight: 0.0900
Train metrics/train/webtext_en_weight: 0.4453
Train metrics/train/wiki_en_weight: 0.1271
Train memory/current_allocated_mem: 17.7600
Train memory/current_active_mem: 17.7600
Train memory/current_inactive_mem: 3.8283
Train memory/current_reserved_mem: 74.4430
Train memory/peak_allocated_mem: 60.7340
Train memory/peak_active_mem: 61.1380
Train memory/peak_inactive_mem: 27.1850
Train memory/peak_reserved_mem: 74.4430
Train memory/alloc_retries: 0
Train trainer/device_train_microbatch_size: 16
Train loss/train/total: 1.8199
Train loss/train/ce_loss: 1.8199
Train metrics/train/LanguageCrossEntropy: 1.8199
Train metrics/train/Perplexity: 6.1711
Train metrics/train/academic_en_LanguageCrossEntropy: 1.3244
Train metrics/train/academic_en_count: 9893
Train metrics/train/book_en_LanguageCrossEntropy: 1.8898
Train metrics/train/book_en_count: 20179
Train metrics/train/code_LanguageCrossEntropy: 0.8853
Train metrics/train/code_count: 4807
Train metrics/train/qa_en_LanguageCrossEntropy: 1.4436
Train metrics/train/qa_en_count: 10816
Train metrics/train/webtext_en_LanguageCrossEntropy: 2.0363
Train metrics/train/webtext_en_count: 27327
Train metrics/train/wiki_en_LanguageCrossEntropy: 1.5175
Train metrics/train/wiki_en_count: 12994
Train throughput/batches_per_sec: 0.0204
Train throughput/samples_per_sec: 5.2330
Train throughput/device/batches_per_sec: 0.0026
Train throughput/device/samples_per_sec: 0.6541
Train throughput/tokens_per_sec: 21434.4781
Train throughput/device/tokens_per_sec: 2679.3098
Train throughput/flops_per_sec: 1004697191366202.0000
Train throughput/device/flops_per_sec: 125587148920775.2500
Train time/train: 4.5453
Train time/val: 0.1311
Train time/total: 4.6764
Train lr-DecoupledAdamW/group0: 0.0001

Error running CheckpointSaver.close(). Skipping CheckpointSaver.post_close().
Traceback (most recent call last):
  File "/home/pai/lib/python3.9/site-packages/composer/core/engine.py", line 527, in _close
    callback.close(state, logger)
  File "/home/pai/lib/python3.9/site-packages/composer/callbacks/checkpoint_saver.py", line 310, in close
    self._save_checkpoint(
  File "/home/pai/lib/python3.9/site-packages/composer/callbacks/checkpoint_saver.py", line 332, in _save_checkpoint
    saved_path = checkpoint.save_checkpoint(
  File "/home/pai/lib/python3.9/site-packages/composer/utils/checkpoint.py", line 761, in save_checkpoint
    'state': state.state_dict(),
  File "/home/pai/lib/python3.9/site-packages/composer/core/state.py", line 891, in state_dict
    serialized_value = self.get_model_state_dict()
  File "/home/pai/lib/python3.9/site-packages/composer/core/state.py", line 868, in get_model_state_dict
    model_state_dict = self.model.state_dict()
  File "/home/pai/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1818, in state_dict
    module.state_dict(destination=destination, prefix=prefix + name + '.', keep_vars=keep_vars)
  File "/home/pai/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1815, in state_dict
    self._save_to_state_dict(destination, prefix, keep_vars)
  File "/home/pai/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1722, in _save_to_state_dict
    hook(self, prefix, keep_vars)
  File "/home/pai/lib/python3.9/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/home/pai/lib/python3.9/site-packages/torch/distributed/fsdp/_state_dict_utils.py", line 669, in _pre_state_dict_hook
    _pre_state_dict_hook_fn[fsdp_state._state_dict_type](
  File "/home/pai/lib/python3.9/site-packages/torch/distributed/fsdp/_state_dict_utils.py", line 271, in _full_pre_state_dict_hook
    _common_unshard_pre_state_dict_hook(
  File "/home/pai/lib/python3.9/site-packages/torch/distributed/fsdp/_state_dict_utils.py", line 143, in _common_unshard_pre_state_dict_hook
    _enter_unshard_params_ctx(
  File "/home/pai/lib/python3.9/site-packages/torch/distributed/fsdp/_state_dict_utils.py", line 109, in _enter_unshard_params_ctx
    fsdp_state._unshard_params_ctx[module].__enter__()
  File "/home/pai/lib/python3.9/contextlib.py", line 119, in __enter__
    return next(self.gen)
  File "/home/pai/lib/python3.9/site-packages/torch/distributed/fsdp/_unshard_param_utils.py", line 186, in _unshard_fsdp_state_params
    assert (
AssertionError: Expects the handle training to be IDLE but got HandleTrainingState.BACKWARD_PRE
Stack (most recent call last):
  File "/home/pai/lib/python3.9/site-packages/composer/core/engine.py", line 483, in __del__
    self.close()
  File "/home/pai/lib/python3.9/site-packages/composer/core/engine.py", line 512, in close
    self._close(self.state, self.logger)
  File "/home/pai/lib/python3.9/site-packages/composer/core/engine.py", line 529, in _close
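For context, the assertion at the bottom of the traceback comes from FSDP's unshard-parameters context: gathering a full `state_dict()` is only legal while the FSDP parameter handle is IDLE, i.e. when no forward/backward pass is in flight. Here, the CheckpointSaver fired during teardown while a backward pass was still pending (state `BACKWARD_PRE`). A torch-free sketch of the guard is below; `HandleTrainingState` and `unshard_params_for_state_dict` are illustrative mocks, not the real FSDP internals:

```python
from enum import Enum, auto

class HandleTrainingState(Enum):
    # Mirrors the per-handle states FSDP tracks during a training step
    IDLE = auto()
    FORWARD = auto()
    BACKWARD_PRE = auto()
    BACKWARD_POST = auto()

def unshard_params_for_state_dict(state: HandleTrainingState) -> str:
    # Illustrative stand-in for the check in torch's
    # _unshard_fsdp_state_params: a full state dict may only be
    # gathered between training steps, when the handle is IDLE.
    assert state is HandleTrainingState.IDLE, (
        f"Expects the handle training to be IDLE "
        f"but got HandleTrainingState.{state.name}"
    )
    return "params unsharded"
```

Calling `state_dict()` from a callback that runs while a step is still mid-backward (for example, engine cleanup after a crash partway through a batch) reproduces exactly the assertion message above, which suggests the CheckpointSaver error is a symptom of an earlier failure rather than the root cause.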

Has anyone encountered this issue before? Why do I always hit this error after a little over 300 batches when training with eight GPUs on a single machine, even though I definitely have enough training data?

xiamengzhou commented 7 months ago

@rzr002 It seems you solved the issue. Would you mind sharing what it was?