intelligent-machine-learning / dlrover

DLRover: An Automatic Distributed Deep Learning System

client.connect(path) error when saving checkpoint #1337

Open atomrun39 opened 4 hours ago

atomrun39 commented 4 hours ago

When I save checkpoints with DLRover, the following error always occurs:

[2024-11-15 12:30:37,876] [INFO] [engine.py:131:start_saver_process] Start a process to asynchronously save checkpoint.
[2024-11-15 12:30:37,879] [INFO] [engine.py:299:_notify_agent_to_create_saver] Notify agent to create a checkpoint saver using: {'module_path': 'dlrover.python.elastic_agent.torch.ckpt_saver', 'class_name': 'DeepSpeedCheckpointSaver', 'kwargs': {'checkpoint_dir': '/work/share/chenyd/finetune/ChatGLM2-6B/model/checkpoints_out/ALL/original/chatglm2-6b/checkpoint-15', 'storage_meta': ClassMeta(module_path='dlrover.python.common.storage', class_name='PosixDiskStorage', kwargs={}), 'local_shard_num': 8, 'global_shard_num': 16, 'save_timeout': 600}}.
[2024-11-15 12:30:37,879] [WARNING] [multi_process.py:91:_create_socket_client] Unexpected error when creating socket client by path: /tmp/ckpt_sock/1857279191730585602/sharedqueue_factory.sock, error: [Errno 2] No such file or directory
Traceback (most recent call last):
  File "/usr/local/python3.9.10/lib/python3.9/site-packages/dlrover/python/common/multi_process.py", line 89, in _create_socket_client
    client.connect(path)
FileNotFoundError: [Errno 2] No such file or directory
[2024-11-15 12:30:37,895] [INFO] [ckpt_saver.py:451:_factory] Start the checkpoint saver factory.
/usr/local/python3.9.10/lib/python3.9/site-packages/dlrover/python/common/multi_process.py:48: ResourceWarning: unclosed <socket.socket fd=116, family=AddressFamily.AF_UNIX, type=SocketKind.SOCK_STREAM, proto=0>
  time.sleep(1)
ResourceWarning: Enable tracemalloc to get the object allocation traceback
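
For reference, `[Errno 2] No such file or directory` is what Python raises when `connect()` targets a Unix domain socket path that no process has bound yet. Below is a minimal standard-library sketch of that behavior. `connect_with_retry` and its parameters are hypothetical names used only for illustration; they are not part of DLRover, and this does not claim to describe how `multi_process.py` handles the case internally.

```python
import socket
import time


def connect_with_retry(path, retries=5, delay=0.1):
    """Connect to a Unix domain socket, retrying while no server is bound.

    Hypothetical helper for illustration; not a DLRover API.
    `retries` is assumed to be at least 1.
    """
    last_error = None
    for _ in range(retries):
        client = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
        try:
            client.connect(path)
            return client
        except (FileNotFoundError, ConnectionRefusedError) as err:
            # FileNotFoundError: the socket file has not been created yet.
            # ConnectionRefusedError: the file exists but nothing is listening.
            client.close()  # close explicitly so no ResourceWarning is emitted
            last_error = err
            time.sleep(delay)
    raise last_error
```

If the path is bound by a server while the loop is still retrying, the call returns a connected socket; if it never appears, the original `FileNotFoundError` is re-raised after the last attempt.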

The code used is as follows:

    checkpointer = DeepSpeedCheckpointer(model, output_dir)
    result = checkpointer.save_checkpoint(
        output_dir,
        tag=self.state.global_step,
        storage_type=StorageType.DISK,
    )

How can I solve this problem? I would really appreciate a reply.

atomrun39 commented 4 hours ago

In addition, warnings like the following always appear while a checkpoint is being saved. How can I eliminate them?

/usr/local/python3.9.10/lib/python3.9/site-packages/dlrover/python/common/multi_process.py:271: ResourceWarning: unclosed <socket.socket fd=127, family=AddressFamily.AF_UNIX, type=SocketKind.SOCK_STREAM, proto=0, laddr=/tmp/ckpt_sock/1857345181448912897/sharedlock_shm_lock_1.sock>
  connection, _ = self._server.accept()
ResourceWarning: Enable tracemalloc to get the object allocation traceback
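
For context, CPython emits this `ResourceWarning` when a socket object is garbage-collected without having been closed. The sketch below reproduces that with the standard library and shows two generic options: closing the socket deterministically, or filtering the warning category. The function names are hypothetical and not part of DLRover. Since the sockets in the log are opened inside DLRover, a filter in user code would only hide the message, not close the socket.

```python
import gc
import socket
import warnings


def leak_socket():
    """Create a socket and drop it without closing (triggers the warning)."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    del sock


def open_and_close_cleanly():
    """Create a socket inside a context manager so it is always closed."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        return sock.fileno()


def count_resource_warnings(func, action="always"):
    """Run func and count the ResourceWarnings recorded under the given filter."""
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter(action, ResourceWarning)
        func()
        gc.collect()  # make sure dropped objects are finalized
    return sum(issubclass(w.category, ResourceWarning) for w in caught)
```

Calling `count_resource_warnings(leak_socket, action="ignore")` corresponds to placing `warnings.simplefilter("ignore", ResourceWarning)` at the top of a training script, which suppresses the message without changing how the sockets are handled.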