open-mmlab / mmpose

OpenMMLab Pose Estimation Toolbox and Benchmark.
https://mmpose.readthedocs.io/en/latest/
Apache License 2.0

[Bug] Setting a large batch_size triggers "too many open files" #2595

Closed: TDIT-haha closed this issue 1 year ago

TDIT-haha commented 1 year ago


Environment

mmcv==2.0.0 mmpose==1.0.0 mmengine==0.7.3

Reproduces the problem - code sample

Modify the batch_size parameter in config/body_2d_keypoint/rtmpose/coco/rtmpose-t_8xb256-420e_coco-256x192.py, which is launched via ./tools/dist_train.sh

The data loader section of the config:

train_dataloader = dict(
    batch_size=512,
    num_workers=4,
    persistent_workers=True,
    sampler=dict(type='DefaultSampler', shuffle=True),
    dataset=dict(
        type=dataset_type,
        data_root=data_root,
        data_mode=data_mode,
        ann_file='annotations/person_keypoints_train2017.json',
        data_prefix=dict(img='train2017/'),
        pipeline=train_pipeline,
    ))

Reproduces the problem - command or script

CUDA_VISIBLE_DEVICES=0,1,2,3 PORT=22501 bash ./tools/dist_train.sh config/body_2d_keypoint/rtmpose/coco/rtmpose-t_8xb256-420e_coco-256x192.py 4

Reproduces the problem - error message

RuntimeError: unable to open shared memory object in read-write mode: Too many open files (24). This appears as soon as I set batch_size to 512 or larger when running mmpose. With other frameworks I have trained other tasks at batch_size=512 without problems; only mmpose raises this error. Lowering batch_size makes training run normally, but then the GPUs are under-utilized. I could not find a working solution in the existing issues, so: what is this problem, what causes it, and how do I fix it?
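For context on the error: with PyTorch's default file_descriptor sharing strategy, each tensor that DataLoader workers pass to the main process consumes a file descriptor, so a large batch_size combined with several workers per GPU can exhaust the per-process limit. A minimal sketch for checking that limit (the printed values are illustrative):

import resource

# Query the per-process open-file limits; the RuntimeError above means the
# soft limit was exhausted (errno 24, EMFILE).
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
print(f'open files: soft={soft}, hard={hard}')  # soft is often 1024 by default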

Additional information

  1. The expectation is that training runs normally with batch_size=512.
  2. I have no leads on this problem so far.
mm-assistant[bot] commented 1 year ago

We recommend using English or both English and Chinese for issues so that we can have a broader discussion.

Ben-Louis commented 1 year ago

Hi, following https://github.com/pytorch/pytorch/issues/11201, you can try adding the following at the top of tools/train.py:

import torch.multiprocessing
torch.multiprocessing.set_sharing_strategy('file_system')

and then start training again.
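Background on this suggestion: the default strategy on Linux, file_descriptor, passes one descriptor per shared tensor between processes, while file_system backs shared tensors with named files under /dev/shm instead, which sidesteps the descriptor limit. A minimal sketch for inspecting which strategy is active:

import torch.multiprocessing as mp

# 'file_descriptor' is the Linux default; 'file_system' avoids the fd limit
# but can leave files behind in /dev/shm if a process dies uncleanly.
print(mp.get_sharing_strategy())
print(mp.get_all_sharing_strategies())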

TDIT-haha commented 10 months ago

But I found that this easily leads to memory leaks. If I don't use file_system, is there another way?
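For later readers, a commonly suggested alternative (an assumption on my part, not something confirmed by the maintainers in this thread) is to keep the default file_descriptor strategy and instead raise the per-process open-file limit, either with ulimit -n in the shell before launching dist_train.sh, or programmatically at the top of tools/train.py:

import resource

# Raise the soft open-file limit up to the hard limit; roughly equivalent to
# running `ulimit -n <hard>` in the shell before training. Assumes Linux/macOS.
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
resource.setrlimit(resource.RLIMIT_NOFILE, (hard, hard))

Reducing num_workers also lowers descriptor usage, at the cost of data-loading throughput.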

YiweiWang2000 commented 2 months ago

I ran into this problem as well. Has a solution been found?