open-mmlab / mmpose

OpenMMLab Pose Estimation Toolbox and Benchmark.
https://mmpose.readthedocs.io/en/latest/
Apache License 2.0

[Bug] Setting a large batch_size triggers "too many open files" #2595

Closed: TDIT-haha closed this issue 1 year ago

TDIT-haha commented 1 year ago


Environment

mmcv==2.0.0 mmpose==1.0.0 mmengine==0.7.3

Reproduces the problem - code sample

Modify the batch_size parameter in config/body_2d_keypoint/rtmpose/coco/rtmpose-t_8xb256-420e_coco-256x192.py, which is launched via ./tools/dist_train.sh

The data loader section of the config:

train_dataloader = dict(
    batch_size=512,
    num_workers=4,
    persistent_workers=True,
    sampler=dict(type='DefaultSampler', shuffle=True),
    dataset=dict(
        type=dataset_type,
        data_root=data_root,
        data_mode=data_mode,
        ann_file='annotations/person_keypoints_train2017.json',
        data_prefix=dict(img='train2017/'),
        pipeline=train_pipeline,
    ))

Reproduces the problem - command or script

CUDA_VISIBLE_DEVICES=0,1,2,3 PORT=22501 bash ./tools/dist_train.sh config/body_2d_keypoint/rtmpose/coco/rtmpose-t_8xb256-420e_coco-256x192.py 4

Reproduces the problem - error message

RuntimeError: unable to open shared memory object in read-write mode: Too many open files (24). This appears as soon as I set batch_size to 512 or larger when running mmpose. With other frameworks I have trained other tasks at batch_size=512 without problems; only mmpose raises this error. Lowering batch_size makes training run normally, but then the GPUs are under-utilized. I could not find a working solution in the existing issues, so: what is this problem, what causes it, and how do I fix it?
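For context on the error: with PyTorch's default file_descriptor sharing strategy, each tensor that DataLoader workers pass to the main process consumes a file descriptor, so a large batch_size combined with several workers per GPU can exhaust the per-process limit. A minimal sketch for checking that limit (the printed values are illustrative):

import resource

# Query the per-process open-file limits; the RuntimeError above means the
# soft limit was exhausted (errno 24, EMFILE).
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
print(f'open files: soft={soft}, hard={hard}')  # soft is often 1024 by default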

Additional information

  1. The expectation is that training runs normally with batch_size=512.
  2. I have no leads on this problem so far.
mm-assistant[bot] commented 1 year ago

We recommend using English or both English and Chinese for issues so that we can have a broader discussion.

Ben-Louis commented 1 year ago

Hi, following https://github.com/pytorch/pytorch/issues/11201, you can try adding the following at the top of tools/train.py:

import torch.multiprocessing
torch.multiprocessing.set_sharing_strategy('file_system')

and then start training again.
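Background on this suggestion: the default strategy on Linux, file_descriptor, passes one descriptor per shared tensor between processes, while file_system backs shared tensors with named files under /dev/shm instead, which sidesteps the descriptor limit. A minimal sketch for inspecting which strategy is active:

import torch.multiprocessing as mp

# 'file_descriptor' is the Linux default; 'file_system' avoids the fd limit
# but can leave files behind in /dev/shm if a process dies uncleanly.
print(mp.get_sharing_strategy())
print(mp.get_all_sharing_strategies())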

TDIT-haha commented 10 months ago

But I found that this easily leads to memory leaks. If I don't use file_system, is there another way?
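For later readers, a commonly suggested alternative (an assumption on my part, not something confirmed by the maintainers in this thread) is to keep the default file_descriptor strategy and instead raise the per-process open-file limit, either with ulimit -n in the shell before launching dist_train.sh, or programmatically at the top of tools/train.py:

import resource

# Raise the soft open-file limit up to the hard limit; roughly equivalent to
# running `ulimit -n <hard>` in the shell before training. Assumes Linux/macOS.
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
resource.setrlimit(resource.RLIMIT_NOFILE, (hard, hard))

Reducing num_workers also lowers descriptor usage, at the cost of data-loading throughput.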

YiweiWang2000 commented 2 months ago

I ran into this problem as well. Has a solution been found?