open-mmlab / mmsegmentation

OpenMMLab Semantic Segmentation Toolbox and Benchmark.
https://mmsegmentation.readthedocs.io/en/main/
Apache License 2.0

Resume hangs #3671

Open zdk258 opened 1 month ago

zdk258 commented 1 month ago

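When resuming a model, training hangs without reporting any error; starting training from scratch works fine. Setting num_workers to 1 does not help either.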

05/17 11:07:53 - mmengine - INFO - resumed epoch: 0, iter: 32500
05/17 11:07:53 - mmengine - WARNING - "FileClient" will be deprecated in future. Please use io functions in https://mmengine.readthedocs.io/en/latest/api/fileio.html#file-io
05/17 11:07:53 - mmengine - WARNING - "HardDiskBackend" is the alias of "LocalBackend" and the former will be deprecated in future.
05/17 11:07:53 - mmengine - WARNING - Advance dataloader 32500 steps to skip data that has already been trained

ShenZheng2000 commented 1 month ago

Same Question.

chtzs commented 1 month ago

After checking the mmengine source code, I found that it simply calls next() to skip the training data that has already been consumed. In mmengine/runner/loops.py, IterBasedTrainLoop:

        if self._iter > 0:
            print_log(
                f'Advance dataloader {self._iter} steps to skip data '
                'that has already been trained',
                logger='current',
                level=logging.WARNING)
            for _ in range(self._iter):
                next(self.dataloader_iterator)

In other words, "--resume" loads data just like regular training but discards every batch until it reaches the saved iteration. Reaching that iteration therefore still costs the full data-loading time of all the skipped steps, so resuming is not much faster than starting a new training session, and with 32500 steps to skip it looks like a hang.
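To make the cost of that skip loop concrete, here is a minimal sketch that does not use mmengine at all. CountingDataset, infinite_batches and advance are hypothetical names invented for illustration; only the next()-based loop mirrors the snippet quoted above. It shows that every discarded step still triggers a real sample load.

```python
import time


class CountingDataset:
    """Toy dataset (hypothetical) that records how many samples were loaded."""

    def __init__(self, size, load_seconds=0.0):
        self.size = size
        self.load_seconds = load_seconds
        self.loads = 0

    def __len__(self):
        return self.size

    def __getitem__(self, index):
        self.loads += 1
        # Stand-in for disk reads and augmentation in a real pipeline.
        time.sleep(self.load_seconds)
        return index


def infinite_batches(dataset):
    """Yield samples forever, cycling over the dataset."""
    while True:
        for i in range(len(dataset)):
            yield dataset[i]


def advance(iterator, steps):
    """Same pattern as the quoted loop: discard `steps` items via next()."""
    for _ in range(steps):
        next(iterator)


dataset = CountingDataset(size=100)
batches = infinite_batches(dataset)
advance(batches, steps=250)
first_batch_after_resume = next(batches)

print(dataset.loads)             # 251: 250 discarded loads plus the one kept
print(first_batch_after_resume)  # 50: position 250 in a cycle of length 100
```

With a non-zero load_seconds, the wait before the first useful batch grows linearly with the number of skipped steps, which is consistent with the behaviour reported in this issue.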

ShenZheng2000 commented 1 month ago

I discovered that using a lower version of mmengine helps resolve the issue. For example:

mim install mmengine==0.10.2

chtzs commented 1 month ago

I think this PR is the cause of the problem: https://github.com/open-mmlab/mmengine/pull/1471

ShenZheng2000 commented 1 month ago

@chtzs Thanks!

Saillxl commented 1 month ago

> @chtzs Thanks!

I don't understand how to solve it. Can you tell me? Thanks a lot!

Outlying3720 commented 1 month ago

> @chtzs Thanks!
>
> I don't understand how to solve it. Can you tell me? Thanks a lot!

Just comment out these lines.
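As a rough illustration of what commenting out the skip loop trades away, the sketch below uses a plain Python iterator in place of a real dataloader. resume_batches, make_iterator and the skip_consumed_data flag are hypothetical names for this example and are not part of mmengine. Without the skip, resuming is immediate, but the iterator restarts from its beginning, so the data order presumably differs from an uninterrupted run.

```python
def resume_batches(make_iterator, resumed_iter, skip_consumed_data):
    """Return (iterator, number_of_batches_discarded) for a resumed run.

    skip_consumed_data=True imitates the quoted loop; False imitates
    commenting that loop out. Hypothetical helper, not mmengine API.
    """
    iterator = make_iterator()
    discarded = 0
    if skip_consumed_data:
        for _ in range(resumed_iter):
            next(iterator)
            discarded += 1
    return iterator, discarded


def make_iterator():
    # Stand-in for a dataloader: batch k is simply the integer k.
    return iter(range(1_000_000))


it_skip, n_skip = resume_batches(make_iterator, 32500, skip_consumed_data=True)
it_fast, n_fast = resume_batches(make_iterator, 32500, skip_consumed_data=False)

next_skip = next(it_skip)  # continues where the interrupted run stopped
next_fast = next(it_fast)  # starts again from the first batch

print(n_skip, next_skip)  # 32500 32500
print(n_fast, next_fast)  # 0 0
```

Whether repeating early batches matters depends on the training setup; for shuffled, iteration-based training it is often acceptable, but that is a judgment call rather than something this thread establishes.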

chtzs commented 3 weeks ago

> @chtzs Thanks!
>
> I don't understand how to solve it. Can you tell me? Thanks a lot!

@Saillxl The solution can be found in this issue: https://github.com/open-mmlab/mmengine/issues/1520