Open zdk258 opened 1 month ago
Same Question.
After checking the mmengine source code, I found that it simply calls next()
to skip the training data that has already been used.
In mmengine\runner\loops.py, IterBasedTrainLoop:
    if self._iter > 0:
        print_log(
            f'Advance dataloader {self._iter} steps to skip data '
            'that has already been trained',
            logger='current',
            level=logging.WARNING)
        for _ in range(self._iter):
            next(self.dataloader_iterator)
In other words, "--resume" loads data exactly as regular training does, but discards every batch until it reaches the iteration stored in the checkpoint. So resuming still pays the data-loading cost of all the skipped iterations, and the time needed to reach the resume point is not much shorter than starting a new training session.
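To make the cost concrete, here is a minimal sketch of the same skipping pattern. It is not mmengine code: `CountingLoader` and `skip_with_next` are made-up names for illustration, and the counter stands in for whatever real work (disk reads, decoding, augmentation) a dataloader does per batch.

```python
class CountingLoader:
    """Hypothetical stand-in for a dataloader; counts the batches it produces."""

    def __init__(self, num_batches):
        self.num_batches = num_batches
        self.loaded = 0  # how many batches have actually been produced

    def __iter__(self):
        for i in range(self.num_batches):
            # In a real loader this is where I/O and preprocessing happen.
            self.loaded += 1
            yield i


def skip_with_next(loader, resume_iter):
    """Advance the iterator the same way the quoted loop does."""
    iterator = iter(loader)
    for _ in range(resume_iter):
        next(iterator)  # the batch is fully produced, then thrown away
    return iterator


loader = CountingLoader(num_batches=1000)
iterator = skip_with_next(loader, resume_iter=300)
first_batch_after_resume = next(iterator)

# 300 batches were produced only to be discarded, plus 1 that is actually used.
print(loader.loaded, first_batch_after_resume)
```

With a large resume iteration and an expensive pipeline, that discarded work is what makes the start of a resumed run slow.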
I discovered that using a lower version of mmengine
helps resolve the issue. For example:
mim install mmengine==0.10.2
I think this is the cause of the problem. Here's the PR. https://github.com/open-mmlab/mmengine/pull/1471
@chtzs Thanks!
I don't understand how to solve it. Can you tell me? Thanks a lot!
Just comment out the lines quoted above that advance the dataloader.
@Saillxl The solution can be found in this issue: https://github.com/open-mmlab/mmengine/issues/1520
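When resuming a model, training hangs without reporting any error; starting training from scratch works fine. Setting num_workers to 1 does not help either.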