Closed. OswaldoBornemann closed this 9 months ago.

I am going to train a zipformer from a previously trained checkpoint, but on a different dataset. I got the following warning, and training seems to be stuck right after About to create train dataloader:

We detected you're trying to use a CutSampler with rank 2 and world_size 3 inside an IterableDatasetWrapper. Setting rank != 0 and world_size != 1 in Lhotse's CutSampler is intended for map-style datasets, when the sampler exists in the main training loop. Make sure these settings are intentional or pass rank=0 and world_size=1 to the sampler's constructor.
I found the reason: it lies in load_checkpoint_if_available. This function does the following:
# Restore the training position (epoch/batch) recorded in the checkpoint.
if params.start_batch > 0:
    if "cur_epoch" in saved_params:
        params["start_epoch"] = saved_params["cur_epoch"]

    if "cur_batch_idx" in saved_params:
        params["cur_batch_idx"] = saved_params["cur_batch_idx"]
But this is incompatible with the new dataset, because the saved batch index is not consistent with it. When I comment these lines out, model training runs normally.
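If you'd rather not delete those lines outright, one option is to gate the restore behind a flag, so that resuming on the same dataset still works. A rough sketch only; reset_dataloader_state is a hypothetical option, not something icefall currently defines:

# Skip restoring the saved position when fine-tuning on a new dataset, where
# the checkpoint's cur_batch_idx no longer corresponds to the new sampler.
if params.start_batch > 0 and not params.reset_dataloader_state:
    if "cur_epoch" in saved_params:
        params["start_epoch"] = saved_params["cur_epoch"]

    if "cur_batch_idx" in saved_params:
        params["cur_batch_idx"] = saved_params["cur_batch_idx"]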
I think the issue in the first message and the solution in the second are unrelated (although it's good that you found it). I don't see IterableDatasetWrapper generally used for training in Icefall, so it looks like you customized the code. 99% of the time you'd want to follow the message and set rank and world_size to (0, 1); otherwise you will be omitting a ((world_size-1)/world_size * 100%) portion of the training data.
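For reference, here is a minimal sketch of that setup with lhotse (the cuts path, max_duration, and num_workers are placeholder values, not taken from your recipe):

from torch.utils.data import DataLoader

from lhotse import CutSet
from lhotse.dataset import DynamicBucketingSampler, K2SpeechRecognitionDataset
from lhotse.dataset.iterable_dataset import IterableDatasetWrapper

cuts = CutSet.from_file("data/fbank/cuts_train.jsonl.gz")  # placeholder path

# The sampler lives inside the iterable dataset rather than in the main
# training loop, so construct it with rank=0 and world_size=1 as the warning
# suggests; rank != 0 / world_size != 1 are meant for map-style datasets.
sampler = DynamicBucketingSampler(
    cuts, max_duration=200.0, shuffle=True, rank=0, world_size=1
)

dl = DataLoader(
    IterableDatasetWrapper(dataset=K2SpeechRecognitionDataset(), sampler=sampler),
    batch_size=None,  # batching is done by the sampler, not the dataloader
    num_workers=2,
)

In this mode the splitting of data across workers and ranks is handled at the dataloader level rather than by the sampler's rank/world_size arguments, which is what the warning is pointing at.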
Yeah, I get what you mean. However, I have tried upgrading to the latest lhotse (though not the latest versions of k2 or icefall), and this warning still happens.
I suggest that you search the code (using grep/rg/your IDE) for IterableDatasetWrapper usage and adjust the arguments passed to the sampler manually.