lhotse-speech / lhotse

Tools for handling speech data in machine learning projects.
https://lhotse.readthedocs.io/en/latest/
Apache License 2.0

It takes too long for DynamicBucketingSampler to load state dict #1327

Open Mahaotian1 opened 2 months ago

Mahaotian1 commented 2 months ago

When I resumed training on 30,000 hours of data from a checkpoint, it took a long time (more than 2 hours) to load the state dict for DynamicBucketingSampler. Is this normal?

Here is my code:

import logging

from lhotse.dataset import DynamicBucketingSampler

train_sampler = DynamicBucketingSampler(
    cuts_train,
    max_duration=self.args.max_duration,
    shuffle=self.args.shuffle,
    buffer_size=self.args.buffer_size,                  # 40000
    shuffle_buffer_size=self.args.shuffle_buffer_size,  # 100000
    quadratic_duration=10,
    num_cuts_for_bins_estimate=10000,
    drop_last=True,
)
logging.info("Loading sampler state dict")
train_sampler.load_state_dict(sampler_state_dict)
pzelasko commented 2 months ago

Unfortunately, yes. Restoring the sampler's state quickly is quite tricky, and I don't recommend using this technique with large data. Instead, it's easier to discard the sampler state and change the random seed to re-shuffle the training data.
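
A minimal sketch of that alternative (seed is a regular DynamicBucketingSampler argument; num_resumes is a hypothetical counter the training script would persist across restarts):

# Build the sampler with a different seed on each resume so the data
# order changes, instead of calling load_state_dict() on it.
train_sampler = DynamicBucketingSampler(
    cuts_train,
    max_duration=self.args.max_duration,
    shuffle=self.args.shuffle,
    buffer_size=self.args.buffer_size,
    shuffle_buffer_size=self.args.shuffle_buffer_size,
    quadratic_duration=10,
    num_cuts_for_bins_estimate=10000,
    drop_last=True,
    seed=self.args.seed + num_resumes,  # num_resumes: hypothetical resume counter
)
# The checkpoint's sampler state is discarded on purpose.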

Mahaotian1 commented 2 months ago

Thank you for your reply. I have another question: when training on large-scale data, I use load_manifest_lazy to read the data and draw every batch from it. Will this eventually fill up CPU memory?

pzelasko commented 2 months ago

No, CPU RAM usage should be bounded by the buffer_size setting in the sampler.
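
For reference, a minimal sketch of the lazy-loading pattern in question (the manifest path is a placeholder):

from lhotse import load_manifest_lazy

# The manifest is opened lazily: cuts are streamed from disk as the
# sampler consumes them, so roughly buffer_size cuts (plus the shuffle
# buffer) are held in CPU RAM at any one time.
cuts_train = load_manifest_lazy("data/cuts_train.jsonl.gz")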

Mahaotian1 commented 2 months ago

Why does CPU memory keep increasing during training until it is full? Is it a problem with the HDF5 files? How can I free the memory?

pzelasko commented 2 months ago

Are you using HDF5 files? We have a workaround in the ASR dataset class, but IIRC it only slows down the memory leak. You can try the Lhotse Shar format instead, or LilcomChunkyWriter; both are free from these issues. For large data, Lhotse Shar is recommended as it is much more I/O-efficient.
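
A minimal sketch of both suggestions (the paths, the Fbank extractor, and the fields mapping are illustrative assumptions, not taken from this thread):

from lhotse import CutSet, Fbank, LilcomChunkyWriter

cuts = CutSet.from_file("data/cuts_train.jsonl.gz")  # placeholder manifest

# Option 1: store features with LilcomChunkyWriter rather than HDF5.
cuts = cuts.compute_and_store_features(
    extractor=Fbank(),
    storage_path="data/fbank",
    storage_type=LilcomChunkyWriter,
)

# Option 2: export to Lhotse Shar (sharded tar files, read sequentially).
cuts.to_shar("data/shar", fields={"features": "lilcom"}, shard_size=1000)

# Read it back lazily for training:
cuts_train = CutSet.from_shar(in_dir="data/shar")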