lhotse-speech / lhotse

Tools for handling speech data in machine learning projects.
https://lhotse.readthedocs.io/en/latest/
Apache License 2.0

Take much time for train_dl.sampler to load state dict #750

Open luomingshuang opened 2 years ago

luomingshuang commented 2 years ago

In https://github.com/k2-fsa/icefall/pull/413, when I try to use a checkpoint-xxx.pt to continue training (with the newest lhotse version), I find that load_state_dict for the sampler takes a long time.
As we can see from the two pictures below, the string ("loading 2...") does not appear quickly. I then used py-spy to profile the process, and it shows that the process is iterating over the data. The sampler can't skip directly to the designated batch_idx.

[two screenshots of the training log]

https://user-images.githubusercontent.com/37799481/173982106-a573075f-c3d0-4e51-bb4c-88a02f20663f.mp4

luomingshuang commented 2 years ago

I remember I ran into this problem before, and we also discussed it in https://github.com/k2-fsa/icefall/pull/314#issuecomment-1104645686.

danpovey commented 2 years ago

What is your lhotse version? It's printed out at the top of the training log.

luomingshuang commented 2 years ago

My lhotse version is /ceph-meixu/luomingshuang/anaconda3/envs/k2-python/lib/python3.8/site-packages/lhotse-1.4.0.dev0+git.94e9ed9.clean-py3.8.egg/lhotse/__init__.py.

danpovey commented 2 years ago

To make it easier to debug, can you print out the contents of the sampler state_dict? You can load the .pt file into the python command line with torch.load().

luomingshuang commented 2 years ago

OK.

luomingshuang commented 2 years ago
>>> a= torch.load('checkpoint-88000.pt')
>>> a['sampler']
{'epoch': 0, 'drop_last': True, 'world_size': 8, 'rank': 0, 'seed': 0, 'shuffle': True,
 'diagnostics': {'current_epoch': 0, 'stats_per_epoch': {0: {'epoch': 0, 'kept_cuts': 39638006,
 'discarded_cuts': 0, 'kept_batches': 703984, 'discarded_batches': 0}}}, 'max_duration': 150, 
'max_cuts': None, 'consistent_ids': True, 'buffer_size': 30000, 'num_cuts_for_bins_estimate': 10000, 
'shuffle_buffer_size': 20000, 'strict': True}
>>>
>>> a= torch.load('epoch-0.pt')
>>> a['sampler']
{'epoch': 0, 'drop_last': True, 'world_size': 8, 'rank': 0, 'seed': 0, 'shuffle': True, 
'diagnostics': {'current_epoch': 0, 'stats_per_epoch': {0: {'epoch': 0, 'kept_cuts': 39735792,
 'discarded_cuts': 0, 'kept_batches': 706146, 'discarded_batches': 0}}}, 'max_duration': 150, 
'max_cuts': None, 'consistent_ids': True, 'buffer_size': 30000, 'num_cuts_for_bins_estimate': 10000, 
'shuffle_buffer_size': 20000, 'strict': True}
>>>
pzelasko commented 2 years ago

Hmm, it looks like the code has to iterate over 39.6M cuts to reach the point at which your training checkpoint was saved. It makes sense to me that it takes a while to do that.

To solve it, we'd basically want to somehow store the offset into the manifest file being read, but since we're shuffling dynamically, that's not viable. Unless we were willing to store shuffle_buffer_size cuts in the sampler checkpoint… then maybe it could work.

Any thoughts or suggestions?

danpovey commented 2 years ago

As long as it's only iterating over the metadata, I think the current design is OK.

chenguoguo commented 1 year ago

I recently ran into the same issue with larger datasets. I was trying to use --start-batch from icefall to resume training, but it kept loading data for more than 10 hours without kicking off the training. I had to copy the checkpoint to something like epoch-1.pt and start from a new epoch.

I think it makes sense to come up with some fix for this issue, even if it has to sacrifice repeatability; for larger datasets this probably won't make much of a difference. Thoughts? @pzelasko

danpovey commented 1 year ago

Moving my duplicate issue here; my comment is below.

@pzelasko is there a way to efficiently fast-forward a data loader for the purpose of resuming runs with --start-batch? Someone was asking about this; they tried the --start-batch option in one of our setups that uses DynamicBucketingSampler, but it actually loads the data, and that takes too long. This seems to me like a serious problem if you have a really huge amount of data, since training just sits there and wastes the GPU for a long time. But I'm guessing it might be hard to fix, as there is no way to tell the worker process not to actually load the data this time.

I'd be more than willing to sacrifice exact repeatability to fix this issue, or move to a different type of data loader. BTW, for a language modeling task I've been using a data loader that is i.i.d. and makes no attempt to avoid selecting the same sample twice, and I found it a very nice solution: the code becomes very simple, and fast-forwarding the data is also very simple if you aren't too bothered about exactly reproducing what would have happened if you had kept training. (There is really no state to restore, and nothing to do other than setting a different random seed from what you would have used anyway, to avoid seeing the exact same data twice.)

My concrete suggestion is as follows: how about creating a data loader that cuts the Gordian knot of complexity by not ensuring that the same data is loaded on every epoch, or even that there are no duplicated pieces of data within an epoch or within a batch. Just make everything i.i.d., so the only way epochs differ is that they have a different random seed. That code would be super easy to maintain.
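(For illustration only; this is not lhotse code and the names are invented.) A minimal sketch of the i.i.d. idea: the only per-run state is the random seed, so resuming just means restarting with a different seed.

import random

def iid_batches(cuts, batch_size, seed):
    # Every batch is an independent random draw (with replacement) from the
    # full cut list, so there is no sampler state to save or restore.
    rng = random.Random(seed)
    while True:
        yield [rng.choice(cuts) for _ in range(batch_size)]

# Resuming a run: just pass a new seed, e.g. iid_batches(cuts, 32, seed=run_idx).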

pzelasko commented 1 year ago

I agree with both of you, and I generally found that the simplest way to handle this is to resume the training with a different random seed. I recommend ditching sampler.state_dict/load_state_dict and just changing the seed.
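A sketch of that resume strategy, using settings that appear in the sampler state dict above (the manifest path and resume_count are placeholders, not icefall code): build a fresh sampler with a new seed and skip sampler.load_state_dict().

from lhotse import load_manifest_lazy
from lhotse.dataset import DynamicBucketingSampler

train_cuts = load_manifest_lazy("cuts.jsonl.gz")  # placeholder path

resume_count = 3  # e.g. how many times this run has been restarted
sampler = DynamicBucketingSampler(
    train_cuts,
    max_duration=150,   # keep the original run's settings
    shuffle=True,
    drop_last=True,
    seed=resume_count,  # a different seed on every resume
)
# Intentionally do NOT call sampler.load_state_dict(checkpoint["sampler"]).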

chenguoguo commented 1 year ago

@pzelasko Let me make sure I understand this correctly. As a quick fix, Piotr, you are suggesting that when we resume from a certain checkpoint, we initialize DynamicBucketingSampler with a random seed instead of the default seed 0, skip loading the sampler state dict, and start training. This way, each time we resume, we start from a random place in the training data. Is this correct?

pzelasko commented 1 year ago

Yes, but there's a caveat -- since DynamicBucketingSampler reads the manifests sequentially from top to bottom, the setting of shuffle_buffer_size is going to matter: the higher you set it, the more randomness you will observe. You can probably improve the randomness further by opening the cut set like load_manifest_lazy("pipe:gunzip -c cuts.jsonl.gz | shuf") -- I think shuf is hard to beat in randomness quality when doing streaming shuffling. Also, if you open it that way, every time you start the script or start iterating the cut set again, shuf will automatically change its seed, so it may be close to the optimal solution.
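A small usage sketch of the pipe trick above (the manifest path is a placeholder):

from lhotse import load_manifest_lazy

# shuf re-seeds itself on every invocation, so each restart of the script
# streams the manifest in a different order.
cuts = load_manifest_lazy("pipe:gunzip -c cuts.jsonl.gz | shuf")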

As a side note, this problem is solved elegantly if you're using sharding (i.e. the dataset is partitioned into N chunks of M examples each), because we can also shuffle the list of shards. This is currently only supported for the Lhotse Shar and WebDataset formats with sequential I/O, but if there's interest, I think the support could easily be added for regular manifests as well. This is described in the tutorials in the Lhotse documentation.
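A rough sketch of what a Shar-based workflow could look like; the exact argument names (shard_size, in_dir, shuffle_shards, the fields mapping) are assumptions to verify against the Shar tutorials and your lhotse version.

from lhotse import CutSet

cuts = CutSet.from_jsonl_lazy("cuts.jsonl.gz")  # placeholder path
# Export the data into shards of e.g. 1000 cuts each.
cuts.to_shar("data/shar", fields={"recording": "wav"}, shard_size=1000)

# Read it back with shard-level shuffling: the list of shards is permuted,
# so starting over with a new seed cheaply changes the data order.
train_cuts = CutSet.from_shar(in_dir="data/shar", shuffle_shards=True, seed=42)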

chenguoguo commented 1 year ago

Thanks @pzelasko ! I have a huge cuts.jsonl.gz file (hundreds of gigabytes) so I'll see how it works with shuf. I'll also take a look at sharding

danpovey commented 1 year ago

Guoguo, I hope you mean your data is hundreds of GB and not your actual gzipped manifest? It's hard to believe it could get that big.

Piotr, I think I can see a way to do this efficiently enough, even when using existing manifests. Suppose you have some cuts.jsonl.gz as your manifest for the training data, and you have previously shuffled it.
This could be split into pieces of, say, 200 lines each, as cuts_split_200/{000,001,..}/cuts_{000,001,..}.jsonl.gz, creating a file cuts_split_200_manifests.txt containing the filename of each of the split files, which could perhaps be an alternative way to load a CutSet?

I'm afraid I'm not familiar enough with how the data loaders work to be able to flesh out how that would work.

csukuangfj commented 1 year ago

Thanks @pzelasko ! I have a huge cuts.jsonl.gz file (hundreds of gigabytes) so I'll see how it works with shuf. I'll also take a look at sharding

If you indeed have a single file that large, please consider splitting your dataset into pieces.

Note that you don't need to combine the splits into one big file. You can refer to the stateless3 recipe for librispeech, which uses gigaspeech + librispeech; you can see there how we split gigaspeech.

By the way, I think it would take hours to combine the splits.

pzelasko commented 1 year ago

Piotr, I think I can see a way to do this efficiently enough, even when using existing manifests. Suppose you have some cuts.jsonl.gz as your manifest for the training data, and you have previously shuffled it. This could be split into pieces of, say, 200 lines each, as cuts_split_200/{000,001,..}/cuts_{000,001,..}.jsonl.gz, creating a file cuts_split_200_manifests.txt containing the filename of each of the split files, which could perhaps be an alternative way to load a CutSet?

Exactly, and you can achieve something similar with CutSet.split_lazy. The only missing piece is an appropriate constructor for CutSet that also shuffles the order of the split files; I can add something like that. It'd work with the existing dataloading out of the box.
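For reference, a sketch of the split_lazy step (the path is a placeholder, and 200 cuts per shard follows Dan's example above; check the argument names against your lhotse version):

from lhotse import CutSet

cuts = CutSet.from_jsonl_lazy("cuts.jsonl.gz")
# Writes a sequence of smaller .jsonl.gz shards into cuts_split_200/.
cuts.split_lazy(output_dir="cuts_split_200", chunk_size=200)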

chenguoguo commented 1 year ago

@danpovey sorry, it's the unzipped manifest; it's sitting on disk at around 450G, and the zipped version would be somewhere around 40G. But this is just a portion of the data.

And thanks to both of you for your suggestions, I'll look into them. @danpovey @csukuangfj

pzelasko commented 1 year ago

Please see this PR, where I added a CutSet constructor that should work well with manifest shards: https://github.com/lhotse-speech/lhotse/pull/1085

chenguoguo commented 1 year ago

Great, thanks! @pzelasko