tiansiyuan closed this 12 months ago
It was fine with 1.2.15 and 1.2.16.
Reproduced with 1.2.18.
I've also been encountering this. It might be drop_last on dataloader when the batch_size doesn't perfectly divide your dataset size, considering the issue came up in 1.2.17, but I haven't gotten the chance to read how exactly drop_last interacts with things yet
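To make the drop_last hypothesis concrete: a PyTorch DataLoader over a map-style dataset yields floor(n / batch_size) batches per epoch when drop_last=True, and ceil(n / batch_size) otherwise. The sketch below reproduces only that arithmetic in plain Python. num_batches is a hypothetical helper, not part of PyTorch or audiolm-pytorch, and the sample counts are made up. It shows how a small split can end up with zero batches. It does not explain how a None comes out of the iterator.

```python
import math

def num_batches(num_samples: int, batch_size: int, drop_last: bool) -> int:
    """Batches per epoch for a map-style dataset (hypothetical helper)."""
    if drop_last:
        # the trailing partial batch is discarded
        return num_samples // batch_size
    # the trailing partial batch is kept
    return math.ceil(num_samples / batch_size)

print(num_batches(10, 4, drop_last=False))  # 3
print(num_batches(10, 4, drop_last=True))   # 2
print(num_batches(1, 4, drop_last=True))    # 0
```

With a single sample and a batch size above 1, drop_last=True leaves nothing to iterate over.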
@LWprogramming @tiansiyuan oh that would be strange? are you two seeing this when training on multiple gpus? i've turned it into an option; let me know if turning it off fixes it https://github.com/lucidrains/audiolm-pytorch/commit/d491046de3e4e24e191aa94f98f34bc4c337ac04
yes on multiple gpus. i've been using encodec so i was seeing it in semantic/coarse/fine instead of soundstream since i dont end up training it. Can confirm that they train fine after turning it off now.
@LWprogramming ohh got it, thanks for confirming! this may be an issue with accelerate then
maybe worth exploring whether turning on split batches fixes this
what are your thoughts on why it might allow for drop_last? Based on the source docs it just seems like it's yielding data differently but should still never be None. (From a purely selfish angle, for my personal training runs getting data in any order is fine by me :) there's no special structure to the data that needs to be shuffled away)
I have this issue with single GPU or with CPU.
Reproduced with 1.2.19 and 1.2.20.
Does batch size only affect memory use and training speed?
I changed the batch size and the demo now runs without errors, but a single step takes a very long time on an RTX 4090.
I found the cause: the validation dataset contains only 1 sample, so I set valid_frac=0.0 just to run the demo.
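A rough sketch of how a fractional validation split can leave just one sample. It assumes the trainer derives the train size as int((1 - valid_frac) * n) and gives the remainder to validation. That is an assumption about the library, not something confirmed in this thread. split_sizes is a hypothetical helper, and the dataset size of 10 and the valid_frac of 0.05 are illustrative only.

```python
from typing import Tuple

def split_sizes(num_samples: int, valid_frac: float) -> Tuple[int, int]:
    """Return (train_size, valid_size) for a fractional split (assumed formula)."""
    train_size = int((1 - valid_frac) * num_samples)
    valid_size = num_samples - train_size
    return train_size, valid_size

print(split_sizes(10, 0.05))  # (9, 1): a single validation sample
print(split_sizes(10, 0.0))   # (10, 0): no separate validation split
```

A validation split of one sample is smaller than any batch size above 1, which is the situation described above.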
@seaniezhao thanks for clueing us in! @tiansiyuan i've added a few new error messages in an updated version
do you want to see if it triggers before starting training?
Yes, I verified with version 1.2.21 that it gives:
AssertionError: dataset must have sufficient samples for training
when I use dataset_folder = "placeholder_dataset"
When I switch to dataset_folder = "dev-clean", training works ok:
training with dataset of 2567 samples and validating with randomly splitted 136 samples ...... training complete
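For anyone who wants to catch this before constructing a trainer, here is a minimal pre-flight check in the same spirit. check_split_sizes is a hypothetical helper. The exact condition the library asserts is not shown in this thread, so comparing each split against batch_size is an assumption, as is the batch size of 4.

```python
def check_split_sizes(train_size: int, valid_size: int, batch_size: int) -> None:
    """Fail early if either split cannot fill one batch (assumed condition)."""
    assert train_size >= batch_size, 'dataset must have sufficient samples for training'
    assert valid_size >= batch_size, (
        f'validation split has only {valid_size} sample(s), fewer than batch_size = {batch_size}'
    )

# sizes reported in the log above; batch_size = 4 is illustrative
check_split_sizes(2567, 136, batch_size=4)  # passes silently

try:
    check_split_sizes(9, 1, batch_size=4)
except AssertionError as err:
    print(err)
```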
hurray, thank you @seaniezhao
Hi
When I run the cell under SoundStream in the notebook audiolm_pytorch_demo.ipynb, I get:
TypeError                                 Traceback (most recent call last)
Cell In [5], line 19
      6 trainer = SoundStreamTrainer(
      7     soundstream,
      8     folder = dataset_folder,
   (...)
     14     num_train_steps = 9
     15 ) #.cuda()
     16 # NOTE: I changed num_train_steps to 9 (aka 8 + 1) from 10000 to make things go faster for demo purposes
     17 # adjusting save_*_every variables for the same reason
---> 19 trainer.train()

File /opt/conda/lib/python3.8/site-packages/audiolm_pytorch/trainer.py:552, in SoundStreamTrainer.train(self, log_fn)
    549 def train(self, log_fn = noop):
    551     while self.steps < self.num_train_steps:
--> 552         logs = self.train_step()
    553         log_fn(logs)
    555     self.print('training complete')

File /opt/conda/lib/python3.8/site-packages/audiolm_pytorch/trainer.py:420, in SoundStreamTrainer.train_step(self)
    417 # update vae (generator)
    419 for _ in range(self.grad_accum_every):
--> 420     wave, = next(self.dl_iter)
    421     wave = wave.to(device)
    423     loss, (recon_loss, multi_spectral_recon_loss, adversarial_loss, feature_loss, all_commitment_loss) = self.soundstream(wave, return_loss_breakdown = True)

TypeError: cannot unpack non-iterable NoneType object
How to solve this problem?
Thanks,
Tian
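For readers landing here from a search: the final TypeError is what Python raises whenever a single-element unpack (wave, = ...) receives None instead of a tuple or list. The snippet below reproduces the message in isolation and shows one defensive way to fail with a clearer error. unpack_wave is a hypothetical helper, not part of audiolm-pytorch, and the string stands in for a real tensor.

```python
def unpack_wave(batch):
    """Unpack a one-element batch, failing loudly if the iterator yielded None."""
    if batch is None:
        raise ValueError('dataloader yielded None; check dataset size, valid_frac and batch_size')
    wave, = batch  # same single-element unpack as in the traceback
    return wave

# reproduce the error from the traceback
try:
    wave, = None
except TypeError as err:
    print(err)  # cannot unpack non-iterable NoneType object

print(unpack_wave(('fake-wave',)))  # fake-wave
```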