lucidrains / audiolm-pytorch

Implementation of AudioLM, a SOTA Language Modeling Approach to Audio Generation out of Google Research, in Pytorch

Training SemanticTransformer with accelerate #152

Open stevenhillis opened 1 year ago

stevenhillis commented 1 year ago

When I try to train the semantic transformer with accelerate (accelerate launch train_semantic.py, where train_semantic.py is lifted directly from the readme), I get

Traceback (most recent call last):
  File "/big/users/steven-hillis/code/tts/audiolm-pytorch/train_semantic.py", line 45, in <module>
    main()
  File "/big/users/steven-hillis/code/tts/audiolm-pytorch/train_semantic.py", line 22, in main
    trainer = SemanticTransformerTrainer(
  File "<@beartype(audiolm_pytorch.trainer.SemanticTransformerTrainer.__init__) at 0x7f3eeda18160>", line 148, in __init__
  File "/big/users/steven-hillis/code/tts/audiolm-pytorch/audiolm_pytorch/trainer.py", line 638, in __init__
    ) = self.accelerator.prepare(
  File "/home/users/steven-hillis/anaconda3/envs/audiolm/lib/python3.10/site-packages/accelerate/accelerator.py", line 1122, in prepare
    result = tuple(
  File "/home/users/steven-hillis/anaconda3/envs/audiolm/lib/python3.10/site-packages/accelerate/accelerator.py", line 1123, in <genexpr>
    self._prepare_one(obj, first_pass=True, device_placement=d) for obj, d in zip(args, device_placement)
  File "/home/users/steven-hillis/anaconda3/envs/audiolm/lib/python3.10/site-packages/accelerate/accelerator.py", line 977, in _prepare_one
    return self.prepare_model(obj, device_placement=device_placement)
  File "/home/users/steven-hillis/anaconda3/envs/audiolm/lib/python3.10/site-packages/accelerate/accelerator.py", line 1202, in prepare_model
    model = torch.nn.parallel.DistributedDataParallel(
  File "/home/users/steven-hillis/anaconda3/envs/audiolm/lib/python3.10/site-packages/torch/nn/parallel/distributed.py", line 657, in __init__
    _sync_module_states(
  File "/home/users/steven-hillis/anaconda3/envs/audiolm/lib/python3.10/site-packages/torch/distributed/utils.py", line 136, in _sync_module_states
    _sync_params_and_buffers(
  File "/home/users/steven-hillis/anaconda3/envs/audiolm/lib/python3.10/site-packages/torch/distributed/utils.py", line 154, in _sync_params_and_buffers
    dist._broadcast_coalesced(
RuntimeError: The size of tensor a (64) must match the size of tensor b (0) at non-singleton dimension 2

My accelerate config should be fine, since the same config works for training the soundstream model, and I get no errors when I run python train_semantic.py for single-GPU training. The problem seems specific to accelerate preparing the SemanticTransformer model.

lucidrains commented 1 year ago

are you using hubert?

stevenhillis commented 1 year ago

Yes, hubert.

from audiolm_pytorch import HubertWithKmeans, SemanticTransformer, SemanticTransformerTrainer

wav2vec = HubertWithKmeans(
    checkpoint_path = './hubert/hubert_base_ls960.pt',
    kmeans_path = './hubert/hubert_base_ls960_L9_km500.bin'
)

semantic_transformer = SemanticTransformer(
    num_semantic_tokens = wav2vec.codebook_size,
    dim = 1024,
    depth = 6,
    # has_condition = True,               # this will have to be set to True
    # cond_as_self_attn_prefix = True     # whether to condition as prefix to self attention, instead of cross attention, as was done in 'VALL-E' paper
).cuda()

trainer = SemanticTransformerTrainer(
    transformer = semantic_transformer,
    wav2vec = wav2vec,
    paths_list_path = '/path/to/training/manifest.txt',
    batch_size = 256,
    grad_accum_every = 1,
    dl_num_workers = 8,
    data_max_length_seconds = 2,
    num_train_steps = 1_000_000,
    force_clear_prev_results = True,
    accelerate_kwargs = {'log_with': 'wandb', 'project_dir': "./runs"},
    results_folder='./results/semantic/'
)

trainer.train()
lucidrains commented 1 year ago

@stevenhillis ok, i'm not sure if these fairseq models are compatible with accelerate

i'll try it out this weekend, and i'll also make sure the transformers can accept pre-encoded semantic token ids, if this is an issue

lucidrains commented 1 year ago

@stevenhillis deepgram is doing generative models now? i thought they were just speech to text?

stevenhillis commented 1 year ago

I chased down the fairseq model idea a little, and I don't think that's it. The pretrained huberts are on huggingface too, and I can get an embedding of the right size from theirs with

# hidden_states return object contains embedding output, followed by outputs of each layer
AutoModel.from_pretrained("facebook/hubert-base-ls960").eval()(inputs, output_hidden_states=True).hidden_states[output_layer + 1]

It doesn't match the output of the fairseq implementation (although those outputs are themselves quite nondeterministic), but more importantly, I get the same error when running train_semantic.py under accelerate with the huggingface hubert as with the fairseq one.
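
For reference, a self-contained version of that check might look like the sketch below. The dummy audio and output_layer = 9 are assumptions (layer 9 chosen to match the L9 kmeans checkpoint above), and the indexing just mirrors the one-liner.

import torch
from transformers import AutoModel

output_layer = 9  # assumed; the L9 kmeans checkpoint is trained on layer-9 features

model = AutoModel.from_pretrained("facebook/hubert-base-ls960").eval()

wav = torch.randn(1, 16000)  # one second of dummy 16 kHz audio, shape (batch, samples)

with torch.no_grad():
    # hidden_states[0] is the embedding output, followed by one entry per transformer layer
    embed = model(wav, output_hidden_states = True).hidden_states[output_layer + 1]

print(embed.shape)  # (1, num_frames, 768) for hubert-base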

Moreover, I also get the same error when trying to launch a train_fine.py script with accelerate.

stevenhillis commented 1 year ago

Transcription is still the core product offering, but the market is ready for more! We'll be doing a bunch of work on generative modeling for speech and text this year. Always hiring researchers!

lucidrains commented 1 year ago

> Transcription is still the core product offering, but the market is ready for more! We'll be doing a bunch of work on generative modeling for speech and text this year. Always hiring researchers!

I'm good thanks. ok will look into it later this weekend

lzl1456 commented 1 year ago

What a pity! I extract the tokens in advance, so I don't need to import the hubert model at all, but this problem still occurs when I use accelerate launch for multi-GPU training of the SemanticTransformer.

Do you have a plan to fix this, @lucidrains @stevenhillis?

lzl1456 commented 1 year ago

I think I solved the problem.

In the Attention(nn.Module) class, only register the null key/value parameter when it is actually needed:

if num_null_kv > 0:
    self.null_kv = nn.Parameter(torch.randn(2, num_null_kv, dim_head))
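
For context, here is a toy sketch of why that guard helps. It is illustrative only, not the actual audiolm-pytorch Attention class; the assumption is that the empty (2, 0, dim_head) parameter created when num_null_kv == 0 is what dist._broadcast_coalesced chokes on in the traceback above.

import torch
from torch import nn

class AttentionSketch(nn.Module):
    # toy single-head attention, not the real audiolm-pytorch Attention class
    def __init__(self, dim = 1024, dim_head = 64, num_null_kv = 0):
        super().__init__()

        # guarded registration: no zero-sized parameter when num_null_kv == 0
        self.null_kv = nn.Parameter(torch.randn(2, num_null_kv, dim_head)) if num_null_kv > 0 else None

        self.to_q = nn.Linear(dim, dim_head, bias = False)
        self.to_kv = nn.Linear(dim, dim_head * 2, bias = False)
        self.to_out = nn.Linear(dim_head, dim, bias = False)

    def forward(self, x):
        q = self.to_q(x)
        k, v = self.to_kv(x).chunk(2, dim = -1)

        if self.null_kv is not None:
            # prepend the learned null keys / values along the sequence dimension
            nk, nv = self.null_kv.unbind(dim = 0)
            nk, nv = map(lambda t: t.expand(x.shape[0], -1, -1), (nk, nv))
            k = torch.cat((nk, k), dim = -2)
            v = torch.cat((nv, v), dim = -2)

        attn = (q @ k.transpose(-1, -2) * q.shape[-1] ** -0.5).softmax(dim = -1)
        return self.to_out(attn @ v)

With the guard in place, a num_null_kv = 0 module simply has no null_kv parameter, so there is no empty tensor for accelerate's DDP wrapping to broadcast.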