lhotse-speech / lhotse

Tools for handling speech data in machine learning projects.
https://lhotse.readthedocs.io/en/latest/
Apache License 2.0

How to train with multiple datasets #554

Open danpovey opened 2 years ago

danpovey commented 2 years ago

Guys, I have been thinking about this for a few days and I think I am getting close to how we can solve this...

We tend to train on only one dataset (e.g. Librispeech); we rarely try to fold in other datasets (e.g. Gigaspeech). I'm thinking that we need to make it easier to train simultaneously on multiple datasets; this is the easiest way to improve our WERs. (For production purposes, you don't really care whether you used extra data, and lots is available.) So why do we train on only one dataset? Using multiple datasets is a hassle, partly due to domain mismatch. E.g. Gigaspeech is normalized differently from Librispeech, treats punctuation differently, and has different kinds of data; so it would either be a lot of work to merge with Librispeech, or it would degrade results when tested on Librispeech.

My proposal is that we get in the habit of using multiple RNN-T predictors/decoders and joiners, one per dataset. That way we don't have to worry about mismatch. In general we can even use different tokenizers per dataset, and different vocabulary sizes. That way we can even generalize to multiple languages. However, we need to be careful that we always use all of the datasets in each minibatch. If we don't do this it will cause problems with DDP, as it requires all parameters to be touched. We could just manually forward with dummy data if we notice a parameter isn't going to be used, so this isn't a 100% hard requirement, but if possible it's better to always have all datasets in a minibatch. This is known to be better for convergence anyway, versus alternating.
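
To make this concrete, here is a minimal sketch of what such a model could look like (this is not an existing icefall API; the class, argument names, and module interfaces are illustrative):

import torch.nn as nn

class MultiCorpusTransducer(nn.Module):
    """Hypothetical RNN-T with a shared encoder and per-dataset decoders/joiners."""

    def __init__(self, encoder: nn.Module, decoders: list, joiners: list):
        super().__init__()
        self.encoder = encoder
        self.decoders = nn.ModuleList(decoders)  # one per dataset (and per tokenizer/vocab)
        self.joiners = nn.ModuleList(joiners)    # one per dataset

    def forward(self, feats, feat_lens, tokens, dataset_idx: int):
        # The encoder is shared across all datasets ...
        enc_out, enc_lens = self.encoder(feats, feat_lens)
        # ... while the decoder/joiner (and hence the vocabulary) are chosen per dataset.
        dec_out = self.decoders[dataset_idx](tokens)
        logits = self.joiners[dataset_idx](enc_out, dec_out)
        return logits, enc_lens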

Anyway, my proposal is that we figure out at the Lhotse level how to do this. Now, we will require that the different datasets' features eventually be padded to the same length, so the encoder forward can be shared. (Later the core RNN-T computation can be shared too). So this will require some bucketing-sampler type of logic. What I propose is that we have a list of datasets and the user sets data proportions, as in, each minibatch wants x% of the data from each dataset. And we let one of the datasets define the epoch, e.g. the 1st one, and just cycle through the others somehow (might be complex to enable restarting from an epoch though).

I propose that all of the data elements returned by the dataloader be returned as tuples (of size num-datasets). We can manually concatenate tensors in any tuples that we want concatenated, e.g. the features.
I am thinking there can be an option that controls whether it's mandatory to have data from each dataset (note: if this option is set, it will impose a limit on the utterance length, and may also bias the data proportions unless we work around that somehow).

This is a fairly big feature but I think it's the right direction. Sorry that we have not, so far, done the unsupervised multi-dataset experiments with Gigaspeech that I asked for a feature for. We will do them at some point, with the codebook loss. But I think it will be easier to dip our toe into using multiple datasets by having it all-supervised, just with different RNN-T decoders. That should give a super-easy WER improvement without any real need to do research.

pzelasko commented 2 years ago

I like your idea!

The Lhotse part can be solved using one of these two approaches:

~1) multiple dataloaders~

EDIT: I realized this won't be viable because of different padding in each sub-batch, go straight to 2)

dloaders = [
  DataLoader(DynamicBucketingSampler(libri_cuts, max_duration=150, ...), ...),
  DataLoader(DynamicBucketingSampler(giga_cuts, max_duration=150, ...), ...),
  DataLoader(DynamicBucketingSampler(aishell_cuts, max_duration=150, ...), ...),
]

for batches in zip(*dloaders):
  # batches: Tuple[Dict[str, Tensor]]
  # combine feature matrices into a single mini-batch

  # save info about which sequence idx came from which dloader 
  # to use for decoder + joiner delegation later

  # forward, backward, ...

2) zip sampler

sampler = ZipSampler(
  DynamicBucketingSampler(libri_cuts, max_duration=150, ...),
  DynamicBucketingSampler(giga_cuts, max_duration=150, ...),
  DynamicBucketingSampler(aishell_cuts, max_duration=150, ...),
)
dloader = DataLoader(sampler, ...)

for batch in dloader:
  # batch: Dict[str, Tensor]
  # feature matrices are already merged in this case

  # get info about which sequence idx came from which sampler 
  # using batch["supervisions"]["cut"] and matching some pattern in cut IDs
  # (alternatively modify the Dataset class to prepare this info in the worker subprocess)

  # forward, backward, ...

pzelasko commented 2 years ago

What I propose is that we have a list of datasets and the user sets data proportions, as in, each minibatch wants x% of the data from each dataset.

Doable: if we specify the total max_duration and ratio for each corpus, we can compute per-sampler max_duration.
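
For example, a minimal sketch, assuming the manifests, corpus names, and ratios below are placeholders, of splitting a global max_duration across per-corpus samplers:

from lhotse import CutSet
from lhotse.dataset import DynamicBucketingSampler

total_max_duration = 150.0  # seconds of audio per mini-batch, summed over all corpora
ratios = {"libri": 0.5, "giga": 0.3, "aishell": 0.2}  # user-chosen proportions, sum to 1.0

cuts = {name: CutSet.from_file(f"{name}_cuts_train.jsonl.gz") for name in ratios}
samplers = {
    name: DynamicBucketingSampler(
        cuts[name],
        max_duration=total_max_duration * ratio,  # this corpus' share of each mini-batch
        shuffle=True,
    )
    for name, ratio in ratios.items()
}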

And we let one of the datasets define the epoch, e.g. the 1st one, and just cycle through the others somehow (might be complex to enable restarting from an epoch though).

I can add a RepeatSampler() if it turns out to be needed (currently ZipSampler will end an epoch when the shortest sampler finishes). Then every sampler in ZipSampler except the first can be wrapped with RepeatSampler.
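
A minimal sketch of what such a wrapper could look like (RepeatSampler is hypothetical at this point, as noted above):

class RepeatSampler:
    """Hypothetical wrapper: restart the wrapped sampler whenever it is exhausted,
    so only the first (unwrapped) sampler in ZipSampler defines the epoch boundary."""

    def __init__(self, sampler):
        self.sampler = sampler

    def __iter__(self):
        while True:
            yield from self.sampler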

I propose that all of the data elements returned by the dataloader be returned as tuples (of size num-datasets). We can manually concatenate tensors in any tuples that we want concatenated, e.g. the features. I am thinking there can be an option that controls whether it's mandatory to have data from each dataset (note: if this option is set, it will impose a limit on the utterance length, and may also bias the data proportions unless we work around that somehow).

The easiest way to achieve this is using ZipSampler(..., merge_batches=False) which will return a tuple of CutSets, and modifying K2SpeechRecognitionDataset to output data in the format you described.
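
A rough sketch of the second part; the MultiCorpusDataset wrapper below is hypothetical and only illustrates returning per-corpus tuples:

import torch
from typing import Tuple

from lhotse import CutSet
from lhotse.dataset import K2SpeechRecognitionDataset

class MultiCorpusDataset(torch.utils.data.Dataset):
    """Hypothetical wrapper: given the tuple of CutSets produced by
    ZipSampler(..., merge_batches=False), return a tuple of per-corpus batches."""

    def __init__(self, dataset: K2SpeechRecognitionDataset):
        self.dataset = dataset

    def __getitem__(self, cuts_per_corpus: Tuple[CutSet, ...]) -> tuple:
        # Each element is an ordinary K2SpeechRecognitionDataset batch dict;
        # features can then be padded and concatenated in the training loop.
        return tuple(self.dataset[cuts] for cuts in cuts_per_corpus)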

pzelasko commented 2 years ago

Hmm, I see one possible issue: the bucketing samplers in the zip sampler would be unsynchronized between the different corpora, so you'd possibly end up getting much shorter cuts from one subset than from the other within a single mini-batch. Let me think about that for a while.

danpovey commented 2 years ago

Mm, might be hard to solve fully, as the datasets will have different length distributions. We could do better than random, though. We could investigate whether it converges OK when using alternating, separate minibatches, or simply accumulating both before optim.step(). We got rid of batchnorm, which solves that issue.

danpovey commented 2 years ago

Dummy loss to avoid unused parameters

@csukuangfj I am thinking we could add in icefall a function that will touch parameters of an unused module by adding a zero dummy part to the loss function, e.g.:

 loss = loss + icefall.dummy_loss(model.decoders[1], model.joiners[1])

where icefall.dummy_loss just returns 0.0 times the sum of the .sum() of each parameter in the passed-in modules. Then we could, I think, alternate minibatches from different datasets. Of course this is a little limiting, because it requires having about the same amount of data from each source, unless we sometimes omit one source.
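
A minimal sketch of what such a helper might look like (icefall does not necessarily ship a dummy_loss; the name just follows the example above):

import torch
import torch.nn as nn

def dummy_loss(*modules: nn.Module) -> torch.Tensor:
    """Return 0.0 times the sum of all parameters of the given modules, so every
    parameter participates in the autograd graph without changing the loss value."""
    param_sum = sum(p.sum() for m in modules for p in m.parameters())
    return 0.0 * param_sum

# intended usage, as in the snippet above:
#   loss = loss + dummy_loss(model.decoders[1], model.joiners[1])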

DDP issues if we accumulate gradients

If we choose to accumulate gradients from the different datasets before synchronizing them, we have to be careful, because DDP does not work correctly if you do backward() twice: the second time, it would aggregate the already-aggregated gradients from the first minibatch in addition to those from the second minibatch, so the first minibatch's grad would be scaled up by num_workers. Instead of tensor.backward(), we would need to do autograd.grad() the first time, which involves giving it a list of the parameters we want the gradient for (i.e. model.parameters()). See: https://discuss.pytorch.org/t/ddp-second-backward-accumulate-the-wrong-gradient/128775/3
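
As an aside, here is a hedged sketch of accumulating over two per-dataset mini-batches without that double-counting problem, using DDP's documented no_sync() context manager rather than autograd.grad(); compute_loss, batch_a, batch_b, model, and optimizer are placeholders for whatever the training loop already provides:

# `model` is the DDP-wrapped module.
optimizer.zero_grad()

with model.no_sync():
    # No gradient synchronization inside this block: the first dataset's
    # gradients simply accumulate locally in param.grad.
    loss_a = compute_loss(model, batch_a)
    loss_a.backward()

# The first backward() outside no_sync() synchronizes the accumulated
# gradients (batch_a + batch_b) across workers exactly once.
loss_b = compute_loss(model, batch_b)
loss_b.backward()

optimizer.step()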

Does accumulating gradients even matter?

We can investigate whether it actually matters whether we accumulate gradients, or simply do a step() each time. I have seen claims that it can have bad effects on convergence if you use different types of data in different minibatches rather than combining them, but I don't recall whether:

csukuangfj commented 2 years ago

I am thinking we could add in icefall a function that will touch parameters of an unused module by adding a zero dummy part to the loss function, e.g.:

I think we can just set find_unused_parameters=True in DDP, and DDP will handle the case where some layers are used by some nodes but not by others.

The following is a small demo to verify that it is feasible to use find_unused_parameters=True to prevent hanging in DDP.

ex4.py.txt

#!/usr/bin/env python3

import os
import torch.distributed as dist
import torch.multiprocessing as mp
from torch.nn.parallel import DistributedDataParallel as DDP

import torch
import torch.nn as nn

import datetime

def get_data(idx: int):
    ans = torch.tensor([idx, idx + 1, idx + 2], dtype=torch.float32, requires_grad=True)
    return ans

class Model(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.linear0 = nn.Linear(3, 2)
        self.linear1 = nn.Linear(3, 2)
        self.linear2 = nn.Linear(3, 2)

    def forward(self, x: torch.Tensor, idx: int):
        # Each rank uses only one of the three linear layers, so the other two
        # layers' parameters are unused in that rank's forward/backward pass.
        if idx == 0:
            y = self.linear0(x)
        elif idx == 1:
            y = self.linear1(x)
        elif idx == 2:
            y = self.linear2(x)
        else:
            raise ValueError("idx should be 0, 1 or 2")
        return y.sum()

def run(rank: int, world_size: int):
    print(f"world_size: {world_size}")
    device = torch.device("cuda", rank)

    model = Model()
    model.to(device)
    print(f"model: {model}")
    model = DDP(model, device_ids=[rank], find_unused_parameters=True)

    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

    for i in range(3):
        print(f"iter: {i}")
        data = get_data(rank + i).to(device)
        print(f"rank: {rank}, data: {data}")

        optimizer.zero_grad()
        y = model(data, rank)
        print("y", y)
        y.backward()
        optimizer.step()

    print(f"rank {rank} done")

def init_process(rank: int, world_size: int, fn):
    os.environ["MASTER_ADDR"] = "localhost"
    os.environ["MASTER_PORT"] = "12357"
    dist.init_process_group(
        "nccl", rank=rank, world_size=world_size, timeout=datetime.timedelta(0, 5)
    )
    fn(rank, world_size)

if __name__ == "__main__":
    print(f"dist.is_available: {dist.is_available()}")
    world_size = 3
    processes = []
    mp.set_start_method("spawn")
    for rank in range(world_size):
        p = mp.Process(target=init_process, args=(rank, world_size, run))
        p.start()
        processes.append(p)

    for p in processes:
        p.join()

The following lists the output for find_unused_parameters=True and find_unused_parameters=False.

find_unused_parameters=False

(It throws at iteration 1)

dist.is_available: True
world_size: 3
world_size: 3
world_size: 3
model: Model(
  (linear0): Linear(in_features=3, out_features=2, bias=True)
  (linear1): Linear(in_features=3, out_features=2, bias=True)
  (linear2): Linear(in_features=3, out_features=2, bias=True)
)
model: Model(
  (linear0): Linear(in_features=3, out_features=2, bias=True)
  (linear1): Linear(in_features=3, out_features=2, bias=True)
  (linear2): Linear(in_features=3, out_features=2, bias=True)
)
model: Model(
  (linear0): Linear(in_features=3, out_features=2, bias=True)
  (linear1): Linear(in_features=3, out_features=2, bias=True)
  (linear2): Linear(in_features=3, out_features=2, bias=True)
)
iter: 0
iter: 0
iter: 0
rank: 0, data: tensor([0., 1., 2.], device='cuda:0', grad_fn=<CopyBackwards>)
rank: 2, data: tensor([2., 3., 4.], device='cuda:2', grad_fn=<CopyBackwards>)
rank: 1, data: tensor([1., 2., 3.], device='cuda:1', grad_fn=<CopyBackwards>)
y tensor(1.4704, device='cuda:0', grad_fn=<SumBackward0>)
y tensor(-0.6438, device='cuda:2', grad_fn=<SumBackward0>)
y tensor(-0.9560, device='cuda:1', grad_fn=<SumBackward0>)
iter: 1
iter: 1
iter: 1
rank: 1, data: tensor([2., 3., 4.], device='cuda:1', grad_fn=<CopyBackwards>)
rank: 2, data: tensor([3., 4., 5.], device='cuda:2', grad_fn=<CopyBackwards>)
rank: 0, data: tensor([1., 2., 3.], device='cuda:0', grad_fn=<CopyBackwards>)
Process Process-3:
Process Process-1:
Process Process-2:
Traceback (most recent call last):
Traceback (most recent call last):
  File "/root/fangjun/open-source/pyenv/versions/3.8.6/lib/python3.8/multiprocessing/process.py", line 315, in _bootstrap
    self.run()
  File "/root/fangjun/open-source/pyenv/versions/3.8.6/lib/python3.8/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "xxx/ex4.py", line 69, in init_process
    fn(rank, world_size)
  File "/xxx/ddp/ex4.py", line 55, in run
    y = model(data, rank)
  File "/ceph-fj/fangjun/py38/lib/python3.8/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/ceph-fj/fangjun/py38/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 606, in forward
    if self.reducer._rebuild_buckets():
  File "/root/fangjun/open-source/pyenv/versions/3.8.6/lib/python3.8/multiprocessing/process.py", line 315, in _bootstrap
    self.run()
  File "/root/fangjun/open-source/pyenv/versions/3.8.6/lib/python3.8/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/xxx/ex4.py", line 69, in init_process
    fn(rank, world_size)
RuntimeError: Expected to have finished reduction in the prior iteration before starting a new one. This error indicates that your module has parameters that were not used in producing loss. You can enable unused parameter detection by (1) passing the keyword argument `find_unused_parameters=True` to `torch.nn.parallel.DistributedDataParallel`; (2) making sure all `forward` function outputs participate in calculating loss. If you already have done the above two steps, then the distributed data parallel module wasn't able to locate the output tensors in the return value of your module's `forward` function. Please include the loss function and the structure of the return value of `forward` of your module when reporting this issue (e.g. list, dict, iterable).
  File "/ceph-fj/fangjun/open-source-2/.n/programming-notes/pytorch/code/ddp/ex4.py", line 55, in run
    y = model(data, rank)
  File "/ceph-fj/fangjun/py38/lib/python3.8/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)

... ...

find_unused_parameters=True

(Everything works fine)

iter: 0
iter: 0
iter: 0
rank: 0, data: tensor([0., 1., 2.], device='cuda:0', grad_fn=<CopyBackwards>)
rank: 1, data: tensor([1., 2., 3.], device='cuda:1', grad_fn=<CopyBackwards>)
rank: 2, data: tensor([2., 3., 4.], device='cuda:2', grad_fn=<CopyBackwards>)
y tensor(-1.3472, device='cuda:0', grad_fn=<SumBackward0>)
y tensor(0.7107, device='cuda:1', grad_fn=<SumBackward0>)
y tensor(3.9672, device='cuda:2', grad_fn=<SumBackward0>)
iter: 1
iter: 1
iter: 1
rank: 0, data: tensor([1., 2., 3.], device='cuda:0', grad_fn=<CopyBackwards>)
y tensor(-2.1136, device='cuda:0', grad_fn=<SumBackward0>)
rank: 2, data: tensor([3., 4., 5.], device='cuda:2', grad_fn=<CopyBackwards>)
rank: 1, data: tensor([2., 3., 4.], device='cuda:1', grad_fn=<CopyBackwards>)
y tensor(5.0302, device='cuda:2', grad_fn=<SumBackward0>)
y tensor(0.8880, device='cuda:1', grad_fn=<SumBackward0>)
iter: 2
iter: 2
iter: 2
rank: 0, data: tensor([2., 3., 4.], device='cuda:0', grad_fn=<CopyBackwards>)
rank: 2, data: tensor([4., 5., 6.], device='cuda:2', grad_fn=<CopyBackwards>)
rank: 1, data: tensor([3., 4., 5.], device='cuda:1', grad_fn=<CopyBackwards>)
y tensor(-2.8906, device='cuda:0', grad_fn=<SumBackward0>)
y tensor(1.0535, device='cuda:1', grad_fn=<SumBackward0>)
y tensor(6.0812, device='cuda:2', grad_fn=<SumBackward0>)
rank 0 done
rank 1 done
rank 2 done

danpovey commented 2 years ago

Oh, cool. Check if it works if different jobs use different params.

csukuangfj commented 2 years ago

Check if it works if different jobs use different params.

Do you mean node_1 uses model_1, node_2 uses model_2, ..., node_N uses model_N?

danpovey commented 2 years ago

or random

csukuangfj commented 2 years ago

    model = Model()
    model.to(device)
    print(f"model: {model}")
    model = DDP(model, device_ids=[rank], find_unused_parameters=True)

I think all nodes use the same model. For each iteration, different nodes may process different layers of the model.

Check if it works if different jobs use different params.

"jobs" here means nodes, I think. All nodes start with a single model that has the same parameters before training. What do you mean by using different params?

csukuangfj commented 2 years ago

The support for multiple datasets is partially solved in https://github.com/lhotse-speech/lhotse/pull/565

It is only partially solved, since the original proposal says:

What I propose is that we have a list of datasets and the user sets data proportions, as in, each minibatch wants x% of the data from each dataset. And we let one of the datasets define the epoch, e.g. the 1st one, and just cycle through the others somehow (might be complex to enable restarting from an epoch though).

#565 iterates through all the given datasets instead of stopping when one of them is exhausted.

Suppose we combine librispeech and gigaspeech: we would want to start a new epoch as soon as the librispeech dataset is exhausted. Since gigaspeech is much larger than librispeech, we don't want to increase the training time per epoch too much when introducing gigaspeech into the librispeech training pipeline.

danpovey commented 2 years ago

Cool! I just want to remind you that the current plan re: librispeech+gigaspeech is to have them in separate minibatches, which would require separate samplers (so we wouldn't do that multiplexing).

ngoel17 commented 2 years ago

What do you think about combining segments from different datasets to create "longer" segments that satisfy the "max-duration" constraint? Possibly padding the in-between space with silence.

csukuangfj commented 2 years ago

What do you think about combining segments from different datasets to create "longer" segments that satisfy the "max-duration" constraint? Possibly padding the in-between space with silence.

I am worried about the attention stuff. Frames from one utterance can attend to frames from other utterances in the case of concatenation.

But if your model doesn't require full context, concatenation is fine, I think.

pzelasko commented 2 years ago

Wouldn't well-behaved attention learn to attend to the right utterances? It actually seems like it could be easier than in the case of long (15s+) utterances, because each utterance is more distinct (speaker/channel), giving an extra clue to the model.

csukuangfj commented 2 years ago

That's what I am worried about. We can try and see whether it helps.

pzelasko commented 2 years ago

If you're going to try it, there are existing functions in Lhotse to make it easy; just add these lines somewhere in the dataset (or use them as transforms):

# cuts is a mini-batch CutSet, not the full dataset
cuts = CutConcatenate(...)(cuts)  # set concatenation args in __init__
cuts = cuts.merge_supervisions()  # creates one big supervision with concatenated text

ahmedalbahnasawy commented 1 year ago

@pzelasko @danpovey Do you think combining several datasets such as Librispeech, Gigaspeech, tedlium, and common-voice would help with generalization? I did this experiment in kaldi, but the model was extremely weak. Do you think k2-icefall can tackle this problem?

csukuangfj commented 1 year ago

@pzelasko @danpovey Do you think combining several datasets such as Librispeech, Gigaspeech, tedlium, and common-voice would help with generalization? I did this experiment in kaldi, but the model was extremely weak. Do you think k2-icefall can tackle this problem?

We actually support using multiple datasets for training in icefall.

Results show that it indeed helps training. For instance, we have been using Gigaspeech + LibriSpeech for pruned RNN-T training; the resulting model gives a WER of 2.0 on LibriSpeech test-clean.

The results are available at https://github.com/k2-fsa/icefall/blob/master/egs/librispeech/ASR/RESULTS.md#librispeech-bpe-training-results-pruned-transducer-3-2022-05-13


Note: For multi-dataset training in icefall, utterances in one batch all come from the same dataset. We randomly select a batch from one of the datasets at each training iteration, while in lhotse, utterances in one batch may come from multiple datasets, I believe.
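
For illustration only (icefall's actual implementation differs), here is a sketch of the behavior described above, where each batch is drawn from a single, randomly chosen corpus; the loaders, weights, and num_steps are placeholders:

import random

loaders = {"libri": libri_loader, "giga": giga_loader}  # hypothetical per-corpus dataloaders
weights = {"libri": 0.5, "giga": 0.5}                   # sampling probabilities, illustrative
iters = {name: iter(dl) for name, dl in loaders.items()}

for step in range(num_steps):
    # Pick one corpus for this step; the whole batch comes from it.
    name = random.choices(list(weights), weights=list(weights.values()))[0]
    try:
        batch = next(iters[name])
    except StopIteration:
        iters[name] = iter(loaders[name])  # restart the exhausted corpus
        batch = next(iters[name])
    # forward/backward on `batch`, e.g. using the decoder/joiner associated with `name`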

ngoel17 commented 1 year ago

Just FYI, I did try combining multiple datasets more directly with the same normalization, and that also helped. I made sure that the target-domain batches were frequent enough and that the other datasets were mixed in using Lhotse.mux().
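
For reference, a minimal sketch of that kind of mixing with Lhotse's CutSet.mux; the manifests and weights are illustrative, and the exact signature should be checked against the current Lhotse docs:

from lhotse import CutSet

target_cuts = CutSet.from_file("target_domain_cuts_train.jsonl.gz")
giga_cuts = CutSet.from_file("gigaspeech_cuts_train.jsonl.gz")

# Lazily interleave the corpora; the weights control how often each one is drawn,
# so the target domain stays frequent enough in the stream.
mixed_cuts = CutSet.mux(target_cuts, giga_cuts, weights=[0.7, 0.3])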

ahmedalbahnasawy commented 1 year ago

Thanks @csukuangfj for your answer. I already tested the Stateless3 model (librispeech + gigaspeech); it is a really robust model. However, English abbreviations, street names, and names still can't be recognized correctly. This is why I was thinking of adding tedlium and common-voice on top of the (libri+giga) data. From my basic understanding of RNN-T, it has an implicit language model. Do you think it is doable to fine-tune one of these obtained models by adding more words to the lexicon.txt file, similar to kaldi? @ngoel17 it makes a lot of sense.