huggingface / transformers

🤗 Transformers: State-of-the-art Machine Learning for Pytorch, TensorFlow, and JAX.
https://huggingface.co/transformers
Apache License 2.0

Add Transformer-Transducer: A Streamable Speech Recognition Model with Transformer Encoders and RNN-T Loss #20961

Closed. jp1924 closed this issue 1 year ago.

jp1924 commented 1 year ago

Model description

paper: Transformer Transducer: A Streamable Speech Recognition Model with Transformer Encoders and RNN-T Loss

Open source status

Provide useful links for the implementation

jp1924: Transformer-Transducer

sanchit-gandhi commented 1 year ago

Hey @jp1924! Thanks for opening this new model request. The Transformer-Transducer is indeed an exciting architecture in speech recognition - one that achieves both strong performance and low latency.

My hesitations in adding this model stem from the fact that the weights are not open-sourced. It's very rare that we add a model to transformers when the weights are not open-sourced; having no weights means that the model can only be used with randomly initialised parameters, which is not much good for downstream ASR!

Models with randomly initialised weights require extensive training with generous GPU/TPU compute to produce trained checkpoints. In many cases, it is very difficult to reproduce the exact results from the paper due to differences in data, training set-up and compute resources.

On the other hand, pre-trained models (weights) are extremely valuable to the community as they can be used directly without training (or with a small amount of fine-tuning) for downstream inference tasks. Consequently, we tend to focus our efforts in transformers on adding models where pre-trained weights are available.

This is not to discourage you from contributing a Transformer-Transducer model to transformers. Such a contribution would be most welcome! However, taking into account the above points, I would advise that you focus the implementation on a Transformer-Transducer codebase where strong pre-trained weights are available and open-sourced. I'm more than happy to help find a suitable codebase + weights to port! This would be a valuable addition to the transformers library.

flozi00 commented 1 year ago

Thanks to projects like hivemind, and to community members like @fxtentacle, it is possible to organise the computing capacity.

If Hajo is still interested in that use case, we could try to pretrain a German model. What do you think?

jp1924 commented 1 year ago

Thank you for the advice, @sanchit-gandhi! In addition to implementing the code, I will find a way to provide weights!

The contents below have nothing to do with the exact implementation of the model! As you said, it takes a lot of resources to train a Transformer-Transducer: with the hyper-parameters described in the paper, it would need to run for 730 epochs.

The problem is that we would essentially have to train the encoders from scratch. So what I'm considering, experimentally, is replacing the audio and label encoders with pre-trained models (like Wav2Vec2 or BERT).


You're always welcome to do that if the model can help with your project, @flozi00!

But this model hasn't been validated yet. I don't know when you plan to start pre-training, but I still need to stabilise the generation algorithm and the tokenizer, so could you wait a little bit? I'm working on this alongside my company work, so I think it will take some time!

And I have some questions about training with German data.

  1. What dataset are you going to use?
  2. Is it possible to verify the model when training the model using the data?
  3. Do you have a comparator to use for verification?

Those are my questions; I'd appreciate an answer!

From here on, it's about verification! You don't have to read it if you don't need it.

This is purely empirical. When I measured performance on my native-language dataset, KsponSpeech, using its test-clean split, Wav2Vec2 was around 20% WER and RNN-T was around 30%. I think the performance range will be around 5-10% if German is trained as well.

sanchit-gandhi commented 1 year ago

So what I'm considering, experimentally, is replacing the audio and label encoders with pre-trained models (like Wav2Vec2 or BERT).

This is a good idea! The only constraint then is that our encoder network must take the same architecture as Wav2Vec2 in order for all of the pre-trained weights to be compatible with the network.

Since Wav2Vec2 is a different architecture to the Transformer network used in the Transformer-Transducer model, we'll likely only be able to load a subset of the pre-trained weights into the T-T model this way.
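
To make "loading a subset of the weights" concrete, here's a rough sketch (the TTAudioEncoder class below is entirely made up for illustration, and the checkpoint name is just an example): only the tensors whose names and shapes line up with Wav2Vec2's encoder sub-module get copied over, and everything else stays randomly initialised.

import torch.nn as nn
from transformers import Wav2Vec2Model

class TTAudioEncoder(nn.Module):
    """Hypothetical T-T audio encoder: reuses Wav2Vec2's transformer encoder,
    but has its own (randomly initialised) front-end and output projection."""
    def __init__(self, config):
        super().__init__()
        self.frontend = nn.Linear(80, config.hidden_size)     # e.g. log-mel input
        self.encoder = Wav2Vec2Model(config).encoder          # same layer names as Wav2Vec2
        self.out_proj = nn.Linear(config.hidden_size, config.hidden_size)

wav2vec2 = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-base-960h")
tt_encoder = TTAudioEncoder(wav2vec2.config)

# prefix the Wav2Vec2 encoder weights so their names match the T-T module
pretrained = {f"encoder.{k}": v for k, v in wav2vec2.encoder.state_dict().items()}
target = tt_encoder.state_dict()

# keep only the tensors that exist in both models with identical shapes
compatible = {k: v for k, v in pretrained.items() if k in target and v.shape == target[k].shape}
target.update(compatible)
tt_encoder.load_state_dict(target)
print(f"copied {len(compatible)} tensors; {len(target) - len(compatible)} stay randomly initialised")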

jp1924 commented 1 year ago

Thank you for answering even though it's an experimental idea, @sanchit-gandhi!

There's something I didn't quite understand while reading: what do you mean by a "subset"? Since the structures of the Wav2Vec2 and T-T models are different, do you mean taking only the encoder part of the pre-trained Wav2Vec2?

sanchit-gandhi commented 1 year ago

Hey @jp1924 - we'll only be able to load the weights of the Wav2Vec2 model into the Transformer-Transducer if the T-T has the same encoder architecture as Wav2Vec2. If it doesn't, then this won't be possible (or we'll only be able to load the weights that do match, which still leaves some randomly initialised).

sanchit-gandhi commented 1 year ago

I'd like to reiterate that adding a T-T model to Transformers would be amazing and think it's great you're excited by this too!

We should be selective though and only add the model if weights are already available, preferably 'official' ones, as it's very hard to emulate these strong pre-trained checkpoints without the data/compute.

If this isn't the case, it's very difficult to justify adding the model to transformers (the torch model is not much use without the trained params to go with it!)

jp1924 commented 1 year ago

Sorry for the late reply, @sanchit-gandhi! I'll experiment right away; I think it will be possible if I just modify the encoder part of the Transformer-Transducer model!

But there's one thing I'm worried about: the CNN layer of Wav2Vec2. In streaming mode the model may receive chunks as short as 25 ms of audio, and I don't know how the CNN feature encoder will behave with inputs that short. Maybe we need more experiments on this.


This is the fallback to try if the above method doesn't work; you don't have to read it. The second-best approach is to pre-train the AudioEncoder using Gumbel-softmax.

In fact, the main difference between Wav2Vec2 and the T-T audio encoder is how raw audio is fed into the Transformer encoder: Wav2Vec2 compresses the audio with a CNN, whereas T-T converts the audio to mel features, compresses it through windowing, and feeds that into the encoder layers.

My idea, then, is to convert the audio into windowed mel features and pre-train the T-T AudioEncoder on those.
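
To make the "windowed mel" idea concrete, here is a rough sketch using torchaudio (the window, hop and stacking numbers are just placeholders, not the exact values from the paper):

import torch
import torchaudio

sample_rate = 16000
mel = torchaudio.transforms.MelSpectrogram(
    sample_rate=sample_rate,
    n_fft=400,        # 25 ms window
    hop_length=160,   # 10 ms hop
    n_mels=80,
)

waveform = torch.randn(1, sample_rate * 4)        # 4 seconds of dummy audio
features = torch.log(mel(waveform) + 1e-6)        # (1, 80, num_frames) log-mel features

# stack neighbouring frames and subsample, to "compress" the mel frames
# before they go into the Transformer encoder (numbers are illustrative only)
frames = features.squeeze(0).transpose(0, 1)      # (num_frames, 80)
stacked = frames.unfold(0, 4, 3)                  # (num_windows, 80, 4)
stacked = stacked.reshape(stacked.size(0), -1)    # (num_windows, 320)
print(stacked.shape)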

I don't really like this option either, because pre-training the T-T model this way is obviously going to take a lot of resources and time... if possible, this is the last approach I want to use.

sanchit-gandhi commented 1 year ago

Hey @jp1924, my feeling is that pre-training the model is going to be difficult from two perspectives:

  1. Hacky architecture: we won't be able to pre-train the correct T-T architecture, but only some modified Wav2Vec2-T version
  2. Pre-training is expensive, both in terms of time and compute

Also, I'd like to re-iterate that it's very unlikely that we'd add such a model to Transformers - we can only really add 'official' implementations (i.e. the 'official' code and the 'official' weights), see https://github.com/huggingface/transformers/issues/20961#issuecomment-1382245091.

My recommendation would be to find an official T-T implementation where both the code and weights are available and see whether we can add this to Transformers!

Feel free to post any findings here - we can discuss them and pick the right one for the T-T integration!

Re-iterating my excitement for the T-T integration! We need to find 'official' code + checkpoints before we commit to integrating

jp1924 commented 1 year ago

Thank you for leaving a comment, @sanchit-gandhi!

The code and weights of the T-T model have not been officially released... and even the T-T implementations that users have made personally come without weights. The code would not be official, but would it be possible to use such code to train the model and upload the weights to the Hub? Assuming, of course, that the model I train reaches performance similar to the paper.

fxtentacle commented 1 year ago

@flozi00 I personally probably have to pass on releasing an open-source T-T model due to a non-compete covering a closed-source T-T which I built. That said, the last time I talked to them, the University of Lübeck still had a few NVIDIA DGX available for research projects. The main requirement for such research GPU use is to write a 2-3 page paper about what worked and what didn't afterwards, so it's not a very big hurdle.

@sanchit-gandhi In my experience, a T-T can be trained quite cheaply with transfer learning. For the label encoder, you force it to produce the same logits as a pre-trained T5 (of which there are plenty on HF). For the acoustic encoder, you force it to imitate the logits from a pre-trained wav2vec2. You can even pre-compute the label and acoustic logit I/O pairs as a temporary dataset. Because you're now training the T-T components against fixed I/O pairs, as opposed to doing alignment while training, they will converge really quickly, like a few days on an A100 each. For the join/merge network, you can pre-generate forced alignment data (e.g. from wav2vec2) and then train against those.
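
For the label-encoder side, a rough sketch of what I mean could look like this (the student module and hyper-parameters below are made up, and t5-small is just an example teacher; in practice you would pre-compute and cache the teacher outputs once instead of calling the teacher in the loop):

import torch
import torch.nn as nn
from transformers import T5EncoderModel, T5TokenizerFast

teacher = T5EncoderModel.from_pretrained("t5-small").eval()   # example teacher only
tokenizer = T5TokenizerFast.from_pretrained("t5-small")

# hypothetical T-T label encoder: any module mapping token ids -> hidden states
# with the same width as the teacher (512 for t5-small)
label_encoder = nn.Sequential(
    nn.Embedding(len(tokenizer), 512),
    nn.TransformerEncoder(nn.TransformerEncoderLayer(512, 8, batch_first=True), num_layers=2),
)
optimizer = torch.optim.AdamW(label_encoder.parameters(), lr=1e-4)

texts = ["i'm so hungry", "the cat sat on the mat"]   # stand-in for a real text corpus

for step in range(10):   # toy training loop
    batch = tokenizer(texts, return_tensors="pt", padding=True)
    with torch.no_grad():   # these teacher targets can be pre-computed offline
        target = teacher(**batch).last_hidden_state
    student = label_encoder(batch["input_ids"])
    loss = nn.functional.l1_loss(student, target)   # absolute-error loss against fixed targets
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()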

YooSungHyun commented 1 year ago

@fxtentacle Hi! In my experience, a pre-trained wav2vec2 is a full-attention model, so I don't think it is useful for a T-T.

When I printed the output for 1) the full 10 seconds and 2) just 1 second taken from those 10 seconds, and compared the 1-second portion of 1)'s vectors with 2)'s vectors, the values were different from each other.

So I think that if I say "i'm so hun" versus "i'm so hungry", the acoustic vector for the "hun" sound is not consistent!

Maybe you meant a wav2vec2 model pre-trained on a streaming-like dataset?

fxtentacle commented 1 year ago

@YooSungHyun when you have a dataset of audio, you can use a pre-trained wav2vec2 to generate logits for every timestep. Normally, you would then resolve those logits into text using the language model, but instead you can also just save them as a new dataset. So then you have the raw audio on the one side and the time-aligned logits from wav2vec2 on the other side. And that data can be used to train the acoustic encoder of a T-T. You feed a chunk of the raw audio into the encoder and then use the difference to your "known good" logits from wav2vec2 as the loss signal. Doing so removes the uncertainty w.r.t. the time alignment, because you already know where in time each logit was emitted by wav2vec2. And that greatly speeds up training the acoustic encoder, because you can use an absolute error loss instead of using a CTC loss. And that produces a much cleaner gradient to learn from.
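
Concretely, the pre-computation step could look roughly like this (the checkpoint name, file handling and list_of_wav_files are placeholders; Wav2Vec2ForCTC emits one logit vector per ~20 ms frame):

import torch
import torchaudio
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

model_name = "facebook/wav2vec2-base-960h"   # example teacher checkpoint
processor = Wav2Vec2Processor.from_pretrained(model_name)
teacher = Wav2Vec2ForCTC.from_pretrained(model_name).eval()

@torch.no_grad()
def make_targets(wav_path: str) -> dict:
    """Return one (raw audio, time-aligned teacher logits) pair for a wav file."""
    waveform, sr = torchaudio.load(wav_path)
    waveform = torchaudio.functional.resample(waveform, sr, 16000).mean(dim=0)
    inputs = processor(waveform.numpy(), sampling_rate=16000, return_tensors="pt")
    logits = teacher(inputs["input_values"]).logits.squeeze(0)   # (num_frames, vocab_size)
    return {"audio": waveform, "target_logits": logits}

# pairs = [make_targets(path) for path in list_of_wav_files]   # then e.g. torch.save(pairs, ...)
# The T-T acoustic encoder can then be trained frame-by-frame against target_logits
# with an L1/MSE loss, since the time alignment is already fixed by the teacher.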

jp1924 commented 1 year ago

Thank you for your idea, @fxtentacle! I ran a test based on it!

This is how I understand what @YooSungHyun said:

When the full utterance "i'm so hungry" is input at once, Wav2Vec2 produces one set of vectors; in the streaming case, where the audio for "i'm" -> "so" -> "hungry" arrives piece by piece, the vectors for those pieces differ from the corresponding parts of the full-utterance vectors.

So the solution to this problem is: if there is a difference, make the split_vector similar (or equal) to the corresponding part of the full_vector, via a loss computed between the full_vector obtained from "i'm so hungry" and the split_vector obtained from the split_audio (e.g. the audio cut into one-second pieces).

So based on the above understanding, I made the code below, but there was a problem.

from transformers import Wav2Vec2Model, Wav2Vec2Config
import torch


@torch.no_grad()  # inference only, no gradients needed
def main() -> None:
    model_name = "patrickvonplaten/wav2vec2-librispeech-clean-100h-demo-dist"
    cache_dir = None  # optionally set a local cache directory

    config = Wav2Vec2Config.from_pretrained(
        model_name,
        cache_dir=cache_dir,
        apply_spec_augment=False,
    )
    model = Wav2Vec2Model.from_pretrained(model_name, cache_dir=cache_dir, config=config)

    sampling_rate = 16000
    batch_size = 2
    audio_size = [254080, 101600, 293600, 82880]
    #    sec   =  15.88,  6.35,   18.35,  5.18

    dummy_datas = [torch.rand((batch_size, audio_len)) for audio_len in audio_size]

    for full_audio in dummy_datas:
        # "labels": hidden states obtained from the full, uncut audio
        outputs = model(full_audio)
        labels = outputs[0]

        input_values = torch.zeros(labels.size())
        full_size = full_audio.size(1)
        stack_size = 0
        check_list = []  # only used to double-check the total chunk length

        # [NOTE]: Cut the audio into 1-second pieces.
        #         If a 15.88-second audio is cut per second, 16 split_audios are generated.
        for idx, split_idx in enumerate(range(0, full_size, sampling_rate), start=1):
            split_audio = full_audio[:, split_idx : (split_idx + sampling_rate)]

            # hidden states obtained from the 1-second chunk only
            outputs = model(split_audio)
            hidden_states = outputs[0]
            check_list.append(hidden_states)
            hidden_size = hidden_states.shape[1]

            # concatenate the chunk outputs back into one sequence
            input_values[:, stack_size : stack_size + hidden_size] = hidden_states
            stack_size += hidden_size

        state_size = sum(state.shape[1] for state in check_list)
        print("\n---------- result ----------")
        print(f"audio_length: {full_audio.shape[1] / sampling_rate}")
        print(f"labels_length: {labels.shape[1]}")
        print(f"actual_length: {state_size}")
        print(f"difference: {labels.shape[1] - state_size}")
        print(f"repeat_num: {idx}")


if __name__ == "__main__":
    main()

For example, if you put a full_audio of size n into Wav2Vec2, you get labels: a sequence of, say, 7 hidden-state vectors.

Then, cut the same full_audio of size n into per-second pieces (say four split_audios), put each piece into Wav2Vec2 to get its split_vector, and concatenate the results to get input_values.

My expectation is that labels and input_values should have the same length when the audio is processed this way. However, when I run the code above, there is a difference in length.

The picture below is a brief illustration of the problem.
[issue_diagram: image comparing the length of the labels from the full audio with the concatenated outputs of the per-second chunks]

When you actually run the code above, extracting labels and input_values from the 15.88-second audio gives a difference of 15. The reason the difference is 15 instead of 16 is that the same length-1 effect applies to every input, including the full audio: each of the 16 chunks comes out roughly one frame short, but the full-audio labels are also one frame short, so the net difference is 15 rather than 16.

The serious part of this problem is that the difference in length between input_values and labels grows in proportion to the length of the audio.

When I looked into the cause, I think the length-1 problem occurs while passing through the Wav2Vec2FeatureEncoder (the CNN).
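
To check this, I computed the expected feature-encoder output lengths by hand from the conv kernel sizes and strides in the config (this is my own calculation, just the standard no-padding conv length formula):

from transformers import Wav2Vec2Config

config = Wav2Vec2Config.from_pretrained(
    "patrickvonplaten/wav2vec2-librispeech-clean-100h-demo-dist"
)

def conv_output_length(num_samples: int) -> int:
    """Number of frames the Wav2Vec2FeatureEncoder produces for a given input length."""
    length = num_samples
    for kernel, stride in zip(config.conv_kernel, config.conv_stride):
        length = (length - kernel) // stride + 1   # conv layers have no padding
    return length

sampling_rate = 16000
full_size = 254080   # the 15.88-second example

full_frames = conv_output_length(full_size)
chunk_frames = sum(
    conv_output_length(min(sampling_rate, full_size - start))
    for start in range(0, full_size, sampling_rate)
)
print(full_frames, chunk_frames, full_frames - chunk_frames)   # e.g. 793 778 15 with the base conv config

Each chunk independently drops the tail samples that don't fill a complete receptive field, so the more chunks the audio is cut into, the larger the total gap becomes, which matches the difference growing with audio length.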

The solution I can think of is to pad split_vector with zeros (or initialise the padding with Xavier, Kaiming initialisation, etc.), but I'm worried because it's not a fundamental solution.

Is there any way to fundamentally solve the problem other than attaching a pad?