huggingface / transformers

πŸ€— Transformers: State-of-the-art Machine Learning for Pytorch, TensorFlow, and JAX.
https://huggingface.co/transformers
Apache License 2.0

Examples for speech recognition trainings from scratch #14523

Closed · stefan-falk closed this issue 2 years ago

stefan-falk commented 2 years ago

πŸš€ Feature request

Fine-tuning is rather straightforward, but it looks to me as if running a training from scratch isn't. I am rather new to πŸ€—, but from what I've learned so far, it's rather tricky to find out how to start a new Speech2Text training (for example).

We have `run_wav2vec2_pretraining_no_trainer.py` for training a new Wav2Vec2 model from scratch, but I wonder why it is (explicitly) not using the Trainer API? Is there a particular reason?
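Just to be explicit about what I mean by "from scratch": building the model from a config with randomly initialized weights rather than from a checkpoint, e.g. (a minimal sketch; the default config values are only for illustration):

```python
from transformers import Wav2Vec2Config, Wav2Vec2ForPreTraining

# Randomly initialized model built from a config -- no pretrained checkpoint involved.
config = Wav2Vec2Config()
model = Wav2Vec2ForPreTraining(config)
```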

Motivation

After running into out-of-memory issues during Wav2Vec2 trainings, I figured it would be better to use a smaller model for this purpose. Since training an end-to-end model using Wav2Vec2 requires multiple stages, I thought it would be better to start with a simple Speech2Text transformer model and continue from there. However, so far I have been unable to run a training properly. For some reason the word error rate is basically 0% from the start, only to get worse over time to the point where the model is not predicting anything anymore. I have no explanation for this, but you can take a look at the code that (in a sense) brought me here.

Code (click to expand)

```python
import json
import os
from dataclasses import dataclass
from functools import partial
from typing import List, Dict, Union

import numpy as np
import sentencepiece as spm
import tensorflow as tf
import torch
import tqdm
from torch.nn.utils.rnn import pad_sequence
from torch.utils.data import IterableDataset
from transformers import TrainingArguments, Trainer, trainer_utils, Speech2TextTokenizer, Speech2TextFeatureExtractor, \
    Speech2TextProcessor, Speech2TextConfig, Speech2TextModel, Speech2TextForConditionalGeneration, Seq2SeqTrainer, \
    IntervalStrategy, EarlyStoppingCallback, Seq2SeqTrainingArguments

from .speech.bin.hf_train import get_dataset, get_preprocessor
from .speech.data.speech_dataset import SpeechRecognitionDatasets
from .speech import bin as binaries
from .speech.lab.training.metrics import error_rate


class Speech2TextTFDataset(IterableDataset):
    def __init__(self, processor: Speech2TextProcessor, text_preprocessor, dataset: tf.data.Dataset,
                 num_samples: int = None):
        self.processor = processor
        self.text_preprocessor = text_preprocessor
        self.dataset = dataset
        self.num_samples = num_samples

    def __len__(self):
        if self.num_samples is None:
            raise RuntimeError("Number of samples is unknown.")
        return self.num_samples

    def __getitem__(self, item):
        raise NotImplementedError

    def __iter__(self):
        for example in self.dataset:
            inputs = example["inputs"]
            targets = example["targets"].numpy()[0].decode()
            targets = self.text_preprocessor.preprocess(targets)
            sampling_rate = self.processor.feature_extractor.sampling_rate
            # Extract features & target labels
            audio_features = self.processor.feature_extractor(inputs, sampling_rate=sampling_rate)["input_features"][0]
            labels = self.processor.tokenizer.encode(targets)
            size, _ = audio_features.shape
            attention_mask = torch.ones(size)
            yield dict(inputs=audio_features, targets=labels, attention_mask=attention_mask)

    @classmethod
    def get_split(cls, processor, text_preprocessor, datasets: SpeechRecognitionDatasets, split: str,
                  max_samples=None):
        dataset = datasets.get(split, load_noise=False)
        if split == "train":
            dataset = dataset.repeat()
        if max_samples is not None:
            dataset = dataset.take(max_samples)
        num_samples = datasets.get_num_speech_samples(split)
        return cls(processor, text_preprocessor, dataset, num_samples=num_samples)


@dataclass
class Speech2TextCollator:
    def __init__(self, processor: Speech2TextProcessor):
        self.processor = processor

    def __call__(self, features: List[Dict[str, Union[List[int], torch.Tensor]]]) -> Dict[str, torch.Tensor]:
        inputs = [torch.Tensor(f["inputs"]) for f in features]
        targets = [torch.Tensor(f["targets"]) for f in features]
        # Create batches
        inputs_batch = pad_sequence(inputs, batch_first=True)
        targets_batch = pad_sequence(targets, batch_first=True).long()
        attention_mask = pad_sequence([f["attention_mask"] for f in features], batch_first=True).long()
        return dict(
            input_features=inputs_batch,
            # decoder_input_ids=targets_batch,
            attention_mask=attention_mask,
            labels=targets_batch
        )


def compute_metrics(processor: Speech2TextProcessor, pred):
    # pred_logits = pred.predictions
    pred_ids = np.argmax(pred.predictions[0], axis=-1)
    pred_str = processor.batch_decode(pred_ids)
    # we do not want to group tokens when computing the metrics
    label_str = processor.batch_decode(pred.label_ids, group_tokens=False)
    wer = error_rate(targets=label_str, predictions=pred_str, tokens="words")
    cer = error_rate(targets=label_str, predictions=pred_str, tokens="characters")
    return {"wer": wer, "cer": cer}


def get_sentence_piece_model(sentence_generator, text_preprocessor, overwrite=False):
    model_prefix = "/tmp/en"
    vocab_file = model_prefix + ".json"
    spm_file = model_prefix + ".model"
    if os.path.exists(vocab_file) and os.path.exists(spm_file) and not overwrite:
        return vocab_file, spm_file
    text_fp = "/tmp/spm.txt"
    with open(text_fp, "w") as f:
        for sentence in sentence_generator():
            text = sentence.strip()
            text = text_preprocessor.preprocess(text)
            f.write(text)
            f.write("\n")
    spm.SentencePieceTrainer.Train(
        input=text_fp,
        vocab_size=1000,
        model_prefix=model_prefix,
        user_defined_symbols=[""],
        # hard_vocab_limit=False,
    )
    processor = spm.SentencePieceProcessor()
    processor.Load(model_file=model_prefix + ".model")
    vocab_file = model_prefix + ".json"
    spm_file = model_prefix + ".model"
    # noinspection PyUnresolvedReferences
    vocab = {processor.id_to_piece(piece_id): piece_id for piece_id in range(processor.get_piece_size())}
    with open(vocab_file, "w") as f:
        json.dump(vocab, f, indent=2)
    return vocab_file, spm_file


def main():
    # TODO Paths!
    local_raw_root = "/data/-asr/corpora/raw"
    local_shards_root = "/data/-asr/corpora/sharded"
    # TODO Paths!
    remote_raw_root = "/mariana/asr/raw"
    remote_shards_root = "/mariana/asr/corpora/sharded"
    remote_converted_root = "/mariana/asr/corpora/converted"
    remote_vocabs_root = "/mariana/asr/vocabularies/masking"

    tf.config.set_visible_devices([], "GPU")

    out_dir = "/data/-asr/models/huggingface/dev"
    log_dir = os.path.join(out_dir, "logs")
    config_fp = os.path.join(os.path.dirname(binaries.__file__), "configs/data/en/timit.yml")
    early_stopping_patience = 5

    print(f"data config: {config_fp}")

    with tf.device("cpu"):
        asr_datasets = get_dataset(
            config_fp=config_fp,
            local_raw_root=local_raw_root,
            local_shards_root=local_shards_root,
            remote_raw_root=remote_raw_root,
            remote_shards_root=remote_shards_root,
            remote_converted_root=remote_converted_root,
        )

    vocab_config_fp = os.path.join(
        os.path.dirname(binaries.__file__), f"configs/vocabulary/{asr_datasets.language}.yml"
    )

    text_preprocessor = get_preprocessor(
        remote_vocabs_root=remote_vocabs_root,
        vocab_config_fp=vocab_config_fp,
        asr_datasets=asr_datasets
    )

    sampling_rate = 16_000
    max_vocab_samples = 100000

    def sentence_generator():
        for i, example in tqdm.tqdm(enumerate(asr_datasets.get("train", load_noise=False)), total=max_vocab_samples):
            if i >= max_vocab_samples:
                break
            targets = example["targets"].numpy()[0].decode()
            yield text_preprocessor.preprocess(targets)

    vocab_file, spm_file = get_sentence_piece_model(
        sentence_generator=sentence_generator,
        text_preprocessor=text_preprocessor,
        overwrite=False
    )

    tokenizer = Speech2TextTokenizer(vocab_file=vocab_file, spm_file=spm_file)
    feature_extractor = Speech2TextFeatureExtractor(sampling_rate=sampling_rate)
    processor = Speech2TextProcessor(feature_extractor=feature_extractor, tokenizer=tokenizer)

    save_and_eval_steps = 1

    training_args = Seq2SeqTrainingArguments(
        output_dir=out_dir,
        evaluation_strategy=IntervalStrategy("steps"),
        save_steps=save_and_eval_steps,
        eval_steps=save_and_eval_steps,
        num_train_epochs=3,
        per_device_train_batch_size=16,
        per_device_eval_batch_size=64,
        warmup_steps=500,
        weight_decay=0.01,
        logging_dir=log_dir,
        group_by_length=True,
        # label_smoothing_factor=1,
        load_best_model_at_end=True,
        save_total_limit=2,
    )

    # Create the model
    config = Speech2TextConfig(
        return_dict=True,
        sampling_rate=sampling_rate,
        vocab_size=tokenizer.vocab_size,
        pad_token_id=processor.tokenizer.pad_token_id,
        bos_token_id=processor.tokenizer.bos_token_id,
        eos_token_id=processor.tokenizer.eos_token_id,
        decoder_start_token_id=processor.tokenizer.bos_token_id,
    )

    model = Speech2TextForConditionalGeneration(config)
    # model = Speech2TextModel(config)
    model.train()

    train_dataset = Speech2TextTFDataset.get_split(
        processor=processor, text_preprocessor=text_preprocessor, datasets=asr_datasets, split="train"
    )
    eval_dataset = Speech2TextTFDataset.get_split(
        processor=processor, text_preprocessor=text_preprocessor, datasets=asr_datasets, split="dev", max_samples=3
    )

    trainer = Seq2SeqTrainer(
        model=model,
        args=training_args,
        train_dataset=train_dataset,
        eval_dataset=eval_dataset,
        data_collator=Speech2TextCollator(processor=processor),
        compute_metrics=partial(compute_metrics, processor),
        callbacks=[EarlyStoppingCallback(early_stopping_patience=early_stopping_patience)],
    )

    last_checkpoint = trainer_utils.get_last_checkpoint(out_dir)
    trainer.train(resume_from_checkpoint=last_checkpoint)

    print("All done.")


if __name__ == '__main__':
    main()
```
stefan-falk commented 2 years ago

@patrickvonplaten @anton-l Unfortunately I haven't gotten an answer to my post in the πŸ€— forum yet. Would you mind taking a look and maybe leaving some advice on this topic? Thanks for any help.

patrickvonplaten commented 2 years ago

Hey @stefan-falk,

Sorry to only reply now, and thanks for pinging me again. In general, training transformers speech models from scratch is really difficult, and I would strongly recommend leveraging pretrained checkpoints for an encoder-decoder setup.

Do you need to pretrain a model from scratch? What is your target task exactly? I will finish an encoder-decoder example script this week that shows how to leverage pretrained speech and text checkpoints for ASR, and I will try to have a Colab version as well so that the tutorial is easy to follow. I think this could help a lot - hopefully I'll be done by this week :-)
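The rough idea of the script (a minimal sketch; the checkpoint names below are just examples, not recommendations):

```python
from transformers import SpeechEncoderDecoderModel

# Warm-start an ASR model from a pretrained speech encoder and a pretrained
# text decoder; essentially only the cross-attention weights are newly initialized.
model = SpeechEncoderDecoderModel.from_encoder_decoder_pretrained(
    "facebook/wav2vec2-base-960h",  # example speech encoder
    "bert-base-uncased",            # example text decoder
)
```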

stefan-falk commented 2 years ago

Hi @patrickvonplaten !

No worries :)

Well, the reason I'd want to train a model from scratch is that I would like to do so on custom (and non-public) datasets in different languages as well. Wav2Vec2 is a nice-to-have at this point; right now I'd just be happy to be able to train any sensible model on a new dataset. In the end, the goal is to use this model on mobile devices.

patrickvonplaten commented 2 years ago

Hey @stefan-falk,

I see! I think even if the model only needs to work well on custom (non-public) datasets, it would still make sense to leverage general pretrained checkpoints. I'll try to have a working encoder-decoder example by the end of the week :-)

stefan-falk commented 2 years ago

Yeah, that's surely correct :)

It would be great to get an example for this! Please be so kind and ping me once it's available! :)

github-actions[bot] commented 2 years ago

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

stefan-falk commented 2 years ago

@patrickvonplaten Hi! Any news on the encoder-decoder example? :)

patrickvonplaten commented 2 years ago

We have an example here now: https://github.com/huggingface/transformers/tree/master/examples/pytorch/speech-recognition#sequence-to-sequence

stefan-falk commented 2 years ago

@patrickvonplaten Thanks, but this seems to be an example for fine-tuning and not training from scratch.

What I am looking for is a hands-on tutorial/example that shows how to, e.g., train a Speech2Text model from scratch.

The code I posted originally (see above) runs without crashing, but looking at TensorBoard I am rather convinced that there are still some issues.

It's not clear to me whether I have to use `model = Speech2TextForConditionalGeneration(config)` or `model = Speech2TextModel(config)`.
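My current guess, which I'd be happy to have confirmed (using the batch variables from my collator below):

```python
# My guess: Speech2TextModel is the bare encoder-decoder (hidden states only,
# no LM head), while Speech2TextForConditionalGeneration adds the LM head and
# returns a loss when `labels` is passed -- so training should use the latter.
model = Speech2TextForConditionalGeneration(config)
outputs = model(input_features=inputs_batch, attention_mask=attention_mask, labels=targets_batch)
loss = outputs.loss  # only available with the LM head
```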

Can I use the `Trainer` or the `Seq2SeqTrainer`?

Am I batching correctly:

```python
@dataclass
class Speech2TextCollator:
    def __init__(self, processor: Speech2TextProcessor):
        self.processor = processor

    def __call__(self, features: List[Dict[str, Union[List[int], torch.Tensor]]]) -> Dict[str, torch.Tensor]:
        inputs = [torch.Tensor(f["inputs"]) for f in features]
        targets = [torch.Tensor(f["targets"]) for f in features]
        # Create batches
        inputs_batch = pad_sequence(inputs, batch_first=True)
        targets_batch = pad_sequence(targets, batch_first=True).long()
        attention_mask = pad_sequence([f["attention_mask"] for f in features], batch_first=True).long()
        return dict(input_features=inputs_batch, attention_mask=attention_mask, labels=targets_batch)
```

and so on.

It would be great to have an example that guides one through details like this.
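One concrete detail I am unsure about: as far as I know, the models exclude label positions set to -100 from the loss, but `pad_sequence` pads with 0, which is a real token id. So maybe the collator needs something like this (untested):

```python
# Untested idea: pad labels with -100 instead of the default 0, so that padding
# positions are ignored by the loss (0 is a real token id in the vocabulary).
targets_batch = pad_sequence(targets, batch_first=True, padding_value=-100).long()
```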

If I run the code I wrote what I get is something like this:

*(screenshot: TensorBoard training metrics)*

patrickvonplaten commented 2 years ago

I see - sorry, we don't have any examples yet on how to train an encoder-decoder from scratch for ASR. I also don't think it's a good idea, given how well leveraging pretrained speech and text checkpoints works.

stefan-falk commented 2 years ago

@patrickvonplaten Okay, I see. The issue here is just that I am now reliant on the availability of pre-trained models in all the languages I want to support. For example, facebook/wav2vec2-base was only trained on English, which probably does not help for languages like Chinese. Going for the cross-lingual model is also not an option due to its size.

github-actions[bot] commented 2 years ago

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

patrickvonplaten commented 2 years ago

There are also the XLS-R checkpoints, which have been pretrained on 128 languages :-) https://huggingface.co/models?other=xls_r_pretrained
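Loading one of them for fine-tuning is essentially a one-liner (a sketch; the 300M checkpoint and the vocab size are just examples):

```python
from transformers import Wav2Vec2ForCTC

# XLS-R encoder pretrained multilingually; the CTC head on top is newly
# initialized and still needs fine-tuning with a target-language tokenizer.
model = Wav2Vec2ForCTC.from_pretrained(
    "facebook/wav2vec2-xls-r-300m",  # 1B and 2B variants also exist
    vocab_size=300,                  # hypothetical: len(your_tokenizer)
)
```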

github-actions[bot] commented 2 years ago

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

mosesdaudu001 commented 3 months ago

Hey @stefan-falk

Any luck on getting the script to pretrain a Speech2Text model from scratch?