Closed: stefan-falk closed this issue 2 years ago.
@patrickvonplaten @anton-l unfortunately I didn't get an answer to my post in the 🤗 forum yet. Would you mind taking a look and maybe leaving some advice on this topic? Thanks for any help.
Hey @stefan-falk,
Sorry to only reply now & thanks for pinging me again. In general, training speech transformer models from scratch is really difficult, and I would strongly recommend leveraging pretrained checkpoints in an encoder-decoder setup.
Do you need to pretrain a model from scratch? What is your target task exactly? I will finish an encoder-decoder example script this week which shows how to leverage pretrained speech and text checkpoints for ASR, and I'll try to have a Colab version as well so that it's easy to follow the tutorial. I think this could help a lot - hopefully I'll be done by the end of the week :-)
Hi @patrickvonplaten !
No worries :)
Well, the reason I'd want to train a model from scratch is that I would like to do so on custom (and non-public) datasets in different languages as well. Wav2Vec2 is a nice-to-have at this point. Right now I'd just be happy to be able to train any sensible model on a new dataset. In the end, the goal is to use this model on mobile devices.
Hey @stefan-falk,
I see! I think even if the model only needs to work well on custom (non-public) datasets, it would still make sense to leverage general pre-trained checkpoints. I'll try to have a working encoder-decoder example by the end of the week :-)
Yeah, that's surely correct :)
It would be great to get an example for this! Please be so kind and ping me once it's available! :)
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.
Please note that issues that do not follow the contributing guidelines are likely to be ignored.
@patrickvonplaten Hi! Is there any news on the encoder-decoder example? :)
We have an example here now: https://github.com/huggingface/transformers/tree/master/examples/pytorch/speech-recognition#sequence-to-sequence
@patrickvonplaten Thanks, but this seems to be an example for fine-tuning, not for training from scratch.
What I am looking for is a hands-on tutorial/example that shows how I can e.g. train a `Speech2Text` model from scratch.
The code I posted originally (see above) runs without crashing, but looking at TensorBoard I am rather convinced that there are still some issues.
It's not clear to me whether I have to use `model = Speech2TextForConditionalGeneration(config)` or `model = Speech2TextModel(config)`. Can I use the `Trainer` or the `Seq2SeqTrainer`?
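To make the question more concrete, this is roughly the setup I have in mind (just a sketch; I'm not sure these are the right classes or arguments, and the datasets/processor are only placeholders):

```python
from transformers import (
    Speech2TextConfig,
    Speech2TextForConditionalGeneration,
    Seq2SeqTrainer,
    Seq2SeqTrainingArguments,
)

# Randomly initialised model from a config, i.e. no pretrained weights.
# I assume Speech2TextForConditionalGeneration is the right class, since it
# adds the LM head that computes the seq2seq loss from the `labels` I pass in.
config = Speech2TextConfig(vocab_size=1000)  # vocab size just as an example
model = Speech2TextForConditionalGeneration(config)

training_args = Seq2SeqTrainingArguments(
    output_dir="/tmp/speech2text-from-scratch",
    per_device_train_batch_size=8,
    predict_with_generate=True,
)

trainer = Seq2SeqTrainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,                    # placeholder: the Speech2TextTFDataset from my original post
    eval_dataset=eval_dataset,                      # placeholder: the corresponding dev split
    data_collator=Speech2TextCollator(processor),   # the collator shown below
)
trainer.train()
```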
Am I batching correctly:
```python
@dataclass
class Speech2TextCollator:

    def __init__(self, processor: Speech2TextProcessor):
        self.processor = processor

    def __call__(self, features: List[Dict[str, Union[List[int], torch.Tensor]]]) -> Dict[str, torch.Tensor]:
        inputs = [torch.Tensor(f["inputs"]) for f in features]
        targets = [torch.Tensor(f["targets"]) for f in features]
        # Create batches
        inputs_batch = pad_sequence(inputs, batch_first=True)
        targets_batch = pad_sequence(targets, batch_first=True).long()
        attention_mask = pad_sequence([f["attention_mask"] for f in features], batch_first=True).long()
        return dict(input_features=inputs_batch, attention_mask=attention_mask, labels=targets_batch)
```
and so on.
It would be great to have an example that guides one through details like this.
If I run the code I wrote, what I get is something like this:
I see - sorry, we don't have any examples on how to train an encoder-decoder model from scratch for ASR yet. I also don't think it's a good idea, given how well it works to leverage pretrained speech and text checkpoints.
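The rough idea would be something like this (just a sketch; the checkpoint names are only examples):

```python
from transformers import SpeechEncoderDecoderModel, AutoFeatureExtractor, AutoTokenizer

# Warm-start a speech encoder-decoder model from a pretrained speech encoder
# and a pretrained text decoder instead of training everything from scratch.
# The checkpoints below are just examples.
model = SpeechEncoderDecoderModel.from_encoder_decoder_pretrained(
    "facebook/wav2vec2-base", "bert-base-uncased"
)

feature_extractor = AutoFeatureExtractor.from_pretrained("facebook/wav2vec2-base")
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# These have to be set explicitly before fine-tuning the model for ASR.
model.config.decoder_start_token_id = tokenizer.cls_token_id
model.config.pad_token_id = tokenizer.pad_token_id
```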
@patrickvonplaten Okay, I see. The issue here is just that I am now reliant on the availability of pre-trained models in all the languages I want to support. For example, `facebook/wav2vec2-base` was only trained on English, which probably does not help for languages like Chinese. Going for the cross-lingual model is also not an option due to its size.
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.
Please note that issues that do not follow the contributing guidelines are likely to be ignored.
There are also the XLS-R checkpoints which have been pretrained on over 128 languages :-) https://huggingface.co/models?other=xls_r_pretrained
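For example (just an illustration; the checkpoint name is one of the XLS-R models behind the link above):

```python
from transformers import Wav2Vec2Model

# Multilingual XLS-R encoder pretrained on 128 languages; it could also be
# plugged in as the encoder of the encoder-decoder setup sketched above.
encoder = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-xls-r-300m")
```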
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.
Please note that issues that do not follow the contributing guidelines are likely to be ignored.
Hey @stefan-falk,
Any luck on getting the script to pretrain a `Speech2Text` model from scratch?
🚀 Feature request
Fine-tuning is rather straightforward, but it looks to me as if running a training from scratch isn't. I am rather new to 🤗, but from what I've learned so far, it's rather tricky to find out how to start a new `Speech2Text` training (for example).

We got `run_wav2vec2_pretraining_no_trainer.py` in order to train a new `Wav2Vec2` model from scratch, but I wonder why this is (explicitly) not using the `Trainer` API? Is there any particular reason?

Motivation

After running into out-of-memory issues during `Wav2Vec2` trainings, I figured it would be better to use a smaller model for this purpose. Since training an end-to-end model using `Wav2Vec2` requires multiple stages, I thought it would be better to start with a simple `Speech2Text` transformer model and continue from there. However, up until now I have been unable to properly run a training. For some reason the word error rate is basically 0% from the start, only to get worse over time to the point where the model is not predicting anything anymore. I have no explanation for this, but you can take a look at the code that (in a sense) brought me here.

Code (click to expand)
```python
import json
import os
from dataclasses import dataclass
from functools import partial
from typing import List, Dict, Union

import torch
import tqdm
from torch.nn.utils.rnn import pad_sequence
from torch.utils.data import IterableDataset
from transformers import TrainingArguments, Trainer, trainer_utils, Speech2TextTokenizer, Speech2TextFeatureExtractor, \
    Speech2TextProcessor, Speech2TextConfig, Speech2TextModel, Speech2TextForConditionalGeneration, Seq2SeqTrainer, \
    IntervalStrategy, EarlyStoppingCallback, Seq2SeqTrainingArguments
import sentencepiece as spm
import tensorflow as tf

from .speech.bin.hf_train import get_dataset, get_preprocessor
from .speech.data.speech_dataset import SpeechRecognitionDatasets
from .speech import bin as binaries
from .speech.lab.training.metrics import error_rate
import numpy as np


class Speech2TextTFDataset(IterableDataset):

    def __init__(self, processor: Speech2TextProcessor, text_preprocessor, dataset: tf.data.Dataset, num_samples: int = None):
        self.processor = processor
        self.text_preprocessor = text_preprocessor
        self.dataset = dataset
        self.num_samples = num_samples

    def __len__(self):
        if self.num_samples is None:
            raise RuntimeError("Number of samples is unknown.")
        return self.num_samples

    def __getitem__(self, item):
        raise NotImplementedError

    def __iter__(self):
        for example in self.dataset:
            inputs = example["inputs"]
            targets = example["targets"].numpy()[0].decode()
            targets = self.text_preprocessor.preprocess(targets)
            sampling_rate = self.processor.feature_extractor.sampling_rate
            # Extract features & target labels
            audio_features = self.processor.feature_extractor(inputs, sampling_rate=sampling_rate)["input_features"][0]
            labels = self.processor.tokenizer.encode(targets)
            size, _ = audio_features.shape
            attention_mask = torch.ones(size)
            yield dict(inputs=audio_features, targets=labels, attention_mask=attention_mask)

    @classmethod
    def get_split(cls, processor, text_preprocessor, datasets: SpeechRecognitionDatasets, split: str, max_samples=None):
        dataset = datasets.get(split, load_noise=False)
        if split == "train":
            dataset = dataset.repeat()
        if max_samples is not None:
            dataset = dataset.take(max_samples)
        num_samples = datasets.get_num_speech_samples(split)
        return cls(processor, text_preprocessor, dataset, num_samples=num_samples)


@dataclass
class Speech2TextCollator:

    def __init__(self, processor: Speech2TextProcessor):
        self.processor = processor

    def __call__(self, features: List[Dict[str, Union[List[int], torch.Tensor]]]) -> Dict[str, torch.Tensor]:
        inputs = [torch.Tensor(f["inputs"]) for f in features]
        targets = [torch.Tensor(f["targets"]) for f in features]
        # Create batches
        inputs_batch = pad_sequence(inputs, batch_first=True)
        targets_batch = pad_sequence(targets, batch_first=True).long()
        attention_mask = pad_sequence([f["attention_mask"] for f in features], batch_first=True).long()
        return dict(
            input_features=inputs_batch,
            # decoder_input_ids=targets_batch,
            attention_mask=attention_mask,
            labels=targets_batch
        )


def compute_metrics(processor: Speech2TextProcessor, pred):
    # pred_logits = pred.predictions
    pred_ids = np.argmax(pred.predictions[0], axis=-1)
    pred_str = processor.batch_decode(pred_ids)
    # we do not want to group tokens when computing the metrics
    label_str = processor.batch_decode(pred.label_ids, group_tokens=False)
    wer = error_rate(targets=label_str, predictions=pred_str, tokens="words")
    cer = error_rate(targets=label_str, predictions=pred_str, tokens="characters")
    return {"wer": wer, "cer": cer}


def get_sentence_piece_model(sentence_generator, text_preprocessor, overwrite=False):
    model_prefix = "/tmp/en"
    vocab_file = model_prefix + ".json"
    spm_file = model_prefix + ".model"
    if os.path.exists(vocab_file) and os.path.exists(spm_file) and not overwrite:
        return vocab_file, spm_file
    text_fp = "/tmp/spm.txt"
    with open(text_fp, "w") as f:
        for sentence in sentence_generator():
            text = sentence.strip()
            text = text_preprocessor.preprocess(text)
            f.write(text)
            f.write("\n")
    spm.SentencePieceTrainer.Train(
        input=text_fp,
        vocab_size=1000,
        model_prefix=model_prefix,
        user_defined_symbols=["
```