steveway / papagayo-ng

Papagayo is a lip-syncing program designed to help you line up phonemes (mouth shapes) with the actual recorded sound of actors speaking. Papagayo makes it easy to lip sync animated characters by making the process very simple - just type in the words being spoken (or copy/paste them from the animation's script), then drag the words on top of the sound's waveform until they line up with the proper sounds.
http://steveway.github.io/papagayo-ng/

Using Transformers and Wav2Vec for automatic generation #41

Open steveway opened 1 year ago

steveway commented 1 year ago

So, machine learning has gotten quite a boost over the past few months. I did some testing, and as an alternative to allosaurus we might be able to use wav2vec. Here is an example which seems to be able to get the CMU phonemes we use, with timestamps, using a Wav2Vec2 model from Huggingface Transformers:

  from transformers import Wav2Vec2Processor, Wav2Vec2ForCTC
  import torch
  import soundfile as sf

  # load model and processor
  processor = Wav2Vec2Processor.from_pretrained("vitouphy/wav2vec2-xls-r-300m-timit-phoneme")
  model = Wav2Vec2ForCTC.from_pretrained("vitouphy/wav2vec2-xls-r-300m-timit-phoneme")

  # Read and process the input (the model expects 16 kHz mono audio,
  # so resample the file first if it uses a different rate)
  audio_input, sample_rate = sf.read("./Tutorial Files/lame.wav")
  inputs = processor(audio_input, sampling_rate=16_000, return_tensors="pt", padding=True)

  with torch.no_grad():
      logits = model(inputs.input_values, attention_mask=inputs.attention_mask).logits

  # Decode ids into strings, keeping per-character offsets for the timestamps
  predicted_ids = torch.argmax(logits, dim=-1)
  predicted_sentences = processor.batch_decode(predicted_ids, output_char_offsets=True)
  # One logit frame corresponds to this many seconds of audio
  time_offset = model.config.inputs_to_logits_ratio / 16000
  print(predicted_sentences)
  # Map the model's IPA output symbols to the CMU-style phonemes Papagayo uses
  ipa_to_cmu = {
      "b": "B",
      "ʧ": "CH",
      "d": "D",
      "ð": "DH",
      "f": "F",
      "g": "G",
      "h": "H",
      "ʤ": "JH",
      "k": "K",
      "l": "L",
      "m": "M",
      "ŋ": "NG",
      "n": "NG",
      "p": "P",
      "r": "R",
      "s": "S",
      "ʃ": "SH",
      "t": "T",
      "θ": "TH",
      "v": "V",
      "w": "W",
      "j": "Y",
      "z": "Z",
      "ʒ": "ZH",
      "ɑ": "AA",
      "æ": "AE",
      "ə": "AH",
      "ʌ": "AH",
      "ɔ": "AO",
      "ɛ": "EH",
      "ɚ": "ER",
      "ɝ": "ER",
      "ɪ": "IH",
      "i": "IY",
      "ʊ": "UH",
      "u": "UW",
      "aʊ": "AW",
      "aɪ": "AY",
      "eɪ": "EY",
      "oʊ": "OW",
      "o": "OW",
      "ɔɪ": "OY",
      "e": "EH",
      "a": "AA",
      "ʔ": "rest",
      "ɒ": "AO",
      "ɯ": "UW",
      "ɹ": "R",
      "ɾ": "R",
      "ɹ̩": "ER",
      "ɻ": "R",
      "-": "rest",
      "ɡ": "G",
      "x": "N",
      "d͡ʒ": "JH",
      "t͡ʃ": "CH"
  }
  cmu_phones = ""
  cmu_list = []
  print(predicted_sentences.char_offsets)
  for char in predicted_sentences.char_offsets[0]:
      if char["char"] in ipa_to_cmu:
          cmu_phones += ipa_to_cmu[char["char"]] + " "
          # Convert the CTC frame offsets into seconds
          cmu_list.append({"char": ipa_to_cmu[char["char"]],
                           "start_time": char["start_offset"] * time_offset,
                           "end_time": char["end_offset"] * time_offset})
      else:
          print("missing:", char["char"])
  print(cmu_list)

This would be using this model here: https://huggingface.co/vitouphy/wav2vec2-xls-r-300m-timit-phoneme

And if we are already using transformers and wav2vec, then we could use that at the same time to get human-readable text. For that we could even use OpenAI Whisper; the results are good, but it does not include phonemes or timestamps afaik. This might be good too, if we can extract timestamps from it: https://huggingface.co/bookbot/wav2vec2-ljspeech-gruut
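A rough sketch of how the readable-transcript part could look with the openai-whisper package (the model size and file path are just placeholders, and this is untested in Papagayo itself):

  # Hypothetical companion step: get a plain-text transcript with OpenAI Whisper.
  # Assumes "pip install openai-whisper" and ffmpeg being available on the PATH.
  import whisper

  whisper_model = whisper.load_model("base")  # model size is a placeholder
  result = whisper_model.transcribe("./Tutorial Files/lame.wav")
  print(result["text"])  # human-readable text for the word line, no phoneme timestamps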

Hunanbean commented 1 year ago

Should I expand the phoneme/viseme set from 39 to 44? I just found this: https://www.dyslexia-reading-well.com/44-phonemes-in-english.html, and I notice that you are having to reuse the same phonemes a couple of times to cover the bases. I am pretty sure I could add these to the face sets and the MHX block for MH.

EDIT: I will start working on these. Perhaps they will be of use.

steveway commented 1 year ago

Sure, that might be useful for some. Also, some of those AI models seem to be trained on the TIMIT phoneme set, which has 61 phonemes (https://catalog.ldc.upenn.edu/docs/LDC93S1/). Having a preset for that might be nice, but I'm not sure about the practical applications. Good conversions to the other sets with fewer phonemes should be useful.
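To illustrate what such a conversion could look like, here is a partial, untested sketch of folding TIMIT phones down to the smaller CMU-style set we already target. Only a handful of the 61 phones are shown; the choices follow the commonly used TIMIT folding rules, with silences mapped to Papagayo's rest frame:

  # Illustrative only: a partial TIMIT-61 to CMU-style folding table.
  # A real preset would cover all 61 TIMIT phones.
  timit_to_cmu = {
      "ix": "IH",     # fronted schwa folds into IH
      "ax": "AH",     # schwa folds into AH
      "axr": "ER",    # r-colored schwa folds into ER
      "ux": "UW",
      "hv": "HH",     # voiced h folds into HH
      "el": "L",      # syllabic l
      "en": "N",      # syllabic n
      "pau": "rest",  # pause
      "epi": "rest",  # epenthetic silence
      "h#": "rest",   # leading/trailing silence
  }

  def fold_timit_phones(phones):
      # Phones not listed above keep their name, upper-cased, e.g. "aa" -> "AA"
      return [timit_to_cmu.get(p, p.upper()) for p in phones]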

Hunanbean commented 1 year ago

I think that makes a lot more sense than trying to chase phonemes/visemes that are just not different enough to notice.