pytorch / serve

Serve, optimize and scale PyTorch models in production
https://pytorch.org/serve/
Apache License 2.0

Any examples on serving Speech2Text models from Huggingface, such as Wav2Vec2? #1656

Closed thangld201 closed 1 year ago

thangld201 commented 2 years ago

🚀 The feature

As far as I know, there are no examples or documentation on serving Speech2Text models from Huggingface, such as Wav2Vec2. How could I enable serving with Wav2Vec2 Huggingface pre-trained checkpoints?

Motivation, pitch

I'm working on deploying Huggingface Wav2Vec2 models, and would like serving for these models to be made possible. Thank you!

Alternatives

No response

Additional context

No response

thangld201 commented 2 years ago

@msaroufim Could you help me?

msaroufim commented 2 years ago

So we have support for Huggingface text models but haven't yet added an audio example. You're welcome to contribute it, but if you'd like me to prioritize it, could you please elaborate on what you'd use it for?

thangld201 commented 2 years ago

@msaroufim Thank you for the explanation. I'm trying to deploy Huggingface Wav2Vec2 models for serving with TorchServe, but haven't found a way to do this, as there is apparently no related documentation. Could you show me an approach that could work? Sorry, I'm not familiar with the APIs. (I'm working on a Speech2Text service.)

thangld201 commented 2 years ago

The model receives float tensors as inputs and outputs logits, which are then decoded into human-readable text. Since TorchServe provides support for Huggingface language models, which also receive tensors as inputs, I figured there should be a workaround for Wav2Vec2.

msaroufim commented 2 years ago

So here's the list of changes I'd expect:

I also need to better understand how the data format works here for a dataset. Specifically, what is the type of sample, and can we call something like cast_column on a non-HuggingFace dataset?

import datasets
from datasets import load_dataset
from transformers import Wav2Vec2FeatureExtractor, Wav2Vec2ForCTC

# hypothetical checkpoint, just to make the snippet runnable
feature_extractor = Wav2Vec2FeatureExtractor.from_pretrained("facebook/wav2vec2-base-960h")
model = Wav2Vec2ForCTC.from_pretrained("facebook/wav2vec2-base-960h")

dataset = load_dataset("common_voice", "en", split="train", streaming=True)
dataset = dataset.cast_column("audio", datasets.Audio(sampling_rate=16_000))
dataset_iter = iter(dataset)
sample = next(dataset_iter)

# forward sample through model to get greedily predicted transcription ids
input_values = feature_extractor(sample["audio"]["array"], sampling_rate=16_000, return_tensors="pt").input_values
logits = model(input_values).logits[0]

And we may need to add support for audio formats here, but since it's just binary data we may be able to avoid this: https://github.com/pytorch/serve/blob/master/ts/protocol/otf_message_handler.py#L318
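
For example, since the payload is just binary data, a client could POST the raw bytes straight to the TorchServe inference API. A minimal sketch, assuming a model registered under the name wav2vec2 on the default port (both the name and the file are illustrative):

import requests

# "wav2vec2" and clip.wav are placeholder assumptions
with open("clip.wav", "rb") as f:
    response = requests.post("http://localhost:8080/predictions/wav2vec2", data=f.read())
print(response.text)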

In all cases, like I said, the fastest approach would be to just experiment with extending the HuggingFace handler we provide. Handlers are not special, they are just regular Python files, which you can extend like this: https://github.com/pytorch/serve/issues/1538
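
To make that concrete, here is a minimal sketch of what such a custom handler could look like. It is untested; the checkpoint name, the assumption that the request body carries raw audio bytes at 16 kHz, and the use of soundfile for decoding are all illustrative choices, not part of the existing TorchServe examples:

import io

import soundfile as sf
import torch
from ts.torch_handler.base_handler import BaseHandler
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

class Wav2Vec2Handler(BaseHandler):
    def initialize(self, context):
        # hypothetical checkpoint; swap in your own fine-tuned model
        self.processor = Wav2Vec2Processor.from_pretrained("facebook/wav2vec2-base-960h")
        self.model = Wav2Vec2ForCTC.from_pretrained("facebook/wav2vec2-base-960h")
        self.model.eval()
        self.initialized = True

    def preprocess(self, data):
        # TorchServe passes a list of requests; the payload is raw audio bytes
        audio_bytes = data[0].get("data") or data[0].get("body")
        array, sampling_rate = sf.read(io.BytesIO(audio_bytes))
        # assumes 16 kHz mono input; resampling is omitted from this sketch
        return self.processor(array, sampling_rate=16_000, return_tensors="pt").input_values

    def inference(self, input_values):
        with torch.no_grad():
            return self.model(input_values).logits

    def postprocess(self, logits):
        # greedy CTC decoding back to text; one string per request
        pred_ids = torch.argmax(logits, dim=-1)
        return self.processor.batch_decode(pred_ids)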

thangld201 commented 2 years ago

@msaroufim The sample above is a dict and may look like the following. For inference we only need the ['audio']['array'] field.

{'accent': 'us',
 'age': 'fourties',
 'audio': {'array': array([0.        , 0.        , 0.        , ..., 0.00019884, 0.00045272,
         0.00025881], dtype=float32),
  'path': 'cv-corpus-6.1-2020-12-11/en/clips/common_voice_en_100038.mp3',
  'sampling_rate': 16000},
 'client_id': '04960d53cc851eeb6d93f21a09e09ab36fe16943acb226ced1211d7250ab2f1b9a1d655c1cc03d50006e396010851ad52d4c53f49dd77b080b01c4230704c68d',
 'down_votes': 0,
 'gender': 'male',
 'locale': 'en',
 'path': 'common_voice_en_100038.mp3',
 'segment': "''",
 'sentence': 'Why does Melissandre look like she wants to consume Jon Snow on the ride up the wall?',
 'up_votes': 2}

In short, the following code should suffice:

import datasets
import torch
from datasets import load_dataset
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

# hypothetical checkpoint, just to make the snippet runnable
processor = Wav2Vec2Processor.from_pretrained("facebook/wav2vec2-base-960h")
feature_extractor = processor.feature_extractor
model = Wav2Vec2ForCTC.from_pretrained("facebook/wav2vec2-base-960h")

dataset = load_dataset("common_voice", "en", split="train", streaming=True)
dataset = dataset.cast_column("audio", datasets.Audio(sampling_rate=16000))  # simply resamples audio to 16 kHz; this can be treated as a preprocessing step
dataset_iter = iter(dataset)
sample = next(dataset_iter)

# forward sample through model to get greedily predicted transcription ids
with torch.no_grad():
    input_values = feature_extractor(sample["audio"]["array"], sampling_rate=16000, return_tensors="pt").input_values  # float tensor
    logits = model(input_values).logits
    pred_ids = torch.argmax(logits, dim=-1)
    pred_str = processor.batch_decode(pred_ids)[0]
    print(pred_str)

The cast_column function is only available on the HuggingFace datasets class, but it is fairly simple to implement a similar function yourself; it's just a preprocessing step, since Wav2Vec2 only accepts 16 kHz audio as input.
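
For instance, a minimal sketch of that preprocessing step without HuggingFace datasets, using torchaudio (the file path is a placeholder):

import torchaudio

# load an audio file and resample it to the 16 kHz rate Wav2Vec2 expects
waveform, sample_rate = torchaudio.load("clip.wav")
if sample_rate != 16000:
    waveform = torchaudio.functional.resample(waveform, sample_rate, 16000)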

Again, thank you for the explanation! In the meantime, I will try extending the available handlers.

yscho0806 commented 1 year ago

Hi. Any updates on this issue? I am also trying to serve wav2vec2 with TorchServe.