@msaroufim Could you help me?
So we have support for HuggingFace text models but haven't yet added an audio example. Feel free to contribute one, but if you'd like me to prioritize it, could you please elaborate on what you'd use it for?
@msaroufim Thank you for the explanation. I'm trying to deploy HuggingFace Wav2Vec2 models for serving with TorchServe, but haven't found a way to do this, as there is apparently no related documentation. Could you show me one way or another that could work? Sorry that I'm not familiar with the APIs. (I'm working on a speech-to-text service.)
The model receives float tensors as inputs and outputs logits, which are then decoded to human-readable text. Since TorchServe provides support for HuggingFace language models, which also receive tensors as inputs, I figured there should be a workaround for Wav2Vec2.
So the list of changes I'd expect are:

- `AutoModelForCTC` support here: https://github.com/pytorch/serve/blob/master/examples/Huggingface_Transformers/Transformer_handler_generalized.py#L80
- I also need to better understand how the data format works here for a dataset, specifically what the type of `sample` is, and whether we can call something like `cast_column` on a non-HuggingFace dataset:
```python
import datasets
from datasets import load_dataset

# feature_extractor and model are assumed to be loaded elsewhere (e.g. AutoModelForCTC)
dataset = load_dataset("common_voice", "en", split="train", streaming=True)
dataset = dataset.cast_column("audio", datasets.Audio(sampling_rate=16_000))
dataset_iter = iter(dataset)
sample = next(dataset_iter)

# forward sample through model to get greedily predicted transcription ids
input_values = feature_extractor(sample["audio"]["array"], return_tensors="pt").input_values
logits = model(input_values).logits[0]
```
And we may need to add support for audio formats here, but if it's just binary data then we may be able to avoid this: https://github.com/pytorch/serve/blob/master/ts/protocol/otf_message_handler.py#L318
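To illustrate the binary route, here is a hedged client-side sketch. The model name `wav2vec2`, the endpoint, and the raw-float32 payload convention are all assumptions to be matched by a custom handler (see the sketch after the next paragraph), not an existing API:

```python
# Hedged sketch: POST a raw 16 kHz float32 waveform as binary data to a
# hypothetical TorchServe endpoint. The model name "wav2vec2" and the
# payload convention are assumptions, matching the handler sketch below.
import numpy as np
import requests

waveform = np.asarray(sample["audio"]["array"], dtype=np.float32)  # sample from the snippet above
resp = requests.post(
    "http://localhost:8080/predictions/wav2vec2",
    data=waveform.tobytes(),
    headers={"Content-Type": "application/octet-stream"},
)
print(resp.text)  # transcription string produced by the handler
```

Sending the raw waveform keeps the server side simple; resampling to 16 kHz would then happen on the client.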
In all cases, like I said, the fastest route would be to just experiment with extending the HuggingFace handler we provide. Handlers are not special; they are just regular Python files which you can extend like this: https://github.com/pytorch/serve/issues/1538
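For concreteness, here is a minimal sketch of what such an extended handler might look like, assuming a Wav2Vec2 checkpoint inside the model archive and the raw-float32 payload from the client sketch above. The class name and payload format are assumptions, not part of TorchServe:

```python
# Minimal sketch of a custom TorchServe handler for Wav2Vec2. Assumes the
# model archive contains files loadable with Wav2Vec2Processor/Wav2Vec2ForCTC.
import numpy as np
import torch
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor
from ts.torch_handler.base_handler import BaseHandler


class Wav2Vec2Handler(BaseHandler):  # hypothetical class name
    def initialize(self, context):
        model_dir = context.system_properties.get("model_dir")
        self.device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
        self.processor = Wav2Vec2Processor.from_pretrained(model_dir)
        self.model = Wav2Vec2ForCTC.from_pretrained(model_dir).to(self.device).eval()
        self.initialized = True

    def preprocess(self, data):
        # Assumption: each request body is a raw float32 waveform, already
        # resampled to 16 kHz on the client side.
        waveforms = []
        for row in data:
            body = row.get("data") or row.get("body")
            waveforms.append(np.frombuffer(bytes(body), dtype=np.float32))
        batch = self.processor(waveforms, sampling_rate=16_000, return_tensors="pt", padding=True)
        return batch.input_values.to(self.device)

    def inference(self, input_values):
        with torch.no_grad():
            logits = self.model(input_values).logits
        return torch.argmax(logits, dim=-1)

    def postprocess(self, pred_ids):
        # One transcription string per request in the batch
        return self.processor.batch_decode(pred_ids)
```

The archive itself can then be built the usual way with `torch-model-archiver`, pointing `--handler` at this file and shipping the processor/vocabulary files via `--extra-files`.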
@msaroufim The `sample` above is a dict and may look like the following, but for inference we only need the `['audio']['array']` field:
```python
{'accent': 'us',
 'age': 'fourties',
 'audio': {'array': array([0.        , 0.        , 0.        , ..., 0.00019884, 0.00045272,
        0.00025881], dtype=float32),
  'path': 'cv-corpus-6.1-2020-12-11/en/clips/common_voice_en_100038.mp3',
  'sampling_rate': 16000},
 'client_id': '04960d53cc851eeb6d93f21a09e09ab36fe16943acb226ced1211d7250ab2f1b9a1d655c1cc03d50006e396010851ad52d4c53f49dd77b080b01c4230704c68d',
 'down_votes': 0,
 'gender': 'male',
 'locale': 'en',
 'path': 'common_voice_en_100038.mp3',
 'segment': "''",
 'sentence': 'Why does Melissandre look like she wants to consume Jon Snow on the ride up the wall?',
 'up_votes': 2}
```
In short, the following code should suffice:
```python
import datasets, torch
from datasets import load_dataset

# processor, feature_extractor and model are assumed to be loaded,
# e.g. transformers' Wav2Vec2Processor and Wav2Vec2ForCTC
dataset = load_dataset("common_voice", "en", split="train", streaming=True)
dataset = dataset.cast_column("audio", datasets.Audio(sampling_rate=16000))  # resample audio to 16 kHz; a simple preprocessing step
dataset_iter = iter(dataset)
sample = next(dataset_iter)

# forward sample through model to get greedily predicted transcription ids
with torch.no_grad():
    input_values = feature_extractor(sample["audio"]["array"], sampling_rate=16000, return_tensors="pt").input_values  # float tensor
    logits = model(input_values).logits
    pred_ids = torch.argmax(logits, dim=-1)
pred_str = processor.batch_decode(pred_ids)[0]
print(pred_str)
```
The `cast_column` function is available only for the HuggingFace `datasets` class, but it is fairly simple to implement a similar function; it is only a preprocessing step, since Wav2Vec2 only accepts 16 kHz audio as input.
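For example, here is a minimal sketch of such a preprocessing step without HuggingFace datasets, assuming torchaudio is available (the file path is a placeholder):

```python
# Minimal sketch: resample an arbitrary audio file to 16 kHz without
# HuggingFace datasets, mirroring cast_column(..., Audio(sampling_rate=16_000)).
import torchaudio

waveform, orig_sr = torchaudio.load("sample.wav")  # "sample.wav" is a placeholder path
if orig_sr != 16_000:
    waveform = torchaudio.functional.resample(waveform, orig_freq=orig_sr, new_freq=16_000)
array = waveform.squeeze(0).numpy()  # mono float array, analogous to sample["audio"]["array"]
```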
Again, thank you for the explanation! In the meantime, I will try extending the available handlers.
Hi, any updates on this issue? I am also trying to serve wav2vec2 with TorchServe.
🚀 The feature
As far as I know, there are no examples or documentation on serving speech-to-text models from HuggingFace, such as Wav2Vec2. How could I enable serving with Wav2Vec2 HuggingFace pre-trained checkpoints?
Motivation, pitch
I'm working on deploying HuggingFace Wav2Vec2 models and would like serving of these models to be made possible. Thank you!
Alternatives
No response
Additional context
No response