[Help]: MultipleContentsSVC Whisper feature extraction

open-mmlab / Amphion

Amphion (/æmˈfaɪən/) is a toolkit for Audio, Music, and Speech Generation. Its purpose is to support reproducible research and help junior researchers and engineers get started in the field of audio, music, and speech generation research and development.

https://openhlt.github.io/amphion/

MIT License

4.45k stars, 379 forks

[Help]: MultipleContentsSVC Whisper feature extraction #127

Closed. Darius-H closed this issue 7 months ago

Darius-H commented 7 months ago

In MultipleContentsSVC, Whisper pads or truncates the original audio (say n seconds long, n < 30) to 30 s, so the extracted feature has shape (batch, 1500, 1024). Should we just truncate the feature to the part that corresponds to real audio, i.e. feature = feature[:, :int(1500 / 30 * n), :]?

Adorable-Qin commented 7 months ago

Hi @Darius-H!

Amphion only stores the valid frames of the feature, i.e. what you described: feature[:, :valid frames, :]. So you don't have to worry about the zero padding: we remove the padded frames after extracting the features, and then save the compressed features to files.
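For readers who want to reproduce the trimming themselves, here is a minimal sketch of the slicing discussed above. It is not Amphion's actual code: the function names `valid_frame_count` and `trim_to_valid_frames` are hypothetical, and a zero-filled NumPy array stands in for a real Whisper encoder output. It assumes the figures from the question (a 30 s window mapped to 1500 frames, i.e. 50 frames per second).

```python
import numpy as np

# Figures taken from the question: Whisper pads/truncates audio to 30 s,
# and the resulting feature has 1500 frames (50 frames per second).
WHISPER_WINDOW_SECONDS = 30
WHISPER_ENCODER_FRAMES = 1500


def valid_frame_count(duration_seconds: float) -> int:
    """Number of frames that cover real audio (hypothetical helper)."""
    frames_per_second = WHISPER_ENCODER_FRAMES / WHISPER_WINDOW_SECONDS
    n_frames = int(frames_per_second * duration_seconds)
    # Audio longer than the window is truncated, so cap at the window size.
    return min(n_frames, WHISPER_ENCODER_FRAMES)


def trim_to_valid_frames(feature: np.ndarray, duration_seconds: float) -> np.ndarray:
    """Drop the frames that only cover padding (hypothetical helper).

    `feature` is expected to have shape (batch, frames, dim).
    """
    return feature[:, :valid_frame_count(duration_seconds), :]


# Stand-in for a padded feature batch of 12-second clips.
padded = np.zeros((2, 1500, 1024), dtype=np.float32)
trimmed = trim_to_valid_frames(padded, 12.0)
print(trimmed.shape)  # (2, 600, 1024)
```

This sketch applies a single duration to the whole batch. If the clips in a batch differ in length, each item would need its own frame count, so trimming per utterance before batching is the simpler choice.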

If you have any other questions, feel free to contact us!

RMSnow commented 7 months ago

Hi @Darius-H, if you have any further questions about Whisper features, feel free to re-open this issue. We are glad to follow up!