sensein / senselab

senselab is a Python package that simplifies building pipelines for biometric (e.g. speech, voice, video, etc) analysis.
http://sensein.group/senselab/
Apache License 2.0

Task: StyleTTS2 Support and Dockerized Functionality #143

Open wilke0818 opened 3 months ago

wilke0818 commented 3 months ago

Description

Currently, StyleTTS2 cannot be supported through TorchHub as originally thought, because it depends on espeak, which is a system dependency rather than a Python dependency. Code that tries to integrate StyleTTS with TorchHub already exists, but its tests fail if the local environment doesn't have espeak installed (this wasn't noticed originally because @wilke0818 had espeak installed locally).

This issue also raises the idea of a more general approach to incorporating functionality that can't be integrated directly as a Python dependency, namely by using Pydra and Docker containers.
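
As a rough illustration of the container idea, the sketch below builds a `docker run` command that bind-mounts a working directory and calls a system tool inside an image. It uses plain `subprocess` instead of Pydra, and the image name `example/espeak-tts:latest`, the function names, and the `/data` mount point are all hypothetical, not part of senselab.

```python
import shlex
import subprocess
from pathlib import Path


def build_docker_command(image, tool_args, workdir, mount_point="/data"):
    """Return the argv list for running `tool_args` inside `image`.

    The host directory `workdir` is bind-mounted at `mount_point` so the
    containerized tool can read inputs and write outputs there.
    """
    host_dir = Path(workdir).resolve()
    return [
        "docker", "run", "--rm",
        "-v", f"{host_dir}:{mount_point}",
        "-w", mount_point,
        image,
        *tool_args,
    ]


def run_in_container(image, tool_args, workdir):
    """Run the command and return its stdout (needs a local Docker daemon)."""
    cmd = build_docker_command(image, tool_args, workdir)
    result = subprocess.run(cmd, capture_output=True, text=True, check=True)
    return result.stdout


if __name__ == "__main__":
    # Hypothetical image that ships espeak; the command is only printed here.
    cmd = build_docker_command(
        "example/espeak-tts:latest",
        ["espeak", "--version"],
        workdir=".",
    )
    print(shlex.join(cmd))
```

Keeping the command builder separate from the runner means the command can be inspected or unit-tested on machines that have neither Docker nor espeak installed.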

Tasks

Freeform Notes

No response

fabiocat93 commented 1 week ago

@wilke0818 do you mind writing 2 lines here on why this was paused/why we (at least temporarily) gave up on this? thanks!

satra commented 1 week ago

cc:ing @900miles as he is playing with styletts2 currently.

900miles commented 1 week ago

Looks like we can remove the dependency on espeak with 2 forked packages, but they report lower quality than the one using espeak. This would also remove GPL-licensed code if that is relevant.

https://pypi.org/project/styletts2/ https://github.com/NeuralVox/StyleTTS2

fabiocat93 commented 1 week ago

any chance we can listen to some audio clips and evaluate the quality ourselves?

900miles commented 1 week ago

Yeah working on that right now

wilke0818 commented 1 week ago

I've also used their Colabs with my own uploaded audio and found issues similar to those of other TTS models: high-pitched screeching and output that generally doesn't match the target audio: https://colab.research.google.com/github/yl4579/StyleTTS2/blob/main/Colab/StyleTTS2_Demo_LibriTTS.ipynb

900miles commented 1 week ago

Weird, I've had really good results with that demo. What target voices are you using?

wilke0818 commented 1 week ago

Fabio's

900miles commented 1 week ago

With the pip package (the second voice is Bob Ross). audio_tests.zip https://github.com/user-attachments/files/17820270/audio_tests.zip

fabiocat93 commented 1 week ago

these sound pretty good. are they generated with or without espeak?

900miles commented 1 week ago

Without

fabiocat93 commented 1 week ago

wow. i would vote for @900miles 's idea to integrate styletts2 without espeak. objections?

wilke0818 commented 1 week ago

Check your environment to make sure that espeak isn't installed or being used. I thought I didn't have it when I first developed the StyleTTS API, but it was there; I only found out through Colab, I think.
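
One way to run that check is sketched below. It looks for espeak executables on `PATH` and for two related traces. The assumptions here are that `phonemizer` is the Python wrapper that would call espeak and that it reads the `PHONEMIZER_ESPEAK_LIBRARY` environment variable; the function names are hypothetical.

```python
import importlib.util
import os
import shutil


def find_espeak(binaries=("espeak", "espeak-ng")):
    """Return a dict describing espeak traces found in this environment."""
    return {
        # Full path of each executable reachable through PATH, or None.
        "binaries": {name: shutil.which(name) for name in binaries},
        # Assumption: phonemizer is the wrapper that would invoke espeak.
        "phonemizer_installed": importlib.util.find_spec("phonemizer") is not None,
        # Assumption: this variable can point phonemizer at an espeak library.
        "library_override": os.environ.get("PHONEMIZER_ESPEAK_LIBRARY"),
    }


def espeak_available(report):
    """True if an espeak executable or a library override was found."""
    return bool(any(report["binaries"].values()) or report["library_override"])


if __name__ == "__main__":
    report = find_espeak()
    print(report)
    print("espeak available:", espeak_available(report))
```

A negative result only shows that nothing was found in the places checked; a shared library installed elsewhere on the system would not be detected by this sketch.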

900miles commented 1 week ago

From the pypi page:

Currently using MIT-licensed gruut as the IPA phoneme converter. Found it to be the best alternative to phoneme converters based on espeak

It sounds like one motivation for this fork of the original package is to not use espeak because it is GPL licensed. So that shouldn't be an issue.

Two things I will mention before committing to this package: it has a lot of dependencies (although many probably overlap with other senselab dependencies), and there is an open pull request to "fix: high-severity vulnerability in nltk 3.8.1" that hasn't had any activity since September, so I don't think it is being actively developed. To clarify, it is also not by the same authors as Style-TTS2, which, if I remember correctly, was a consideration when integrating the whisperx pypi package into senselab, as that had the same issue.
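
To put a number on the dependency overlap, a small sketch like the following could compare the declared requirements of two installed distributions using the standard library's `importlib.metadata`. The helper names are hypothetical, and only direct (not transitive) dependencies are compared.

```python
import re
from importlib import metadata


def requirement_names(requirements):
    """Extract normalized project names from requirement strings."""
    names = set()
    for req in requirements or []:
        # The project name is the leading run of name characters.
        match = re.match(r"\s*([A-Za-z0-9][A-Za-z0-9._-]*)", req)
        if match:
            # Treat runs of "-", "_" and "." as equivalent and ignore case.
            names.add(re.sub(r"[-_.]+", "-", match.group(1)).lower())
    return names


def shared_dependencies(dist_a, dist_b):
    """Return (shared, only_in_a) direct dependency names.

    Both distributions must be installed; otherwise
    importlib.metadata raises PackageNotFoundError.
    """
    deps_a = requirement_names(metadata.requires(dist_a))
    deps_b = requirement_names(metadata.requires(dist_b))
    return deps_a & deps_b, deps_a - deps_b
```

With both packages installed in the same environment, `shared_dependencies("styletts2", "senselab")` would return the dependencies the two have in common and the ones that only the fork would add.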

fabiocat93 commented 1 week ago

Indeed, we should be careful with the packages we integrate. It might be best to get the best of both solutions. We can start by testing @wilke0818's solution with the same audio clips you tried (@900miles) to check whether it provides results of similar quality. If it does, we can proceed with that implementation and avoid using espeak.