hydrusbeta / hay_say_ui

A unified, browser-based interface for pony voice generation
Apache License 2.0

CLI support #15

Open kronuschan opened 7 months ago

kronuschan commented 7 months ago

Hey there, is it possible to interface with the models through CLI? If so, how do I go about doing that? I'd like to generate audio programmatically if possible.

hydrusbeta commented 7 months ago

Hello kronuschan,

It is possible on a local installation, though a bit complex and limited. If you are trying to programmatically generate audio with my server at https://haysay.ai, however, I cannot provide a reliable API.

Why there's no API for the public server:

The Hay Say UI was built using Dash (https://dash.plotly.com), which constructs both the WSGI server and the client code from a single codebase. I don't explicitly code how the client talks to the server - that is managed internally by Dash. I looked online and through the documentation but could not find any info on the interface Dash generates for client-to-server communication, so, unfortunately, we can't simply use the Dash-generated API (well, we could, but it would take some time to figure out the details and I'm sure it's subject to change).

The Local Installation approach:

On local installations, you can punch a hole through Docker and call the generate() web service method I created for each architecture. First, you will need to identify the ID and Name of the main container. Execute the following command:

docker container ls

Under the "IMAGE" column, one of the rows will have the value "hydrusbeta/hay_say:hay_say_ui". Note the values under the CONTAINER ID and NAMES columns - you will need those in a moment. Here's what it looks like on my machine: image
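
As a side note, if you want to grab those values in a script rather than copying them by hand, Docker's standard --filter and --format flags can pull them out directly. This is plain Docker CLI usage, nothing specific to Hay Say:

# Capture the main container's name and ID. --filter ancestor matches
# containers started from the given image; --format extracts one field.
HAY_SAY_NAME=$(docker container ls --filter "ancestor=hydrusbeta/hay_say:hay_say_ui" --format "{{.Names}}")
HAY_SAY_ID=$(docker container ls --filter "ancestor=hydrusbeta/hay_say:hay_say_ui" --format "{{.ID}}")
echo "Name: $HAY_SAY_NAME  ID: $HAY_SAY_ID"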

Each architecture has a generate() webservice method you can call with an HTTP POST from within the main container:

http://controllable_talknet_server:6574/generate
http://so_vits_svc_3_server:6575/generate
http://so_vits_svc_4_server:6576/generate
http://so_vits_svc_5_server:6577/generate
http://rvc_server:6578/generate
http://styletts_2_server:6580/generate
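
As a quick sanity check (my suggestion, not a built-in Hay Say feature), you can probe each endpoint from inside the main container. A plain GET will most likely be rejected with an error status like 405, since these endpoints expect POST, but getting any HTTP status back confirms the service is up (a connection failure prints 000). The container name below is the one from my machine:

# Probe each architecture's server; any HTTP status code in the response
# (even an error like 405 Method Not Allowed) means the service is reachable.
for url in http://controllable_talknet_server:6574/generate \
           http://so_vits_svc_3_server:6575/generate \
           http://so_vits_svc_4_server:6576/generate \
           http://so_vits_svc_5_server:6577/generate \
           http://rvc_server:6578/generate \
           http://styletts_2_server:6580/generate
do
  docker exec hay_say_ui-hay_say_ui-1 curl -s -o /dev/null -w "%{http_code} $url\n" "$url"
done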

One way to call these methods is to use curl. To execute any command inside the main Docker container, prefix it with docker exec -it <value from the NAMES column>. For a concrete example, the following will generate audio on my local machine using Controllable TalkNet:

docker exec -it hay_say_ui-hay_say_ui-1 curl -X POST -H "Content-Type:application/json" --data \
"{ \
  \"Inputs\": { \
    \"User Text\": \"I am the very model of a pony major general\", \
    \"User Audio\": null \
  }, \
  \"Options\": { \
    \"Disable Reference Audio\": true, \
    \"Character\": \"Apple Bloom\", \
    \"Pitch Factor\": 0, \
    \"Auto Tune\": false, \
    \"Reduce Metallic Sound\": false \
  }, \
  \"Output File\": \"my_output\", \
  \"GPU ID\": \"none\", \
  \"Session ID\": null \
}" \
http://controllable_talknet_server:6574/generate
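
As a side note, you can avoid escaping every quotation mark by piping the JSON in over stdin. This is plain shell and curl technique, nothing specific to Hay Say: docker exec -i forwards stdin into the container, and curl's --data @- reads the request body from stdin. The same TalkNet request then looks like this:

docker exec -i hay_say_ui-hay_say_ui-1 curl -X POST -H "Content-Type:application/json" --data @- \
http://controllable_talknet_server:6574/generate <<'EOF'
{
  "Inputs": {
    "User Text": "I am the very model of a pony major general",
    "User Audio": null
  },
  "Options": {
    "Disable Reference Audio": true,
    "Character": "Apple Bloom",
    "Pitch Factor": 0,
    "Auto Tune": false,
    "Reduce Metallic Sound": false
  },
  "Output File": "my_output",
  "GPU ID": "none",
  "Session ID": null
}
EOF

(Note that -t is dropped in favor of -i alone, since allocating a pseudo-TTY would interfere with piped input.)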

The JSON you need to pass in the body of the request is a little different for each webservice method. You can find the JSON schemas in the parse_inputs method of each architecture's server codebase:

Controllable TalkNet: https://github.com/hydrusbeta/controllable_talknet_server/blob/main/main.py
so-vits-svc-3: https://github.com/hydrusbeta/so_vits_svc_3_server/blob/main/main.py
so-vits-svc-4: https://github.com/hydrusbeta/so_vits_svc_4_server/blob/main/main.py
so-vits-svc-5: https://github.com/hydrusbeta/so_vits_svc_5_server/blob/main/main.py
RVC: https://github.com/hydrusbeta/rvc_server/blob/main/main.py
StyleTTS2: https://github.com/hydrusbeta/styletts_2_server/blob/main/main.py

Once the audio is generated, you need to copy it out of the Docker container. You can do that with docker cp. The form of the command is like this:

docker cp <CONTAINER_ID>:<path_on_container> <path_on_local_computer>

For <path_on_container>, use /home/luna/hay_say/audio_cache/output/<output_file_you_specified_in_the_json>.flac. So, for example, I can copy the generated file out of the container to my Desktop with:

docker cp 984e08912658:/home/luna/hay_say/audio_cache/output/my_output.flac ~/Desktop/
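
Putting the pieces together, here is a minimal sketch of a script that generates audio and immediately copies it out, using the stdin trick from above. The container name, container ID, and file name are the ones from my machine; substitute your own:

#!/usr/bin/env bash
# Sketch: generate audio with Controllable TalkNet, then copy the result
# out of the container. The values below are from my machine; replace them.
set -e
CONTAINER_NAME=hay_say_ui-hay_say_ui-1
CONTAINER_ID=984e08912658
OUTPUT_NAME=my_output

docker exec -i "$CONTAINER_NAME" curl -s -X POST -H "Content-Type:application/json" --data @- \
http://controllable_talknet_server:6574/generate <<EOF
{
  "Inputs": { "User Text": "I am the very model of a pony major general", "User Audio": null },
  "Options": { "Disable Reference Audio": true, "Character": "Apple Bloom", "Pitch Factor": 0,
               "Auto Tune": false, "Reduce Metallic Sound": false },
  "Output File": "$OUTPUT_NAME",
  "GPU ID": "none",
  "Session ID": null
}
EOF

docker cp "$CONTAINER_ID:/home/luna/hay_say/audio_cache/output/$OUTPUT_NAME.flac" ~/Desktop/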

For architectures that require reference audio, you can likewise use docker cp to copy an audio file into the main Docker container. The file must be in flac format (see the conversion note after these steps). Place it in /home/luna/hay_say/audio_cache/preprocessed/. For example, here's how I would generate audio with so-vits-svc 5:

  1. Copy the reference audio into the container
    docker cp ~/Desktop/my_input.flac 984e08912658:/home/luna/hay_say/audio_cache/preprocessed/my_input.flac
  2. Execute the generate() webservice method
    docker exec -it hay_say_ui-hay_say_ui-1 curl -X POST -H "Content-Type:application/json" --data \
    "{ \
      \"Inputs\": { \
        \"User Audio\": \"my_input\" \
      }, \
      \"Options\": { \
        \"Pitch Shift\": 12, \
        \"Character\": \"Diamond Tiara\" \
      }, \
      \"Output File\": \"my_output_2\", \
      \"GPU ID\": \"none\", \
      \"Session ID\": null \
    }" \
    http://so_vits_svc_5_server:6577/generate
  3. Copy the generated audio out
    docker cp 984e08912658:/home/luna/hay_say/audio_cache/output/my_output_2.flac ~/Desktop/
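
If your reference audio isn't already a flac file, ffmpeg (a separate install, not bundled with Hay Say) can convert most formats before you copy the file in:

# Convert a wav (or most other audio formats) to flac.
ffmpeg -i ~/Desktop/my_input.wav ~/Desktop/my_input.flac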

Sorry, I know this hasn't been a great answer. Hay Say wasn't originally designed with a programmatic API in mind. Things should be much better in the future, however. I wasn't planning on announcing it just yet, but since it's so relevant to this topic... I am in the midst of a complete rewrite of Hay Say from the ground up. Unlike its predecessor, Hay Say 2.0 will be designed from the start with a programmatic API, with the backend written in the Django Ninja framework (which automatically generates OpenAPI/Swagger documentation). It's at least a few months away, but I wanted to let you know that it's coming.