gradio-app / gradio

Build and share delightful machine learning apps, all in Python. 🌟 Star to support our work!
http://www.gradio.app
Apache License 2.0

Audio waveform #2706

Closed dawoodkhan82 closed 1 year ago

dawoodkhan82 commented 1 year ago

Description

Add a utility function that creates a video when supplied with audio and image files.

https://user-images.githubusercontent.com/12725292/203449101-39a603be-fe8f-4ca5-9b24-1d06ad2db812.mov

Test:

def audio_waveform(audio, image):
    print(audio.name)
    return utils.audio_to_video(audio.name, image.name)

gr.Interface(audio_waveform, inputs=[gr.Audio(type="file"), gr.Image(type="file")], outputs=gr.Video()).launch()

A note about the CHANGELOG

Hello 👋 and thank you for contributing to Gradio!

All pull requests must update the change log located in CHANGELOG.md, unless the pull request is labeled with the "no-changelog-update" label.

Please add a brief summary of the change to the Upcoming Release > Full Changelog section of the CHANGELOG.md file and include a link to the PR (formatted in markdown) and a link to your github profile (if you like). For example, "* Added a cool new feature by [@myusername](link-to-your-github-profile) in [PR 11111](https://github.com/gradio-app/gradio/pull/11111)".

If you would like to elaborate on your change further, feel free to include a longer explanation in the other sections. If you would like an image/gif/video showcasing your feature, it may be best to edit the CHANGELOG file using the GitHub web UI since that lets you upload files directly via drag-and-drop.

github-actions[bot] commented 1 year ago

All the demos for this PR have been deployed at https://huggingface.co/spaces/gradio-pr-deploys/pr-2706-all-demos

dawoodkhan82 commented 1 year ago

thoughts on adding this as a flag to the audio or video component?

dawoodkhan82 commented 1 year ago

Also tried to make this look prettier, but I'm limited by what can be done with ffmpeg. (https://trac.ffmpeg.org/wiki/Waveform)

abidlabs commented 1 year ago

Awesome @dawoodkhan82! A couple of thoughts:

  • What about adding an image parameter to the Audio component, which, if supplied, makes the Audio component appear like this?
  • The waveform looks a little jagged / sharp to me. Is it possible to make it a little more discrete by putting little bars instead? Here are a couple of examples of what imo looks a little bit nicer:

[example waveform images]
abidlabs commented 1 year ago

Oh just saw your comment about ffmpeg's limitation, that's okay then imo! cc @aliabid94 @pngwn @gary149 for your thoughts as well!

dawoodkhan82 commented 1 year ago

Awesome @dawoodkhan82! A couple of thoughts:

  • What about adding an image parameter to the Audio component, which, if supplied, makes the Audio component appear like this?
  • The waveform looks a little jagged / sharp to me. Is it possible to make it a little more discrete by putting little bars instead? Here are a couple of examples of what imo looks a little bit nicer:

[example waveform images]

That would require our frontend audio component to also be a video component. Which is fine imo.

pngwn commented 1 year ago

That would require our frontend audio component to also be a video component. Which is fine imo.

I disagree; I think we just change the backend to generate a video component in the config when those options are passed. I don't want the audio component being a video component as well: that doesn't make any sense and means we'd need to ship more unnecessary code (at least until we have more granular on-demand loading).

abidlabs commented 1 year ago

We shouldn’t change the component type in the config — among other things, it would break Interface.load(), which recreates an Interface based on its config. But would it be possible in this case for the Audio component to “import” the code for the Video component and pass along the video in the frontend to avoid duplicating code?

pngwn commented 1 year ago

Duplicate code isn't the issue, unused code is. The video code would be shipped to anyone using the Audio component, even if they weren't using the 'to video' functionality. We can fix this in the future when we support more granular on-demand loading.

aliabid94 commented 1 year ago

So I've spent a lot of time looking into this, and this is where I'm at.

If we use FFMPEG's built-in waveform filter, we get the waveform @dawoodkhan82 posted above. It takes a somewhat reasonable time to run (maybe 5 seconds for 30 seconds of audio), but it doesn't look very good; I think it actually looks bad enough to hurt virality. I played around with the FFMPEG arguments and couldn't get it looking much better, though if I can at least make the waveform symmetric across the horizontal axis through some clever FFMPEG manipulation, it might look a little better.
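For reference, the built-in approach boils down to something like this (a rough sketch, not the exact command used here, assuming ffmpeg is on the PATH; showwaves is ffmpeg's waveform filter and mode=cline is its centered, symmetric variant):

import subprocess

# Draw a waveform video straight from the audio using ffmpeg's showwaves filter.
subprocess.run([
    "ffmpeg", "-y", "-i", "audio.wav",
    "-filter_complex", "[0:a]showwaves=s=1280x200:mode=cline:colors=orange[v]",
    "-map", "[v]", "-map", "0:a",
    "-pix_fmt", "yuv420p", "waveform.mp4",
], check=True)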

An alternative is to write some custom Python code to generate a custom waveform. I looked into this and was able to get a very nice-looking animated waveform. However, this is much slower, requiring about half a second of processing per second of audio, so it would take about 18 seconds of extra processing to render a waveform video for a 30-second audio clip. This is obviously unacceptable. The reason it's so slow is that we have to save an image for every single frame of the video, and then use FFMPEG to compile all the images into a single video.
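To give a sense of that per-frame approach, here is a rough sketch (not the actual code from this PR; it assumes scipy and matplotlib are installed and ffmpeg is on the PATH):

import subprocess
import numpy as np
import matplotlib.pyplot as plt
from scipy.io import wavfile

def render_waveform_video(audio_path, out_path="waveform.mp4", fps=10, bars=60):
    rate, samples = wavfile.read(audio_path)
    if samples.ndim > 1:
        samples = samples.mean(axis=1)  # mix stereo down to mono
    total_frames = int(len(samples) / rate * fps)
    envelope = np.abs(samples)
    # Height of each bar = peak amplitude within its slice of the audio.
    heights = [
        envelope[i * len(samples) // bars:(i + 1) * len(samples) // bars].max()
        for i in range(bars)
    ]
    for frame in range(total_frames):
        played = int(bars * frame / total_frames)  # bars already "played" by this frame
        colors = ["orange" if i <= played else "gray" for i in range(bars)]
        plt.figure(figsize=(8, 2))
        plt.bar(range(bars), heights, color=colors)
        plt.axis("off")
        plt.savefig(f"frame_{frame:05d}.png")
        plt.close()
    # Stitch the frames together and mux in the original audio.
    subprocess.run([
        "ffmpeg", "-y", "-framerate", str(fps), "-i", "frame_%05d.png",
        "-i", audio_path, "-shortest", "-pix_fmt", "yuv420p", out_path,
    ], check=True)

Almost all of the time goes into writing one PNG per frame, which is exactly the bottleneck described above.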

A third solution would be a video made from a single static image, with a static waveform representing the whole audio clip. Nothing would animate over the course of playback, but rendering would be extremely fast, we could render any custom waveform logic, and it would still be a shareable video. However, a static image is obviously not as aesthetically pleasing as an animation.

Let me know what you guys think.

abidlabs commented 1 year ago

I am inclined towards option 3, as I think it's best to keep things simple and not add too much inference time. If we could have a single audio waveform representing the entire audio file overlaid on top of a static image, and it looks nice, let's go for it. Can you share an example of what that would look like?

Maybe we could add some flair on top of that, e.g. as the video plays, it could color in the part of the waveform that you are currently at. But again, I would only do that if it's fairly simple and if it improves aesthetics/shareability.

freddyaboulton commented 1 year ago

I also think option 3 is the way to go, but seeing examples of all three would help evaluate the trade-offs!

aliabid94 commented 1 year ago

Okay, here it is.

For a simple waveform:

import gradio as gr

def audio_waveform(audio):
    return gr.Waveform(audio)

gr.Interface(
    audio_waveform, gr.Audio(type="filepath"), gr.Audio()
).launch()

which creates (click play on the recordings below):

Recording 2022-12-06 at 17 02 28

You can customize the waveform visuals in many ways, such as adding a background image, changing the number and styling of the bars, etc.

Example:

import gradio as gr

def audio_waveform(audio, image):
    return gr.Waveform(audio, bg_image=image)

gr.Interface(
    audio_waveform,
    inputs=[gr.Audio(type="filepath"), gr.Image(type="filepath")],
    outputs=gr.Audio(),
).launch()

Recording 2022-12-06 at 17 03 21

It is very fast, ~1s to render the video. This is because the video is simply an image (the static waveform) plus the audio, so it is light and quick to render. The playback animation seen above, with the gray bar crossing the video, is actually rendered via JS, so it will not exist as part of any copied / shared video.
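(For the curious, muxing a single still image with the audio is enough to get a shareable video; roughly something like the following, sketched here with ffmpeg rather than the exact command in this PR:)

import subprocess

# Combine one still waveform image with the audio into a lightweight video.
subprocess.run([
    "ffmpeg", "-y",
    "-loop", "1", "-i", "waveform.png",  # loop the single frame for the whole duration
    "-i", "audio.wav",
    "-c:v", "libx264", "-tune", "stillimage",
    "-c:a", "aac", "-shortest", "-pix_fmt", "yuv420p",
    "waveform.mp4",
], check=True)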

We needed to create a separate frontend Waveform.svelte component to render the playback animation.

abidlabs commented 1 year ago

I'm a little confused about this Python API:

import gradio as gr

def audio_waveform(audio, image):
    return gr.Waveform(audio, bg_image=image)

gr.Interface(
    audio_waveform,
    inputs=[gr.Audio(type="filepath"), gr.Image(type="filepath")],
    outputs=gr.Audio(),
).launch()

Why are you returning a gr.Waveform from the function? Shouldn't the function return gr.update(audio, bg_image=image)?

github-actions[bot] commented 1 year ago

The demo notebooks don't match the run.py files. Please run this command from the root of the repo and then commit the changes:

pip install nbformat && cd demo && python generate_notebooks.py
aliabid94 commented 1 year ago

Okay, made the animation part of the ffmpeg video generation. The default way of configuring a waveform is by setting it up in the Audio component constructor, e.g.

gr.Audio(waveform=True)

or

gr.Audio(waveform=gr.Waveform(bar_color="green", bar_count="200", bg_image="cat.jpg"))

However, if you wish to update the waveform in the function, you cannot use gr.update(), because we currently do not have anywhere to store updated config values for a session in the backend, only the frontend. So to return an updated waveform (e.g. to change the bg_image), you must return a new waveform:

return gr.Waveform(audio="audio.wav", bg_image="cat.jpg")

demo/waveform/run.py has examples of all of these. See the recording below.

Recording 2022-12-09 at 00 37 37

abidlabs commented 1 year ago

Looks beautiful @aliabid94! Taking a closer look, but I wanted to clarify what you meant by this comment:

however if you wish to update the waveform in the function, you cannot use gr.update() because we currently do not have anywhere we store updated config values for a session in the backend, only the frontend

Why do you need to update anything on the backend?

abidlabs commented 1 year ago

Trying it out and I'm not sure if I am understanding the Python API correctly, but shouldn't this produce a waveform:

with gr.Blocks() as demo:
    gr.Audio("cantina.wav", waveform=True)

demo.launch()

It doesn't for me

abidlabs commented 1 year ago

Sharing my thoughts on the Python API in this PR, I think it's a little bit confusing and overcomplicated for the most common use cases. Here's my suggestion for how to make it easier, while still retaining the cool functionality that you've included here:

Use case 1: someone wants to make a quick and dirty waveform to display a static audio file with a background image.

Suggested API:

with gr.Blocks() as demo:
    gr.Audio("test.wav", image="stars.png")
demo.launch()

In other words, if an image is supplied, then a waveform is automatically created with default parameters

Use case 2: you want to show a generated audio file with a fixed image

gr.Interface(fn, "text", gr.Audio(image="stars.png")).launch()

So image is a parameter like any other parameter and gets saved to the config

Use case 3: you want to show a generated audio file with randomly selected images


def fn(text):
  ....
  return gr.update("audio.wav", image=random_image)

gr.Interface(fn, "text", "audio").launch()

So you can update image like any other parameter in the config

Use case 4: you want to customize the waveform

This is an advanced use case so it's a little more complex:

gr.Interface(fn, "text", gr.Audio(image="stars.png", waveform=gr.Waveform(fg_alpha=0.5))).launch()

But this is the only case you need to know about the gr.Waveform class.

The other advantage of this API is that the types are a lot clearer. In the current API, value can be a lot of different things and it's quite confusing.


All this being said, the actual functionality is fantastic -- great waveforms & animation is preserved when downloaded. Happy to do some more testing once we converge on the API

github-actions[bot] commented 1 year ago

The demo notebooks don't match the run.py files. Please run this command from the root of the repo and then commit the changes:

pip install nbformat && cd demo && python generate_notebooks.py
aliabid94 commented 1 year ago

Okay update:

You can set waveform to True, a background image, or a waveform object. Added the ability to save configuration updates in the backend, so we now support gr.update with waveform.

Example code showing all possible features:

def audio_waveform(audio, image):
    return (
        audio,
        audio,
        audio,
        audio,
        gr.Audio.update(
            value=audio,
            waveform=gr.Waveform(bg_image=image, bars_color=random.choice(COLORS)),
        ),
    )

gr.Interface(
    audio_waveform,
    inputs=[gr.Audio(type="filepath"), gr.Image(type="filepath")],
    outputs=[
        gr.Audio(), # basic output
        gr.Audio(waveform=True), # autogenerated waveform
        gr.Audio(waveform="drake.jpg"), # waveform with set background image
        gr.Audio(
            waveform=gr.Waveform( # custom waveform
                bars_color=("#00ff00", "#0011ff"), bar_count=100, bg_color="#000000"
            )
        ),
        gr.Audio(waveform=True), # waveform is updated with gr.Audio.update
    ],
).launch()
freddyaboulton commented 1 year ago

@aliabid94 This is working great!

I'm wondering if we can avoid modifying the postprocess signature of every component in order to ship it.

If users want to update their waveform via gr.Audio.update, can we make something like the following work?

def update_waveform(audio):
    new_audio = gr.Waveform(image="drake.jpg").make_video(audio)
    return gr.Audio.update(value=new_audio)

It is more code that users have to write, but I think it's fine to expect users to do the processing in their event handlers; that way we don't have to modify all the component postprocess functions.

It's confusing that the front-end audio component is now serving video as well. I think the original approach where we just provide a helper class to convert audio to waveform (but it's on the user to add a gr.Video component) is a lot simpler architecturally. How come we moved away from that?

abidlabs commented 1 year ago

I'm wondering if we can avoid modifying the postprocess signature of every component in order to ship it.

+1. We should keep the .postprocess() method signature as straightforward as possible, especially as people will be writing their own for custom components.

It's confusing that the front-end audio component is now serving video as well. I think the original approach where we just provide a helper class to convert audio to waveform (but it's on the user to add a gr.Video component) is a lot simpler architecturally. How come we moved away from that?

This will lead to some limitations -- for example, it won't be possible to extract the audio and use it as the input for a different component. But overall, this might be the saner solution (and we might not even need a gr.Waveform class -- we could just have a gr.make_waveform(**params) utility that generates a Video).

aliabid94 commented 1 year ago

But we DO want preprocess and postprocess to be able to take into account updated configuration, right? For example, with the Radio element and updated choices. To do that, we'd need to update these methods.

abidlabs commented 1 year ago

But we DO want preprocess and postprocess to be able to take into account updated configuration, right? For example, with the Radio element and updated choices. To do that, we'd need to update these methods.

We definitely need to update the methods, but is the best approach changing the method signatures? A related issue that I see here is that we only save certain fields in the state dictionary, but I feel like this can lead to bugs where users forget to add certain configuration parameters to the state dictionary.

What about the following approach instead: what if every component stores all of its parameters in a state dictionary, where the key is the session hash and the value is a dict of config parameters that have been updated in that session? In preprocess() and postprocess(), we wouldn't access the class attributes directly; instead, we'd call get_config() to get the component parameters for that particular session.
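Something like the following, roughly (a hypothetical sketch of the idea; the names are illustrative, not an actual Gradio API):

class Component:
    def __init__(self, **params):
        self._params = params            # values from the constructor
        self._session_params = {}        # session_hash -> dict of updated params

    def update_config(self, session_hash, **updates):
        # Record config values that were updated during this session.
        self._session_params.setdefault(session_hash, {}).update(updates)

    def get_config(self, session_hash=None):
        # Constructor values, overridden by anything updated in this session.
        config = dict(self._params)
        config.update(self._session_params.get(session_hash, {}))
        return config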

aliabid94 commented 1 year ago

We could do something along those lines, but I feel like it'd make more sense for postprocess to take a second argument called config, which holds all the component configuration values, rather than using a get_config() method, e.g.

class Radio:
  def postprocess(self, y, config):
    if config.type == "index":
      return config.choices.index(y)
    else:
      return y

Why exactly don't we want to modify the method signature?

freddyaboulton commented 1 year ago

Yeah, I think storing the latest state in the backend is a great feature we'll add in the future, but I'm not sure it's the best way to ship the waveform component. I think the simplest thing to do to close this out is the gr.make_waveform(**params) approach @abidlabs suggested above. We could hash out the design details/implementation of storing state in the backend in a separate issue/PR.

abidlabs commented 1 year ago

We could do something along those lines, but I feel like it'd make more sense for postprocess to take a second argument called config, which holds all the component configuration values, rather than using a get_config() method, e.g.

Yeah I like this more. I didn't want a second parameter that users would have to reason about as they were designing their component, but config is pretty intuitive IMO.

aliabid94 commented 1 year ago

So that's what I had initially: a return-a-gr.Waveform type solution, and @abidlabs didn't like that (and I agreed), because so far we've always had return types be standard data types, and things related to display configuration are set in the component constructor. That's why I changed to a gr.update solution. I agree this is a structural change that requires discussion, but why not just have that discussion now? Especially since this is needed to solve other bugs, like the gr.Radio choices update issue we mentioned earlier.

abidlabs commented 1 year ago

So that's what I had initially: a return-a-gr.Waveform type solution, and @abidlabs didn't like that (and I agreed), because so far we've always had return types be standard data types, and things related to display configuration are set in the component constructor. That's why I changed to a gr.update solution.

Actually, what @freddyaboulton is suggesting is a bit different -- he's not saying to return the gr.Waveform (which we were opposed to), but instead to create a utility function (say gr.make_waveform()) that returns a video, which we pass into the Video component. Are you opposed to this solution?

aliabid94 commented 1 year ago

So the output component is now Video, rather than audio?

abidlabs commented 1 year ago

Exactly

aliabid94 commented 1 year ago

ok sounds good, I think that works for now

github-actions[bot] commented 1 year ago

The demo notebooks don't match the run.py files. Please run this command from the root of the repo and then commit the changes:

pip install nbformat && cd demo && python generate_notebooks.py
aliabid94 commented 1 year ago

Ok, implemented @freddyaboulton's suggestion; now a much simpler PR. We can figure out backend configuration later.

def audio_waveform(audio, image):
    return (
        audio,
        gr.make_waveform(audio),
        gr.make_waveform(audio, bg_image=image, bars_color=random.choice(COLORS)),
    )

gr.Interface(
    audio_waveform,
    inputs=[gr.Audio(), gr.Image(type="filepath")],
    outputs=[
        gr.Audio(),
        gr.Video(),
        gr.Video(),
    ],
).launch()
aliabid94 commented 1 year ago

ready for rereview

abidlabs commented 1 year ago

@aliabid94 works and looks beautiful! I added some comments above. We should make sure this is documented well in the docs page as it's a user-facing function. In addition, I think it would be good to write a unit test for it so we don't break it in the future.

abidlabs commented 1 year ago

By the way, we can remove the Waveform.svelte now, correct? And presumably the changes to pnpm-lock.yaml?

aliabid94 commented 1 year ago

Added docs and removed frontend files @abidlabs, and added a test

abidlabs commented 1 year ago

Why did we switch to camelcase instead of the original more pythonic gr.make_waveform?

abidlabs commented 1 year ago

LGTM, awesome @aliabid94!