haoheliu / AudioLDM

AudioLDM: Generate speech, sound effects, music and beyond, with text.
https://audioldm.github.io/

Super resolution example? #4


devilismyfriend commented 1 year ago

Would love to see code to reproduce the paper's super resolution

haoheliu commented 1 year ago

Sure. We will open-source that part, which is also in the TODO list.

galfaroth commented 1 year ago

Could you possibly just send the Audio Super Resolution model you used so that we don't have to download the dataset and train ourselves?

haoheliu commented 1 year ago

@galfaroth Super-resolution and inpainting will be available this Friday. Thanks for your patience.

devilismyfriend commented 1 year ago

Awesome! Excited to test it out


galfaroth commented 1 year ago

@galfaroth Super-resolution and inpainting will be available this Friday. Thanks for your patience.

No way!

haoheliu commented 1 year ago

Hi all, the code related to super-resolution and inpainting is available here: https://github.com/haoheliu/AudioLDM/blob/main/audioldm/pipeline.py#L223

It has not been integrated into the command-line usage yet because I haven't come up with an elegant and simple interface, and I'm trying to avoid making this tool exceedingly heavy. Also, super-resolution and inpainting may not be of that broad an interest, from my perspective (correct me if I'm wrong). So I'll temporarily leave them in this Python function form. You can still play with the function, though; I've already tested it and it works fine.

galfaroth commented 1 year ago

Hey, I tried using the new method:

def upsample(original_filepath, text, duration, guidance_scale, random_seed, n_candidates, steps):
    waveform = super_resolution_and_inpainting(
        audioldm, text, original_filepath,
        seed=random_seed,
        ddim_steps=steps,
        duration=duration,
        batchsize=1,
        guidance_scale=guidance_scale,
        n_candidate_gen_per_text=int(n_candidates),
        time_mask_ratio_start_and_end=(1.0, 1.0),   # no inpainting
        freq_mask_ratio_start_and_end=(0.75, 1.0),  # regenerate the higher 75% to 100% mel bins
    )
    if len(waveform) == 1:
        waveform = waveform[0]
    return waveform

but then I get:

[<ipython-input-11-eac161f8fca7>](https://localhost:8080/#) in upsample(original_filepath, text, duration, guidance_scale, random_seed, n_candidates, steps)
      8 
      9 def upsample(original_filepath,text, duration, guidance_scale, random_seed, n_candidates, steps):
---> 10   waveform = super_resolution_and_inpainting(audioldm,text,original_filepath,
     11                                   seed=random_seed,ddim_steps=steps,
     12                                   duration=duration, batchsize=1,

[/content/AudioLDM/audioldm/pipeline.py](https://localhost:8080/#) in super_resolution_and_inpainting(latent_diffusion, text, original_audio_file_path, seed, ddim_steps, duration, batchsize, guidance_scale, n_candidate_gen_per_text, time_mask_ratio_start_and_end, freq_mask_ratio_start_and_end, config)
    258     )
    259 
--> 260     batch = make_batch_for_text_to_audio(text, fbank=mel[None,...], batchsize=batchsize)
    261 
    262     # latent_diffusion.latent_t_size = duration_to_latent_t_size(duration)

[/content/AudioLDM/audioldm/pipeline.py](https://localhost:8080/#) in make_batch_for_text_to_audio(text, waveform, fbank, batchsize)
     26     else:
     27         fbank = torch.FloatTensor(fbank)
---> 28         fbank = fbank.expand(batchsize, 1024, 64)
     29         assert fbank.size(0) == batchsize
     30 

RuntimeError: The expanded size of the tensor (1024) must match the existing size (512) at non-singleton dimension 1. Target sizes: [1, 1024, 64]. Tensor sizes: [1, 512, 64]

I know the base SR = 16000, where do I specify the target SR? Can it upscale to 96000 for example?
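(An editorial aside, not from the thread: the RuntimeError above comes from PyTorch's `Tensor.expand`, which can only broadcast dimensions of size 1; every other dimension must already match the target. The input file produced a 512-frame mel spectrogram, while `make_batch_for_text_to_audio` expands to a fixed 1024 frames, likely because the clip is shorter than the requested duration. A minimal pure-Python mimic of that rule, using a hypothetical `can_expand` helper:)

```python
def can_expand(src_shape, target_shape):
    """Return True if a tensor of src_shape could be expand()-ed to target_shape.

    Mirrors torch.Tensor.expand semantics: each dimension must either already
    equal the target size or be 1 (a singleton that can be broadcast).
    """
    if len(src_shape) != len(target_shape):
        return False
    return all(s == t or s == 1 for s, t in zip(src_shape, target_shape))

# The failing case from the traceback: 512 != 1024 and 512 != 1.
print(can_expand((1, 512, 64), (1, 1024, 64)))   # False
# Had the input produced 1024 mel frames, the expand would succeed.
print(can_expand((1, 1024, 64), (1, 1024, 64)))  # True
```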

haoheliu commented 1 year ago

@galfaroth Super-resolution here means upsampling audio from a lower sampling rate (<16 kHz) to 16 kHz. Going to a higher sampling rate would be a separate research topic.

galfaroth commented 1 year ago

@galfaroth Super-resolution here means upsampling audio from a lower sampling rate (<16 kHz) to 16 kHz. Going to a higher sampling rate would be a separate research topic.

Apart from upsample resolution, why do I get the error? Can you post an example of how to do the upsampling with this method?

haoheliu commented 1 year ago

You can use the following script (sr_inpainting.py) @galfaroth

#!/usr/bin/python3
import os
from audioldm import text_to_audio, style_transfer, build_model, save_wave, get_time, super_resolution_and_inpainting
import argparse

CACHE_DIR = os.getenv(
    "AUDIOLDM_CACHE_DIR",
    os.path.join(os.path.expanduser("~"), ".cache/audioldm"))

parser = argparse.ArgumentParser()

parser.add_argument(
    "-t",
    "--text",
    type=str,
    required=False,
    default="",
    help="Text prompt to the model for audio generation",
)

parser.add_argument(
    "-f",
    "--file_path",
    type=str,
    required=False,
    default=None,
    help="(--mode transfer): Original audio file for style transfer; or (--mode generation): the guidance audio file for generating similar audio",
)

parser.add_argument(
    "--transfer_strength",
    type=float,
    required=False,
    default=0.5,
    help="A value between 0 and 1. 0 means the original audio without transfer; 1 means complete transfer to the audio indicated by the text",
)

parser.add_argument(
    "-s",
    "--save_path",
    type=str,
    required=False,
    help="The path to save model output",
    default="./output",
)

parser.add_argument(
    "-ckpt",
    "--ckpt_path",
    type=str,
    required=False,
    help="The path to the pretrained .ckpt model",
    default=os.path.join(
                CACHE_DIR,
                "audioldm-s-full.ckpt",
            ),
)

parser.add_argument(
    "-b",
    "--batchsize",
    type=int,
    required=False,
    default=1,
    help="How many samples to generate at the same time",
)

parser.add_argument(
    "--ddim_steps",
    type=int,
    required=False,
    default=200,
    help="The sampling step for DDIM",
)

parser.add_argument(
    "-gs",
    "--guidance_scale",
    type=float,
    required=False,
    default=2.5,
    help="Guidance scale (large => better quality and relevance to the text; small => better diversity)",
)

parser.add_argument(
    "-dur",
    "--duration",
    type=float,
    required=False,
    default=10.0,
    help="The duration of the samples",
)

parser.add_argument(
    "-n",
    "--n_candidate_gen_per_text",
    type=int,
    required=False,
    default=3,
    help="Automatic quality control. This number controls the number of candidates (e.g., generate three audios and choose the best to show you). A larger value usually leads to better quality with heavier computation",
)

parser.add_argument(
    "--seed",
    type=int,
    required=False,
    default=42,
    help="Changing this value (any integer) will lead to a different generation result.",
)

args = parser.parse_args()
assert args.duration % 2.5 == 0, "Duration must be a multiple of 2.5"

mode = "super_resolution_and_inpainting"

save_path = os.path.join(args.save_path, mode)

if(args.file_path is not None):
    save_path = os.path.join(save_path, os.path.basename(args.file_path.split(".")[0]))

text = args.text
random_seed = args.seed
duration = args.duration
guidance_scale = args.guidance_scale
n_candidate_gen_per_text = args.n_candidate_gen_per_text

os.makedirs(save_path, exist_ok=True)
audioldm = build_model(ckpt_path=args.ckpt_path)

waveform = super_resolution_and_inpainting(
    audioldm,
    text,
    args.file_path,
    random_seed,
    duration=duration,
    guidance_scale=guidance_scale,
    ddim_steps=args.ddim_steps,
    n_candidate_gen_per_text=n_candidate_gen_per_text,
    batchsize=args.batchsize,
    time_mask_ratio_start_and_end=(0.10, 0.15), # regenerate the 10% to 15% of the time steps in the spectrogram
    # time_mask_ratio_start_and_end=(1.0, 1.0), # no inpainting
    # freq_mask_ratio_start_and_end=(0.75, 1.0), # regenerate the higher 75% to 100% mel bins
    freq_mask_ratio_start_and_end=(1.0, 1.0), # no super-resolution
)

save_wave(waveform, save_path, name="%s_%s" % (get_time(), text))

in the command line, run this script by:

python3 sr_inpainting.py -f trumpet.wav

The script will then inpaint the region between 10% and 15% of the spectrogram's time steps.

galfaroth commented 1 year ago

Hey! Thanks for the reply! What if I wanted to test the super-resolution? Can you provide an example for that too? And possibly a sample input/output pair.

bitnom commented 1 year ago

omg it's happening

Hikari-Tsai commented 1 year ago

Hi @galfaroth, just modify the freq_mask_ratio_start_and_end parameter in @haoheliu's sample code. It's worth spending a little time understanding this repo; it's a good investment.
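(An editorial sketch, not from the thread: to test super-resolution with the script above, swap the mask parameters so that time_mask_ratio_start_and_end=(1.0, 1.0) disables inpainting and freq_mask_ratio_start_and_end=(0.75, 1.0) regenerates the upper mel bins. One plausible reading of these ratio pairs, based only on the comments in the thread and using a hypothetical `mask_indices` helper:)

```python
def mask_indices(ratio_start_end, n_bins):
    """Map a (start, end) ratio pair to the half-open index range it covers.

    Interpretation assumed from the script's comments: (0.75, 1.0) selects
    the upper 25% of bins for regeneration; (1.0, 1.0) selects nothing,
    i.e. that dimension is left untouched.
    """
    start, end = ratio_start_end
    return range(int(start * n_bins), int(end * n_bins))

N_MEL_BINS = 64  # AudioLDM's mel spectrograms have 64 frequency bins (per the traceback)

# Super-resolution setting: regenerate the higher 75% to 100% mel bins.
sr_bins = mask_indices((0.75, 1.0), N_MEL_BINS)
print(list(sr_bins)[:3], len(sr_bins))  # [48, 49, 50] 16

# (1.0, 1.0) covers an empty range -> no super-resolution.
print(len(mask_indices((1.0, 1.0), N_MEL_BINS)))  # 0
```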