devilismyfriend opened this issue 1 year ago
Sure. We will open-source that part, which is also in the TODO list.
Could you possibly just send the Audio Super Resolution model you used so that we don't have to download the dataset and train ourselves?
@galfaroth Super-resolution and inpainting will be available this Friday. Thanks for your patience.
Awesome! Excited to test it out
No way!
Hi all, the code related to super-resolution and inpainting is available here: https://github.com/haoheliu/AudioLDM/blob/main/audioldm/pipeline.py#L223
It hasn't been integrated into the command-line usage yet because I haven't come up with an elegant and simple interface, and I'm trying to avoid making this tool exceedingly heavy. From my perspective, super-resolution and inpainting may also not be of broad enough interest (correct me if I'm wrong), so I'll temporarily leave them in this Python function form. You can still play with the function, though. I've already tested it out and it all works fine.
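A minimal call sketch (argument names follow the function in pipeline.py and the package imports; the checkpoint path, file name, prompt, and values below are only illustrative, not a tested recipe):

import os
from audioldm import build_model, save_wave, super_resolution_and_inpainting

# Illustrative checkpoint path; point this at wherever your .ckpt lives
audioldm = build_model(ckpt_path=os.path.expanduser("~/.cache/audioldm/audioldm-s-full.ckpt"))

waveform = super_resolution_and_inpainting(
    audioldm,
    "a trumpet playing",                         # text prompt (can be an empty string)
    "trumpet.wav",                               # original audio file to upsample / inpaint
    seed=42,
    ddim_steps=200,
    duration=10.0,
    batchsize=1,
    guidance_scale=2.5,
    n_candidate_gen_per_text=3,
    time_mask_ratio_start_and_end=(1.0, 1.0),    # (1.0, 1.0) = no inpainting
    freq_mask_ratio_start_and_end=(0.75, 1.0),   # regenerate the higher 75% to 100% mel bins
)
os.makedirs("./output", exist_ok=True)
save_wave(waveform, "./output", name="sr_test")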
Hey, I tried using the new method:
def upsample(original_filepath, text, duration, guidance_scale, random_seed, n_candidates, steps):
    waveform = super_resolution_and_inpainting(
        audioldm, text, original_filepath,
        seed=random_seed, ddim_steps=steps,
        duration=duration, batchsize=1,
        guidance_scale=guidance_scale,
        n_candidate_gen_per_text=int(n_candidates),
        time_mask_ratio_start_and_end=(1.0, 1.0),    # no inpainting
        freq_mask_ratio_start_and_end=(0.75, 1.0),   # regenerate the higher 75% to 100% mel bins
    )
    if len(waveform) == 1:
        waveform = waveform[0]
    return waveform
but then I get:
<ipython-input-11-eac161f8fca7> in upsample(original_filepath, text, duration, guidance_scale, random_seed, n_candidates, steps)
      8
      9 def upsample(original_filepath,text, duration, guidance_scale, random_seed, n_candidates, steps):
---> 10 waveform = super_resolution_and_inpainting(audioldm,text,original_filepath,
     11 seed=random_seed,ddim_steps=steps,
     12 duration=duration, batchsize=1,

/content/AudioLDM/audioldm/pipeline.py in super_resolution_and_inpainting(latent_diffusion, text, original_audio_file_path, seed, ddim_steps, duration, batchsize, guidance_scale, n_candidate_gen_per_text, time_mask_ratio_start_and_end, freq_mask_ratio_start_and_end, config)
    258 )
    259
--> 260 batch = make_batch_for_text_to_audio(text, fbank=mel[None,...], batchsize=batchsize)
    261
    262 # latent_diffusion.latent_t_size = duration_to_latent_t_size(duration)

/content/AudioLDM/audioldm/pipeline.py in make_batch_for_text_to_audio(text, waveform, fbank, batchsize)
     26 else:
     27 fbank = torch.FloatTensor(fbank)
---> 28 fbank = fbank.expand(batchsize, 1024, 64)
     29 assert fbank.size(0) == batchsize
     30

RuntimeError: The expanded size of the tensor (1024) must match the existing size (512) at non-singleton dimension 1. Target sizes: [1, 1024, 64]. Tensor sizes: [1, 512, 64]
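For what it's worth, a rough frame-count check (my assumption, not verified against the config: the mel is computed at 16 kHz with a 160-sample hop, i.e. 100 frames per second):

# Assumed mel settings: 16 kHz sampling rate, 160-sample hop -> 100 frames per second
sr, hop = 16000, 160
frames_per_second = sr / hop
print(512 / frames_per_second)   # ~5.1 s  -> the 512-frame mel built from my clip/duration
print(1024 / frames_per_second)  # ~10.2 s -> the fixed 1024 frames make_batch_for_text_to_audio expands to

So my guess is that my clip/duration only covers about 5 s, while the batch helper expects a ~10 s (1024-frame) spectrogram.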
I know the base SR = 16000; where do I specify the target SR? Can it upscale to 96000, for example?
@galfaroth Super-resolution here means upsampling audio with a sampling rate below 16 kHz to 16 kHz. Going to a higher sampling rate would be a separate research topic.
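As an illustration (a rough sketch using torchaudio, which this repo does not itself require), you could band-limit a 16 kHz clip by resampling it down and back up, so the upper bands are empty and the model has something to regenerate:

import torchaudio
import torchaudio.functional as F

# Band-limit a 16 kHz clip by resampling down to 8 kHz and back up to 16 kHz,
# removing everything above ~4 kHz; the SR path then regenerates those mel bins.
wav, sr = torchaudio.load("trumpet.wav")          # assumed to be a 16 kHz file
band_limited = F.resample(F.resample(wav, sr, 8000), 8000, sr)
torchaudio.save("trumpet_band_limited.wav", band_limited, sr)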
Apart from the target resolution, why do I get the error? Can you post an example of how to do the upsampling with this method?
You can use the following script (sr_inpainting.py) @galfaroth
#!/usr/bin/python3
import os
from audioldm import text_to_audio, style_transfer, build_model, save_wave, get_time, super_resolution_and_inpainting
import argparse
CACHE_DIR = os.getenv(
    "AUDIOLDM_CACHE_DIR",
    os.path.join(os.path.expanduser("~"), ".cache/audioldm"))
parser = argparse.ArgumentParser()
parser.add_argument(
    "-t",
    "--text",
    type=str,
    required=False,
    default="",
    help="Text prompt to the model for audio generation",
)
parser.add_argument(
    "-f",
    "--file_path",
    type=str,
    required=False,
    default=None,
    help="(--mode transfer): Original audio file for style transfer; or (--mode generation): the guidance audio file for generating similar audio",
)
parser.add_argument(
    "--transfer_strength",
    type=float,
    required=False,
    default=0.5,
    help="A value between 0 and 1. 0 means the original audio without transfer, 1 means completely transferred to the audio indicated by the text",
)
parser.add_argument(
    "-s",
    "--save_path",
    type=str,
    required=False,
    help="The path to save model output",
    default="./output",
)
parser.add_argument(
    "-ckpt",
    "--ckpt_path",
    type=str,
    required=False,
    help="The path to the pretrained .ckpt model",
    default=os.path.join(
        CACHE_DIR,
        "audioldm-s-full.ckpt",
    ),
)
parser.add_argument(
    "-b",
    "--batchsize",
    type=int,
    required=False,
    default=1,
    help="How many samples to generate at the same time",
)
parser.add_argument(
    "--ddim_steps",
    type=int,
    required=False,
    default=200,
    help="The number of sampling steps for DDIM",
)
parser.add_argument(
    "-gs",
    "--guidance_scale",
    type=float,
    required=False,
    default=2.5,
    help="Guidance scale (large => better quality and relevance to text; small => better diversity)",
)
parser.add_argument(
    "-dur",
    "--duration",
    type=float,
    required=False,
    default=10.0,
    help="The duration of the samples",
)
parser.add_argument(
    "-n",
    "--n_candidate_gen_per_text",
    type=int,
    required=False,
    default=3,
    help="Automatic quality control. This number controls the number of candidates (e.g., generate three audios and choose the best one to show you). A larger value usually leads to better quality with heavier computation",
)
parser.add_argument(
    "--seed",
    type=int,
    required=False,
    default=42,
    help="Changing this value (any integer) will lead to a different generation result.",
)
args = parser.parse_args()
assert args.duration % 2.5 == 0, "Duration must be a multiple of 2.5"
mode = "super_resolution_and_inpainting"
save_path = os.path.join(args.save_path, mode)
if args.file_path is not None:
    save_path = os.path.join(save_path, os.path.basename(args.file_path.split(".")[0]))
text = args.text
random_seed = args.seed
duration = args.duration
guidance_scale = args.guidance_scale
n_candidate_gen_per_text = args.n_candidate_gen_per_text
os.makedirs(save_path, exist_ok=True)
audioldm = build_model(ckpt_path=args.ckpt_path)
waveform = super_resolution_and_inpainting(
    audioldm,
    text,
    args.file_path,
    random_seed,
    duration=duration,
    guidance_scale=guidance_scale,
    ddim_steps=args.ddim_steps,
    n_candidate_gen_per_text=n_candidate_gen_per_text,
    batchsize=args.batchsize,
    time_mask_ratio_start_and_end=(0.10, 0.15),  # regenerate the 10% to 15% of the time steps in the spectrogram
    # time_mask_ratio_start_and_end=(1.0, 1.0),  # no inpainting
    # freq_mask_ratio_start_and_end=(0.75, 1.0), # regenerate the higher 75% to 100% mel bins
    freq_mask_ratio_start_and_end=(1.0, 1.0),    # no super-resolution
)
save_wave(waveform, save_path, name="%s_%s" % (get_time(), text))
On the command line, run this script with:
python3 sr_inpainting.py -f trumpet.wav
The script will then do inpainting on the audio between the 10% and 15% time steps.
Hey! Thanks for the reply! What if I wanted to test the super-resolution? Can you provide an example of that too, ideally with a sample input and output?
omg it's happening
Hi @galfaroth,
Just modify the freq_mask_ratio_start_and_end parameter in @haoheliu's sample code.
It's worth spending a little time to understand this repo; it's a good investment.
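For example, swapping in the values from the commented-out lines of that script (everything else unchanged):

waveform = super_resolution_and_inpainting(
    audioldm,
    text,
    args.file_path,
    random_seed,
    duration=duration,
    guidance_scale=guidance_scale,
    ddim_steps=args.ddim_steps,
    n_candidate_gen_per_text=n_candidate_gen_per_text,
    batchsize=args.batchsize,
    time_mask_ratio_start_and_end=(1.0, 1.0),    # no inpainting
    freq_mask_ratio_start_and_end=(0.75, 1.0),   # regenerate the higher 75% to 100% mel bins (super-resolution)
)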
Would love to see code to reproduce the paper's super resolution