haoheliu / versatile_audio_super_resolution

Versatile audio super resolution (any -> 48kHz) with AudioSR.
MIT License

Does not work in Google Colab on music WAVs #3

Open asigalov61 opened 1 year ago

asigalov61 commented 1 year ago

@RetroCirce @haoheliu

Hello, guys!!! :)

Thank you for publishing this work. It looks very promising and the samples are very good too.

I need your audiosr for my music WAVs but it does not work in Google Colab.

Please see the attached WAV, which produces the following traceback on an A100 40GB:

Very Nauty Violin.zip

Loading AudioSR: basic
Loading model on cuda:0
/usr/local/lib/python3.10/dist-packages/torch/functional.py:504: UserWarning: torch.meshgrid: in an upcoming release, it will be required to pass the indexing argument. (Triggered internally at ../aten/src/ATen/native/TensorShape.cpp:3483.)
  return _VF.meshgrid(tensors, **kwargs)  # type: ignore[attr-defined]
/usr/local/lib/python3.10/dist-packages/torchaudio/transforms/_transforms.py:611: UserWarning: Argument 'onesided' has been deprecated and has no influence on the behavior of this module.
  warnings.warn(
DiffusionWrapper has 258.20 M params.
/usr/local/lib/python3.10/dist-packages/audiosr/latent_diffusion/models/ddpm.py:237: RuntimeWarning: divide by zero encountered in divide
  "sqrt_recip_alphas_cumprod", to_torch(np.sqrt(1.0 / alphas_cumprod))
/usr/local/lib/python3.10/dist-packages/audiosr/latent_diffusion/models/ddpm.py:240: RuntimeWarning: divide by zero encountered in divide
  "sqrt_recipm1_alphas_cumprod", to_torch(np.sqrt(1.0 / alphas_cumprod - 1))
/usr/local/lib/python3.10/dist-packages/audiosr/utils.py:109: FutureWarning: Pass sr=48000, n_fft=2048, n_mels=256, fmin=20, fmax=24000 as keyword args. From version 0.10 passing these as positional arguments will result in an error
  mel = librosa_mel_fn(sampling_rate, filter_length, n_mel, mel_fmin, mel_fmax)
Running DDIM Sampling with 200 timesteps
DDIM Sampler:   0% 0/200 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "/usr/local/bin/audiosr", line 107, in <module>
    waveform = super_resolution(
  File "/usr/local/lib/python3.10/dist-packages/audiosr/pipeline.py", line 167, in super_resolution
    waveform = latent_diffusion.generate_batch(
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/audiosr/latent_diffusion/models/ddpm.py", line 1524, in generate_batch
    samples, _ = self.sample_log(
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/audiosr/latent_diffusion/models/ddpm.py", line 1430, in sample_log
    samples, intermediates = ddim_sampler.sample(
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/audiosr/latent_diffusion/models/ddim.py", line 143, in sample
    samples, intermediates = self.ddim_sampling(
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/audiosr/latent_diffusion/models/ddim.py", line 237, in ddim_sampling
    outs = self.p_sample_ddim(
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/audiosr/latent_diffusion/models/ddim.py", line 293, in p_sample_ddim
    model_t = self.model.apply_model(x_in, t_in, c)
  File "/usr/local/lib/python3.10/dist-packages/audiosr/latent_diffusion/models/ddpm.py", line 1029, in apply_model
    x_recon = self.model(x_noisy, t, cond_dict=cond)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/audiosr/latent_diffusion/models/ddpm.py", line 1685, in forward
    out = self.diffusion_model(
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/audiosr/latent_diffusion/modules/diffusionmodules/openaimodel.py", line 879, in forward
    h = th.cat([h, concate_tensor], dim=1)
RuntimeError: Sizes of tensors must match except in dimension 1. Expected size 928 but got size 927 for tensor number 1 in the list.
haoheliu commented 1 year ago

It should now be fixed in the latest version.

asigalov61 commented 1 year ago

@haoheliu Thanks. It works on some files now but I still get an error on the following file:

Moments.zip

Loading AudioSR: basic
Loading model on cuda:0
/usr/local/lib/python3.10/dist-packages/torch/functional.py:504: UserWarning: torch.meshgrid: in an upcoming release, it will be required to pass the indexing argument. (Triggered internally at ../aten/src/ATen/native/TensorShape.cpp:3483.)
  return _VF.meshgrid(tensors, **kwargs)  # type: ignore[attr-defined]
/usr/local/lib/python3.10/dist-packages/torchaudio/transforms/_transforms.py:611: UserWarning: Argument 'onesided' has been deprecated and has no influence on the behavior of this module.
  warnings.warn(
DiffusionWrapper has 258.20 M params.
/usr/local/lib/python3.10/dist-packages/audiosr/latent_diffusion/models/ddpm.py:237: RuntimeWarning: divide by zero encountered in divide
  "sqrt_recip_alphas_cumprod", to_torch(np.sqrt(1.0 / alphas_cumprod))
/usr/local/lib/python3.10/dist-packages/audiosr/latent_diffusion/models/ddpm.py:240: RuntimeWarning: divide by zero encountered in divide
  "sqrt_recipm1_alphas_cumprod", to_torch(np.sqrt(1.0 / alphas_cumprod - 1))
Warning: audio is longer than 10.24 seconds, may degrade the model performance. It's recommand to truncate your audio to 5.12 seconds before input to AudioSR to get the best performance.
Traceback (most recent call last):
  File "/usr/local/bin/audiosr", line 107, in <module>
    waveform = super_resolution(
  File "/usr/local/lib/python3.10/dist-packages/audiosr/pipeline.py", line 164, in super_resolution
    batch, duration = make_batch_for_super_resolution(input_file, waveform=waveform)
  File "/usr/local/lib/python3.10/dist-packages/audiosr/pipeline.py", line 83, in make_batch_for_super_resolution
    log_mel_spec, stft, waveform, duration, target_frame = read_audio_file(input_file)
  File "/usr/local/lib/python3.10/dist-packages/audiosr/utils.py", line 208, in read_audio_file
    waveform, target_frame, duration = read_wav_file(filename)
  File "/usr/local/lib/python3.10/dist-packages/audiosr/utils.py", line 204, in read_wav_file
    waveform = pad_wav(waveform, target_length=int(48000 * pad_duration))
  File "/usr/local/lib/python3.10/dist-packages/audiosr/utils.py", line 63, in pad_wav
    temp_wav[:, rand_start : rand_start + waveform_length] = waveform
ValueError: could not broadcast input array from shape (1,7127040) into shape (1,7127039)
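For anyone hitting this before a fix lands: the broadcast failure above is an off-by-one between the target_length computed in pad_wav and the actual sample count. A minimal user-side workaround sketch, assuming the mismatch comes from duration rounding in read_wav_file (file names here are hypothetical), is to trim the input to a whole number of seconds before running audiosr:

import soundfile as sf

# Hypothetical file names; point these at your own input
IN_FILE = "Moments.wav"
OUT_FILE = "Moments_trimmed.wav"

# Trim the waveform to a whole number of seconds so the rounded
# target_length in pad_wav cannot disagree with the sample count
audio, sr = sf.read(IN_FILE)
whole_seconds = len(audio) // sr
sf.write(OUT_FILE, audio[: whole_seconds * sr], sr)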
asigalov61 commented 1 year ago

@haoheliu @RetroCirce Any update on this bug? I tried it today and it still does not work. Thank you.

haoheliu commented 1 year ago

Sorry for the error. Will look into that later this week. @asigalov61

asigalov61 commented 1 year ago

@haoheliu Thanks. Please let me know if you will be able to fix it and I will try it again.

WyrmSpear commented 7 months ago

I got this error today trying to upscale my 300-second audiocraft WAVs, so I asked Gemini if it could help. This is what it gave me:

Absolutely! I can help you code a splitter/stitcher for your long audiocraft fork files. Here's a Python script that you can use:

import math
import os

from pydub import AudioSegment

# Set these variables
INPUT_DIR = "path/to/your/input/folder"  # Folder containing your long audio files
OUTPUT_DIR = "path/to/your/output/folder"  # Folder to save split files
MAX_DURATION_SECONDS = 10  # Maximum duration per split file (adjust as needed)

# Create output directory if it doesn't exist
os.makedirs(OUTPUT_DIR, exist_ok=True)

for filename in os.listdir(INPUT_DIR):
    # Get full path and skip anything that isn't an audio file
    filepath = os.path.join(INPUT_DIR, filename)
    if not filename.endswith((".wav", ".mp3", ".flac")):
        continue

    # Load the audio file
    audio = AudioSegment.from_file(filepath)

    # Calculate the number of splits, rounding up so the final
    # partial segment isn't dropped (len(audio) is in milliseconds)
    chunk_ms = MAX_DURATION_SECONDS * 1000
    num_splits = math.ceil(len(audio) / chunk_ms)

    # Split and save audio segments
    stem, ext = os.path.splitext(filename)
    for i in range(num_splits):
        start_ms = i * chunk_ms
        end_ms = min(start_ms + chunk_ms, len(audio))
        segment = audio[start_ms:end_ms]
        output_filename = os.path.join(OUTPUT_DIR, f"{stem}_{i}{ext}")
        segment.export(output_filename, format=ext.lstrip("."))

    print(f"Split and saved {num_splits} segments from {filename}!")

Explanation:

  1. Import libraries: pydub is used for audio manipulation; os and math handle paths and rounding.
  2. Set variables: Adjust INPUT_DIR, OUTPUT_DIR, and MAX_DURATION_SECONDS as needed.
  3. Loop through files:
    • Check if the file is an audio file.
    • Load the audio segment.
    • Calculate the number of splits with math.ceil so the final partial segment isn't dropped.
    • Loop through each split:
      • Define start and end times.
      • Extract the segment.
      • Create an output filename with unique numbering.
      • Export the segment to the output directory.
  4. Print confirmation message: Shows the number of splits created for each file.

Important Notes:

Remember to install pydub (it needs ffmpeg on the PATH for non-WAV formats) and adjust the variables before running the script. Let me know if you have any questions or need further assistance!
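
The script above only covers the splitting half. For the other half, here's a minimal stitcher sketch to recombine the segments after upscaling, assuming the outputs are WAVs that keep the _<i> suffix the splitter produces (STEM and the paths are placeholders):

import os
import re

from pydub import AudioSegment

SPLIT_DIR = "path/to/your/upscaled/folder"  # Folder with the upscaled segments
STITCHED_FILE = "path/to/stitched_output.wav"  # Recombined output
STEM = "my_track"  # Base name used by the splitter (hypothetical)

# Collect files named like "<STEM>_<i>.wav" and sort them numerically by index,
# so "..._10.wav" comes after "..._9.wav" rather than after "..._1.wav"
pattern = re.compile(rf"^{re.escape(STEM)}_(\d+)\.wav$")
segments = sorted(
    (int(m.group(1)), os.path.join(SPLIT_DIR, f))
    for f in os.listdir(SPLIT_DIR)
    if (m := pattern.match(f))
)

# Concatenate in index order and export a single file
stitched = AudioSegment.empty()
for _, path in segments:
    stitched += AudioSegment.from_file(path)
stitched.export(STITCHED_FILE, format="wav")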

asigalov61 commented 7 months ago

@WyrmSpear Thanks, I will try it out.