nateraw / stable-diffusion-videos

Create 🔥 videos with Stable Diffusion by exploring the latent space and morphing between text prompts
Apache License 2.0

Error with making a music video (av.error.ValueError: [Errno 22] Invalid argument) #190

Closed: ndelib closed this issue 1 year ago

ndelib commented 1 year ago

Hi @nateraw, thanks for your great work on this package.

I'm currently struggling to generate a video synchronised with an mp3 file. I've set up my Python environment as per your requirements.txt file and I'm getting the following error:

Traceback (most recent call last):
  File "make_music_video.py", line 21, in <module>
    video_path = pipeline.walk(
  File "/home/ubuntu/venv/lib/python3.8/site-packages/stable_diffusion_videos/stable_diffusion_pipeline.py", line 867, in walk
    make_video_pyav(
  File "/home/ubuntu/venv/lib/python3.8/site-packages/stable_diffusion_videos/stable_diffusion_pipeline.py", line 130, in make_video_pyav
    write_video(
  File "/home/ubuntu/venv/lib/python3.8/site-packages/torchvision/io/video.py", line 124, in write_video
    for packet in a_stream.encode(frame):
  File "av/stream.pyx", line 164, in av.stream.Stream.encode
  File "av/codec/context.pyx", line 482, in av.codec.context.CodecContext.encode
  File "av/audio/codeccontext.pyx", line 42, in av.audio.codeccontext.AudioCodecContext._prepare_frames_for_encode
  File "av/audio/resampler.pyx", line 101, in av.audio.resampler.AudioResampler.resample
  File "av/filter/graph.pyx", line 211, in av.filter.graph.Graph.push
  File "av/filter/context.pyx", line 89, in av.filter.context.FilterContext.push
  File "av/error.pyx", line 336, in av.error.err_check
av.error.ValueError: [Errno 22] Invalid argument

The error seems to occur after the code has already generated all the image frames, when it's assembling the final video file (that's my assumption from the traceback, but I could be mistaken).
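
For what it's worth, here's a minimal sketch that exercises the same torchvision write_video audio path the traceback goes through. The shapes, sample rate, and codec are my own guesses, and it may or may not trigger the error depending on the installed av/torchvision versions:

import math

import torch
from torchvision.io import write_video

# A few random video frames plus one second of a mono 440 Hz sine wave,
# passed through the same write_video call the pipeline uses.
frames = torch.randint(0, 255, (16, 64, 64, 3), dtype=torch.uint8)  # (T, H, W, C)
t = torch.linspace(0, 1, 44100)
audio = torch.sin(2 * math.pi * 440 * t).unsqueeze(0)  # (channels, samples)

write_video(
    "repro.mp4",
    frames,
    fps=4,
    audio_array=audio,
    audio_fps=44100,
    audio_codec="aac",
)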

I tried to resolve the issue by trying various permutations of the package versions in requirements.txt, including locking everything to releases from before Jan 7th, when you last updated requirements.txt. This sometimes changed the error to a different one (e.g. when changing the version of librosa), but never ultimately resolved it.
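
In case it helps anyone comparing environments, a quick way to record the versions of the packages implicated in the traceback:

import av
import librosa
import torchvision

print("av:", av.__version__)
print("torchvision:", torchvision.__version__)
print("librosa:", librosa.__version__)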

My environment:

- Python: 3.8.10
- Pip: 23.1.2 (had to upgrade to the latest to properly install the basicsr package, as per my comment on https://github.com/nateraw/stable-diffusion-videos/issues/170)
- OS: Ubuntu 20.04.5 LTS
- CUDA: 11.8 (nvcc --version reports Cuda compilation tools, release 11.8, V11.8.89, Build cuda_11.8.r11.8/compiler.31833905_0)

This is actually a VM instance spun up on Lambda Labs, which gives you cheap access to pretty powerful GPUs (about $0.60/hour for this instance type). I mention that because one idea that might help users would be a proven way to set this package up in a standard, known environment; maybe that could be one of these Lambda Labs VMs? Just a thought, I'm not affiliated with them in any way.

That way everyone is working off the same environment, and there's an option for people like me who don't mind spending some pocket change to avoid dependency hell :p

nateraw commented 1 year ago

Not sure about the install issue; will look into it asap (next day or so).

As for setup, yeah, that's a great idea. That's what the Colab was for, but I use Lambda too. Next time I spin up an instance I'll sort out the install and add Lambda-specific instructions to the README. If you have time, feel free to open a PR with these instructions and I'll give it a try :)

ndelib commented 1 year ago

Legend, @nateraw, thanks buddy! I'd be happy to help fill out some instructions for Lambda VM setup, but I haven't actually gotten the music video mode working there yet.

This morning I spent two hours or so trying more things and sadly kept hitting the same error (e.g. I tried replicating the exact conda setup you mentioned in a few other threads, and I also tried locking torchvision to the previous 0.14.1 version).

Whenever you have the time, let me know if you get it working in that context, and I'll happily help with a PR however I can.

nateraw commented 1 year ago

I'm taking a look on a fresh A100 instance from lambda 😎 will let ya know how it goes

nateraw commented 1 year ago

A few issues going on:

  1. Gradio fails due to an old fsspec version. Brought it up here just now: https://github.com/gradio-app/gradio/issues/2626
  2. The version of protobuf installed by default on Lambda is mismatched with transformers, I guess? Had to downgrade.
  3. Some issue with the VAE/scheduler I had set as defaults in the example files when initializing the pipelines. Need to init them differently. I'll update to a simpler example so this issue goes away.

To set up a new Lambda A10 instance to make music videos

Install deps. I added xformers because it speeds things up (from ~40 sec per batch in the example below to ~26 sec on the A10).

pip install --upgrade fsspec
pip install stable-diffusion-videos protobuf==3.20.* youtube-dl
pip install xformers
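
A quick optional sanity check that torch sees the GPU and xformers imports cleanly:

import torch
import xformers

print(torch.cuda.is_available())      # should print True
print(torch.cuda.get_device_name(0))  # should name the A10
print(xformers.__version__)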

Get a song

youtube-dl -f bestaudio --extract-audio --audio-format mp3 --audio-quality 0 -o "music/thoughts.%(ext)s" https://soundcloud.com/nateraw/thoughts
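
If you want to sanity-check the download before rendering, something along these lines works (note that librosa's get_duration keyword is filename in older releases and was renamed to path in librosa >= 0.10):

import librosa

# Confirm the mp3 loads and is long enough to cover the audio offsets used below.
duration = librosa.get_duration(filename="music/thoughts.mp3")
print(f"song length: {duration:.1f} sec")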

Then you can run:

import random

import torch
from stable_diffusion_videos import StableDiffusionWalkPipeline

pipe = StableDiffusionWalkPipeline.from_pretrained(
    'runwayml/stable-diffusion-v1-5',
    torch_dtype=torch.float16,
    safety_checker=None,
).to("cuda")

# Comment the line below if you do not have xformers installed.
pipe.enable_xformers_memory_efficient_attention()

# I give you permission to scrape this song :)
# youtube-dl -f bestaudio --extract-audio --audio-format mp3 --audio-quality 0 -o "music/thoughts.%(ext)s" https://soundcloud.com/nateraw/thoughts
audio_filepath = 'music/thoughts.mp3'

# Seconds in the song. Here we slice the audio from 0:07 to 0:13.
# Should be the same length as prompts/seeds.
audio_offsets = [7, 10, 13]

# Output video frames per second.
# Use lower values for testing (5-10ish), higher values for better quality (30 or 60)
fps = 4  # Change back to 25-30ish, 4 is for testing

# Convert seconds to frames.
# This list should have length `len(prompts) - 1`, since it holds the steps between prompts.
num_interpolation_steps = [(b - a) * fps for a, b in zip(audio_offsets, audio_offsets[1:])]

prompts = ["a cat with a funny hat", "snoop dogg at the dmv", "steak flavored ice cream"]
seeds = [random.randint(0, int(9e9)) for _ in range(len(prompts))]  # randint needs int bounds; 9e9 is a float

pipe.walk(
    prompts=prompts,
    seeds=seeds,
    num_interpolation_steps=num_interpolation_steps,
    fps=fps,
    audio_filepath=audio_filepath,
    audio_start_sec=audio_offsets[0],
    batch_size=12,  # Increase/decrease based on available GPU memory. This fits on 24GB A10
    num_inference_steps=50,
    guidance_scale=15,
    margin=1.0,
    smooth=0.2,
)
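
To spell out the arithmetic in this example:

# With audio_offsets = [7, 10, 13] and fps = 4:
#   num_interpolation_steps = [(10 - 7) * 4, (13 - 10) * 4] = [12, 12]
#   total frames = 12 + 12 = 24, and 24 frames / 4 fps = 6 seconds of video,
#   which matches the 13 - 7 = 6 second slice of audio.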

Will add this stuff to the repo :) Let me know if you give it a shot.

ndelib commented 1 year ago

You're a wizard @nateraw!! Thanks a lot. Can confirm the above worked for me on a Lambda A10 :)

Only minor thing to note is that upgrading the pip version was still required (perhaps that's worth adding to the README when you next can).

Also, I'm still a little confused about the audio_offsets. I'm interpreting them as the points in the song where each prompt should "kick in" (so the differences between offsets are the durations of each prompt). However, the mp4 generated from your example above is only 6 seconds long; not sure if this is intentional. All good, that's for me to figure out. Thanks again!