guyyariv / TempoTokens

This repo contains the official PyTorch implementation of: Diverse and Aligned Audio-to-Video Generation via Text-to-Video Model Adaptation
https://pages.cs.huji.ac.il/adiyoss-lab/TempoTokens/
MIT License

Example inference command does not work #1

Closed · SoftologyPro closed this 12 months ago

SoftologyPro commented 12 months ago

Your example command is inference.py --mapper_weights models/vggsound/learned_embeds.pth --audio_path /audio/path. I tried the following:

python inference.py --mapper_weights models\vggsound\learned_embeds.pth --audio_path croaking.mp3

which gives this error:

usage: inference.py [-h] -m MODEL --mapper_weights MAPPER_WEIGHTS [-p PROMPT] [-n NEGATIVE_PROMPT] [-o OUTPUT_DIR] [-B BATCH_SIZE] [-W WIDTH] [-H HEIGHT] [-T NUM_FRAMES] [-WS WINDOW_SIZE] [-VB VAE_BATCH_SIZE] [-s NUM_STEPS] [-g GUIDANCE_SCALE] [-i INIT_VIDEO]
                    [-iw INIT_WEIGHT] [-f FPS] [-d DEVICE] [-x] [-S] [-lP LORA_PATH] [-lR LORA_RANK] [-rw] [-l] [-r SEED] [--n N] [--testset TESTSET] [--audio_path AUDIO_PATH]
inference.py: error: the following arguments are required: -m/--model

What do I need to specify for the --model parameter? Thanks for any tips.

guyyariv commented 12 months ago

Thanks for reporting this. The --model parameter should point to the pre-trained text-to-video model that we fine-tuned; the checkpoint compatible with the provided mapper_weights is "cerspense/zeroscope_v2_576w". I've now set it as the default, so feel free to give it a try.
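For reference, a full invocation with the model given explicitly would look something like this (the mapper weights and audio path simply mirror the command above; adjust them to your own setup):

```
python inference.py -m cerspense/zeroscope_v2_576w --mapper_weights models/vggsound/learned_embeds.pth --audio_path croaking.mp3
```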

SoftologyPro commented 12 months ago

Getting further, but now...

  File "D:\Tests\TempoTokens\inference.py", line 519, in <module>
    videos = inference(
  File "D:\Tests\TempoTokens\voc_tempotokens\lib\site-packages\torch\utils\_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "D:\Tests\TempoTokens\inference.py", line 400, in inference
    prompt_embeds, negative_prompt_embeds = compel(prompt), compel(negative_prompt) if negative_prompt else None
  File "D:\Tests\TempoTokens\voc_tempotokens\lib\site-packages\torch\utils\_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "D:\Tests\TempoTokens\voc_tempotokens\lib\site-packages\compel\compel.py", line 135, in __call__
    output = self.build_conditioning_tensor(text_input)
  File "D:\Tests\TempoTokens\voc_tempotokens\lib\site-packages\compel\compel.py", line 112, in build_conditioning_tensor
    conditioning, _ = self.build_conditioning_tensor_for_conjunction(conjunction)
  File "D:\Tests\TempoTokens\voc_tempotokens\lib\site-packages\compel\compel.py", line 186, in build_conditioning_tensor_for_conjunction
    this_conditioning, this_options = self.build_conditioning_tensor_for_prompt_object(p)
  File "D:\Tests\TempoTokens\voc_tempotokens\lib\site-packages\compel\compel.py", line 218, in build_conditioning_tensor_for_prompt_object
    return self._get_conditioning_for_flattened_prompt(prompt), {}
  File "D:\Tests\TempoTokens\voc_tempotokens\lib\site-packages\compel\compel.py", line 282, in _get_conditioning_for_flattened_prompt
    return self.conditioning_provider.get_embeddings_for_weighted_prompt_fragments(
  File "D:\Tests\TempoTokens\voc_tempotokens\lib\site-packages\compel\embeddings_provider.py", line 120, in get_embeddings_for_weighted_prompt_fragments
    base_embedding = self.build_weighted_embedding_tensor(tokens, per_token_weights, mask, device=device)
  File "D:\Tests\TempoTokens\voc_tempotokens\lib\site-packages\compel\embeddings_provider.py", line 371, in build_weighted_embedding_tensor
    z = self._encode_token_ids_to_embeddings(chunk_token_ids, chunk_attention_mask)
  File "D:\Tests\TempoTokens\voc_tempotokens\lib\site-packages\compel\embeddings_provider.py", line 390, in _encode_token_ids_to_embeddings
    text_encoder_output = self.text_encoder(token_ids,
  File "D:\Tests\TempoTokens\voc_tempotokens\lib\site-packages\torch\nn\modules\module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "D:\Tests\TempoTokens\modules\text_encoder\modeling_clip_temp_token.py", line 855, in forward
    return self.text_model(
  File "D:\Tests\TempoTokens\voc_tempotokens\lib\site-packages\torch\nn\modules\module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "D:\Tests\TempoTokens\modules\text_encoder\modeling_clip_temp_token.py", line 760, in forward
    hidden_states = self.embeddings(input_ids=input_ids, position_ids=position_ids, audio_token=audio_token, temp_token=temp_token, local_windows=local_windows)
  File "D:\Tests\TempoTokens\voc_tempotokens\lib\site-packages\torch\nn\modules\module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "D:\Tests\TempoTokens\modules\text_encoder\modeling_clip_temp_token.py", line 232, in forward
    indices = torch.where(input_ids == 49408)
RuntimeError: CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

Using Torch 2.0.1, i.e.

pip install --no-cache-dir --ignore-installed --force-reinstall --no-warn-conflicts torch==2.0.1+cu118 torchvision==0.15.2+cu118 torchaudio==2.0.2+cu118 --index-url https://download.pytorch.org/whl/cu118

guyyariv commented 12 months ago

The token embeddings weren't being resized properly during inference, which caused the CUDA device-side assert. I've fixed the issue, so you should now be able to generate videos without hitting that error. Thanks for reporting it!
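For anyone hitting a similar device-side assert, a minimal sketch of the usual remedy is below. It assumes the standard Hugging Face pattern of adding a placeholder token to the tokenizer and resizing the text encoder's embedding table; the token name and loading details are illustrative, not the repo's exact code:

```python
# Sketch only: shows why an id like 49408 goes out of range and the usual fix.
# CLIP's base vocabulary ends at id 49407, so a newly added placeholder token
# gets id 49408; the text encoder must grow its embedding table to match.
from transformers import CLIPTextModel, CLIPTokenizer

model_id = "cerspense/zeroscope_v2_576w"  # checkpoint used with the provided mapper weights
tokenizer = CLIPTokenizer.from_pretrained(model_id, subfolder="tokenizer")
text_encoder = CLIPTextModel.from_pretrained(model_id, subfolder="text_encoder")

# Hypothetical placeholder name; TempoTokens defines its own special tokens.
num_added = tokenizer.add_tokens(["<audio_token>"])
if num_added > 0:
    # Without this resize, embedding lookups for the new id fall outside the
    # table and CUDA reports the device-side assert seen in the traceback.
    text_encoder.resize_token_embeddings(len(tokenizer))
```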

SoftologyPro commented 12 months ago

Yes, all good now. Thanks.