mepc36 opened this issue 1 year ago
Turns out I'm doing this wrong. I don't need to pass in a mask_image_id. Just change the seed_image_id
to point at the audio I want to change (i.e., the techno bass), and don't pass a mask image at all:
{
  "alpha": 0.75,
  "num_inference_steps": 50,
  "seed_image_id": "bass",
  "start": {
    "prompt": "cello",
    "seed": 42,
    "denoising": 0.55,
    "guidance": 7.0
  },
  "end": {
    "prompt": "cello",
    "seed": 42,
    "denoising": 0.55,
    "guidance": 7.0
  }
}
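For anyone else wiring this up, here's a minimal Python sketch of sending that payload to the GPU box (the port matches the request I describe further down; the shape of the response is an assumption, so inspect the keys before decoding anything):

import base64

import requests

GPU_IP = "1.2.3.4"  # replace with your GPU host
url = f"http://{GPU_IP}:3013"

payload = {
    "alpha": 0.75,
    "num_inference_steps": 50,
    "seed_image_id": "bass",
    "start": {"prompt": "cello", "seed": 42, "denoising": 0.55, "guidance": 7.0},
    "end": {"prompt": "cello", "seed": 42, "denoising": 0.55, "guidance": 7.0},
}

resp = requests.post(url, json=payload, timeout=600)
resp.raise_for_status()
result = resp.json()

# The exact response shape depends on the handler, so look at the keys first.
print(list(result.keys()))

# If the audio comes back as base64 under an "audio" key (an assumption), decode it:
if "audio" in result:
    with open("cello_bass.mp3", "wb") as f:
        f.write(base64.b64decode(result["audio"]))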
Now the challenge is to enable dynamic seed images at the Banana server...
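One possible way to tackle that (just a sketch of an approach, not how the repo works today; the seed_image_b64 field and the seed_images path are made up): accept a base64-encoded spectrogram in the request, write it into the seed-images folder under a fresh id, and rewrite seed_image_id to point at it before running inference.

import base64
import uuid
from pathlib import Path

SEED_IMAGES_DIR = Path("seed_images")  # assumption: wherever the repo looks up seed images

def register_dynamic_seed_image(request_json: dict) -> dict:
    """If the request carries a base64 seed spectrogram, persist it and rewrite
    seed_image_id so the normal inference path can find it.

    "seed_image_b64" is a hypothetical field; the stock API only understands
    seed_image_id.
    """
    seed_b64 = request_json.pop("seed_image_b64", None)
    if seed_b64 is not None:
        SEED_IMAGES_DIR.mkdir(parents=True, exist_ok=True)
        image_id = f"dynamic_{uuid.uuid4().hex[:8]}"
        (SEED_IMAGES_DIR / f"{image_id}.png").write_bytes(base64.b64decode(seed_b64))
        request_json["seed_image_id"] = image_id
    return request_json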
Hey there, thanks a lot for the repo, man!
My goal is to do audio-to-audio with a text prompt using this banana-riffusion repo. More specifically, I want to pass in a techno-sounding bass guitar, send a text prompt like "cello" along with it, and get back new base64 data that represents that techno bass guitar but with the "cello" prompt applied to it. Here's a screenshot of Riffusion's streamlit app showing the UI I want to access programmatically:
I tried to do this by passing in a mask_image_id parameter pointing to those rock-and-roll drums in the request, and something broke. I'm trying to use a file called bass.png that exists at ~/bass.png as the mask_image. I created bass.png using the audio-to-image command from the Riffusion CLI, then uploaded it manually to the server using scp and put it in the ~/seed_images folder (a rough sketch of that prep step is below).

Unfortunately, I got the following error when trying to do this. I have this banana-riffusion repo running on an AWS GPU in the cloud. I'm not running the repo inside a Docker container there, just starting it with the following command:

I use the AWS GPU for testing this repo, but my prod env obviously hooks up to a banana-hosted instance of it. Using an AWS GPU for dev cuts down on dev time, since I don't have to wait for the Banana pipeline's artifact build or cold starts. I have not yet tested whether I'd get this error in the banana-hosted instance of this repo.
Here's the request I'm sending. I send it to http://{{GPU_IP}}:3013:

Here's the error I'm getting back:
Any help please? Does the mask_image_id parameter not represent one of the images in an img2img / audio_to_audio conversion? Can this repo not do audio_to_audio at all, meaning I'd have to build my own banana-riffusion to achieve that?

Again, really amazing repo. It saved me about 3 weeks of dev time that I'd otherwise have spent putting riffusion together for banana myself. Thanks!