Captioning issues on Mac M1

Hello!

I'm trying to use BLIP to caption images on a Mac with M1 processor, using the MPS torch backend.

It is running (as in it executes, processes the images and returns a caption), but the results are not great.

I'm using this image for testing and here are the results for a few scenarios:

num_beams=12, max_length=20, min_length=5, repetition_penalty=1.0: the these these these paris paris paris paris paris the the a a a a a
num_beams=12, max_length=20, min_length=5, repetition_penalty=1.2: thepy these someone paris marks there chillingux sunni 「 sunset sunriseils types an
num_beams=1, max_length=20, min_length=5, repetition_penalty=1.2: the these paris someone sunni chilling thereux paris the the a a a a a

And there are some other variations, but in principle, it repeats words, the caption has some valid words, but the sentences don't make sense and there are some tokens returned by text_decoder.generate() that seem to "stick" to other words, in the examples above two of those are ##py (in the second case, which created the thepy word) and in the second and last one ##ux (which made chillingux and thereux respectively).

I tested the text encoder and decoder and they are working as expected, the issue seems to be happening during the call to .generate(). This happens with every image I test with.

Anyone has any idea what could be the problem and if there's a solution?

salesforce / BLIP

Captioning issues on Mac M1 #131