salesforce / BLIP

PyTorch code for BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation
BSD 3-Clause "New" or "Revised" License

Captioning issues on Mac M1 #131

Open victorca25 opened 1 year ago

victorca25 commented 1 year ago

Hello!

I'm trying to use BLIP to caption images on a Mac with M1 processor, using the MPS torch backend.

It runs (it executes, processes the images, and returns a caption), but the results are not great.
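For reference, the invocation roughly follows the repo's demo notebook; this is a minimal sketch, and the checkpoint path, image path, and generate() parameters here are assumptions rather than my exact script:

```python
import torch
from PIL import Image
from torchvision import transforms
from torchvision.transforms.functional import InterpolationMode

from models.blip import blip_decoder

# Use the MPS backend when available, otherwise fall back to CPU
device = torch.device('mps' if torch.backends.mps.is_available() else 'cpu')
image_size = 384

# Standard BLIP preprocessing: bicubic resize + CLIP-style normalization
transform = transforms.Compose([
    transforms.Resize((image_size, image_size), interpolation=InterpolationMode.BICUBIC),
    transforms.ToTensor(),
    transforms.Normalize((0.48145466, 0.4578275, 0.40821073),
                         (0.26862954, 0.26130258, 0.27577711)),
])

raw_image = Image.open('test.jpg').convert('RGB')  # placeholder image path
image = transform(raw_image).unsqueeze(0).to(device)

model = blip_decoder(pretrained='model_base_caption_capfilt_large.pth',  # assumed checkpoint
                     image_size=image_size, vit='base')
model.eval()
model = model.to(device)

with torch.no_grad():
    # Beam-search captioning, as in the demo
    caption = model.generate(image, sample=False, num_beams=3,
                             max_length=20, min_length=5)
print(caption[0])
```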

I'm using this image for testing and here are the results for a few scenarios:

There are some other variations, but in general the output repeats words; the captions contain some valid words, but the sentences don't make sense, and some tokens returned by text_decoder.generate() seem to "stick" to other words. In the examples above, two of those are ##py (in the second case, which produced the word thepy) and ##ux (in the second and last cases, which produced chillingux and thereux respectively).

I tested the text encoder and decoder and they work as expected; the issue seems to happen during the call to .generate(). This happens with every image I test.
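A sanity check along these lines (a hypothetical snippet, reusing model, transform, and raw_image from the sketch above) is what makes me suspect the MPS backend rather than the model itself:

```python
import torch

image_cpu = transform(raw_image).unsqueeze(0)

with torch.no_grad():
    # Same image, same parameters, different devices
    caption_cpu = model.to('cpu').generate(image_cpu, sample=False, num_beams=3)
    caption_mps = model.to('mps').generate(image_cpu.to('mps'), sample=False, num_beams=3)

print('cpu:', caption_cpu[0])  # produces a sensible caption
print('mps:', caption_mps[0])  # produces the garbled output described above
```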

Does anyone have an idea what the problem could be and whether there's a solution?