kijai / ComfyUI-Florence2

Inference Microsoft Florence2 VLM
MIT License

Needs some string cleanup to get rid of the <s> and </s> in the caption output #4

Open RandomGitUser321 opened 1 week ago

RandomGitUser321 commented 1 week ago

Updated: Ensured it works with batches of images as well

Add this to filter out the special tokens. It also makes sure all the other functions still work, since they rely on those tokens. This way you get a clean output that can be saved or used as a prompt.

I tested all the other features and they all worked the same.

        for img in image:
            image_pil = F.to_pil_image(img)
            inputs = processor(text=prompt, images=image_pil, return_tensors="pt", do_rescale=False).to(dtype).to(device)

            generated_ids = model.generate(
                input_ids=inputs["input_ids"],
                pixel_values=inputs["pixel_values"],
                max_new_tokens=1024,
                do_sample=False,
                num_beams=3,
            )

            results = processor.batch_decode(generated_ids, skip_special_tokens=False)[0]

            # Strip only the <s> and </s> special tokens before outputting through the pin,
            # so the other functions that rely on the remaining task tokens keep working.
            cleaning = results.replace('</s>', '').replace('<s>', '')

            if len(results) > 0:
                cleaning += "\n\n"  # blank line between entries

            out_results.append(cleaning)
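The same cleanup can be written more compactly as a single regex that removes only the BOS/EOS markers while leaving Florence-2's other special tokens (e.g. `<loc_...>`) intact for downstream parsing. This is just a sketch with a hypothetical helper name, not the node's actual code:

```python
import re

def strip_bos_eos(text: str) -> str:
    # Remove only the <s> (BOS) and </s> (EOS) markers; other special
    # tokens such as <loc_123> are kept for the post-processing functions.
    return re.sub(r"</?s>", "", text)

print(strip_bos_eos("<s>a photo of a cat</s>"))  # → a photo of a cat
print(strip_bos_eos("<s><loc_5>dog</s>"))        # → <loc_5>dog
```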

Also, at the very end:

        if not keep_model_loaded:
            print("Offloading model...")
            model.to(offload_device)

        if out_results:
            out_results[-1] = out_results[-1].rstrip()  # get rid of extra newlines on the very last entry
        return (out_tensor, out_mask_tensor, out_results,)

Screenshots of the various modes working (only needed one caption type to show it working): [screenshots omitted]

I went back and redid some of it once I learned it was having issues with batch captions. It should now work with all the options and with batched images as well.

RandomGitUser321 commented 1 week ago

Redid it to work with batched images: [screenshots omitted]

kijai commented 1 week ago

Hey, thanks, you are right and this was on my to-do list. With batches, though, I want the captions output as a list, so we can use them with nodes that understand string lists, for example: [screenshot omitted]
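The list-per-image behavior described above can be sketched roughly like this (the helper name and exact cleanup are hypothetical; the node's real implementation differs):

```python
def clean_captions(raw_outputs):
    # One cleaned caption per batch image, returned as a Python list so
    # downstream nodes that understand string lists get one entry each,
    # instead of a single newline-joined string.
    return [s.replace("<s>", "").replace("</s>", "").strip() for s in raw_outputs]

print(clean_captions(["<s>a dog on grass</s>", "<s>a red car</s>"]))
# → ['a dog on grass', 'a red car']
```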

I've pushed this now.

RandomGitUser321 commented 1 week ago

Awesome, good idea! And thanks for getting Florence-2 to work on Windows, by the way. I spent hours trying to get it working with just a plain Python script and failed. I really should see how you bypassed the damn flash-attention requirement, because it's become a big problem lately with a lot of different things I've been messing with (Lumina being another one you got working).

Zanedname commented 1 week ago

Oh my God, I am just loving this person named Kijai so much.

kijai commented 1 week ago

> Awesome, good idea! And thanks for getting Florence-2 to work with Windows by the way. I spent hours trying to get it to work with just a plain python script and failed. I really should see how you bypassed the damn flash-attention because it's become a large problem lately with a lot of different things I've been messing with (Lumina was one that you also got working as well).

With Lumina it's pretty much just necessary; it's more than twice as fast as SDP attention there. For these LLMs it seems unnecessary, and the problem was that the original code didn't even try to run without flash_attn due to a bug, so the bypass only fixes that.