kijai / ComfyUI-Florence2

Inference Microsoft Florence2 VLM
MIT License

[Discussion] Some Observations on Florence2 #25

Open dnl13 opened 3 months ago

dnl13 commented 3 months ago

Hello, I am testing Florence2 on a fork of another repo, made by spacepxl: https://github.com/spacepxl/ComfyUI-Florence-2

I chose this repository because the approach with the XY coordinates, etc., can be used more comfortably later for inpainting/upscale purposes. It would be great if you could both collaborate on a single Florence-2 repository; this would surely help pool resources more effectively. @spacepxl @kijai

Regarding issue https://github.com/kijai/ComfyUI-Florence2/issues/11: I noticed that the lines appear when the model stops outputting coordinates for the mask beyond a certain point. For example, I was sometimes able to increase max_tokens beyond 1024, and then the mask continued to be drawn; however, Comfy often crashes when the value exceeds 1024. Unfortunately, I have not yet found a way to "simplify" the masks from the model.

You can observe this behavior by taking an image with a small subject and using the prompt hair with the task <REFERRING_EXPRESSION_SEGMENTATION> at 1024 tokens, which yields the full mask. If you then significantly reduce max_tokens, you will see that the mask eventually gets cut off.
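For reference, here is roughly how that task is invoked via the HuggingFace transformers API as shown on the model card (values like num_beams=3 are the card's defaults, not necessarily this node's settings):

```python
import torch
from PIL import Image
from transformers import AutoModelForCausalLM, AutoProcessor

# Load Florence-2 as on the model card (trust_remote_code is required).
model = AutoModelForCausalLM.from_pretrained(
    "microsoft/Florence-2-large", torch_dtype=torch.float16, trust_remote_code=True
).to("cuda")
processor = AutoProcessor.from_pretrained(
    "microsoft/Florence-2-large", trust_remote_code=True
)

image = Image.open("input.png").convert("RGB")
task = "<REFERRING_EXPRESSION_SEGMENTATION>"
prompt = task + "hair"

inputs = processor(text=prompt, images=image, return_tensors="pt").to("cuda", torch.float16)
generated_ids = model.generate(
    input_ids=inputs["input_ids"],
    pixel_values=inputs["pixel_values"],
    max_new_tokens=1024,  # the value discussed above; lower values truncate the coordinate stream
    do_sample=False,
    num_beams=3,
)
text = processor.batch_decode(generated_ids, skip_special_tokens=False)[0]
# Returns {"<REFERRING_EXPRESSION_SEGMENTATION>": {"polygons": [...], "labels": [...]}}
result = processor.post_process_generation(text, task=task, image_size=image.size)
```

Since the mask polygons are decoded token by token, a max_new_tokens cap cuts the coordinate list mid-polygon, which matches the truncated masks described above.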

I have no idea why the model outputs incomplete data, or whether this process can be improved by batch processing.

Currently, the models do not seem fully mature to me. Even the fine-tuned models provide fewer coordinates when you do manage to get the same mask. For example, when segmenting hair, the fine-tuned models often return the whole character instead of just the hair, whereas the non-fine-tuned model works right away. I tried using num_beams to get the hair mask with the fine-tuned model, but it's not very reliable.

Maybe this will help in troubleshooting.

I also noticed problems with elements like eyes, arms, or legs when multiple such parts are visible in the image. A workaround that has worked is using prompts like both eyes or left eye, right eye, left arm, right arm (see https://huggingface.co/microsoft/Florence-2-large/discussions/10).

However, this is far from satisfactory.

I get the impression that the model is not very good at body segmentation or not adequately trained for it.

What segmentation tasks would you mainly use the model for? Perhaps it could be fine-tuned for such tasks, although I have never done that myself.

Currently, I am working on detecting and masking multiple objects "simultaneously" using the tasks <REFERRING_EXPRESSION_SEGMENTATION> and <REGION_TO_SEGMENTATION>, e.g., mouth, hair, dress (see the sketch below).
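A minimal sketch of that multi-prompt idea: run the segmentation task once per expression and rasterize each polygon set to a binary mask. The helper run_florence is hypothetical shorthand for the generate/post-process call sketched earlier; the nested polygon layout follows the model card's output format.

```python
import numpy as np
from PIL import Image, ImageDraw

def masks_for_prompts(image, prompts):
    """Run <REFERRING_EXPRESSION_SEGMENTATION> once per expression and
    rasterize each polygon set into a binary mask.
    run_florence is a hypothetical wrapper around the generate +
    post_process_generation call shown above."""
    masks = {}
    for expr in prompts:
        result = run_florence(image, "<REFERRING_EXPRESSION_SEGMENTATION>", expr)
        polygons = result["<REFERRING_EXPRESSION_SEGMENTATION>"]["polygons"]
        mask = Image.new("L", image.size, 0)
        draw = ImageDraw.Draw(mask)
        for instance in polygons:      # one entry per detected instance
            for poly in instance:      # flat [x0, y0, x1, y1, ...] list
                pts = [(poly[i], poly[i + 1]) for i in range(0, len(poly) - 1, 2)]
                if len(pts) >= 3:
                    draw.polygon(pts, fill=255)
        masks[expr] = np.array(mask) > 0
    return masks

parts = masks_for_prompts(image, ["mouth", "hair", "dress"])
combined = np.logical_or.reduce(list(parts.values()))
```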

This is working relatively well, but I wanted to reconcile it with my previous pull request to spacepxl's repo before committing it.

Processing image batches is next on my to-do list.

Do you have any ideas on how we can improve our experience with Florence2 for ComfyUI together? Best regards, dnl

kijai commented 3 months ago

I have the node outputting square masks for the detected boxes already. I did think about bboxes too, but it really just depends on which node it needs to pair with.
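For illustration, a sketch of the square-mask idea under the assumption that boxes come as [x1, y1, x2, y2] pixel coordinates (the box format and mask shape here are assumptions, not taken from the node's actual code):

```python
import torch

def boxes_to_square_masks(bboxes, height, width):
    """Rasterize [x1, y1, x2, y2] boxes into a ComfyUI-style mask batch
    of shape (B, H, W) with 0/1 float values."""
    masks = torch.zeros((len(bboxes), height, width), dtype=torch.float32)
    for i, (x1, y1, x2, y2) in enumerate(bboxes):
        x1, y1 = max(int(x1), 0), max(int(y1), 0)
        x2, y2 = min(int(x2), width), min(int(y2), height)
        masks[i, y1:y2, x1:x2] = 1.0
    return masks
```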

The segmentation is indeed disappointing. I had the same thought of chaining the detections and using the region segmentation instead, but I'm afraid it still has the same issue with incomplete masks. I tested drawing just the points too, to be sure the problem isn't the drawing of the polygons. I have also tried adjusting max_tokens and other generic LLM parameters; the only observation I had was the same as yours: reducing max_tokens cuts the mask off earlier. Going past 1024 never crashed for me, but it didn't help either.
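That points-only test can be done with a few lines of PIL, a sketch along these lines (polygon layout as in the model card's output): plot only the returned vertices with no fill, so a truncated coordinate stream shows up as dots stopping partway around the shape rather than as a straight cut-off edge from polygon filling.

```python
from PIL import Image, ImageDraw

def draw_raw_points(image_size, polygons, radius=2):
    """Plot only the returned polygon vertices, without filling,
    to separate model truncation from rasterization artifacts."""
    canvas = Image.new("L", image_size, 0)
    draw = ImageDraw.Draw(canvas)
    for instance in polygons:
        for poly in instance:  # flat [x0, y0, x1, y1, ...] list
            for i in range(0, len(poly) - 1, 2):
                x, y = poly[i], poly[i + 1]
                draw.ellipse((x - radius, y - radius, x + radius, y + radius), fill=255)
    return canvas
```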

dnl13 commented 3 months ago

Very interesting observation. My thought was already to integrate a SAM model for mask generation behind the boxes, since the box detection is already considerably faster and more accurate than some other models I've used so far. But from my perspective, that might be going too far for this node.
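For what it's worth, wiring Florence-2 boxes into SAM would look roughly like this with the segment_anything package (checkpoint filename and box format are assumptions):

```python
import numpy as np
from segment_anything import sam_model_registry, SamPredictor

# Florence-2 supplies the boxes; SAM refines them into tight masks.
sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h_4b8939.pth")
predictor = SamPredictor(sam)

def sam_masks_from_boxes(image_rgb, bboxes):
    """image_rgb: HxWx3 uint8 array; bboxes: list of [x1, y1, x2, y2]."""
    predictor.set_image(image_rgb)
    masks = []
    for box in bboxes:
        mask, scores, _ = predictor.predict(box=np.array(box), multimask_output=False)
        masks.append(mask[0])  # (H, W) boolean mask
    return masks
```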

Yes, I can understand the concern with bounding boxes. I think a widely used library is the Impact Pack, whose detailers and upscalers can also accept bounding boxes and SEGs.

I delved into Dr.Lt.Data's code to see how SEGs are structured. This seems a worthwhile approach to me, especially since Dr.Lt.Data will also be part of the core team at comfy.org. I believe this type of approach could prevail in the future, considering how much is already included in a SEG (x, y, width, height, cropped image, etc.).
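For context, the SEG element in the Impact Pack is, from my reading of the source, roughly a namedtuple along these lines (field names and order should be verified against the current repo before depending on them):

```python
from collections import namedtuple

# Approximate shape of Impact Pack's SEG element; verify against the
# repo, as field names/order may differ or have changed since.
SEG = namedtuple(
    "SEG",
    ["cropped_image", "cropped_mask", "confidence", "crop_region",
     "bbox", "label", "control_net_wrapper"],
    defaults=[None],
)
```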

I think it would be beneficial to establish good connectivity with the other nodes in the Comfy universe at this point. Plain images are also fine. Just my two cents.

But thank you very much for your feedback, and it's a shame that the observations are similar. However, I am optimistic that Florence2 could establish itself; I find the masks it creates very good compared to alternatives like CLIPSeg.