gokayfem / ComfyUI_VLM_nodes

Custom ComfyUI nodes for Vision Language Models, Large Language Models, Image to Music, Text to Music, Consistent and Random Creative Prompt Generation
Apache License 2.0

[Feature Request] Support GBNF grammar #49

Closed: nomadoor closed this issue 8 months ago

nomadoor commented 8 months ago

I learned that llama-cpp has an option to specify a GBNF grammar format.

The ability to specify formats precisely in this way, rather than through prompts, is very appealing, especially since I’m using VLM/LLM for conditional branching and as a tagger.

I would appreciate it if this could be added as an option on the LLaVA/LLM Sampler.
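For reference, in llama-cpp-python the grammar option looks roughly like this (a minimal sketch; the model path and prompt are placeholders, not anything from this repo):

from llama_cpp import Llama, LlamaGrammar

# constrain generation to "yes" or "no" with a GBNF grammar
grammar = LlamaGrammar.from_string('root ::= "yes" | "no"')

llm = Llama(model_path="model.gguf")  # placeholder model path
out = llm("Are the person's eyes open? Answer yes or no.", grammar=grammar, max_tokens=4)
print(out["choices"][0]["text"])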

Thank you.

gokayfem commented 8 months ago

We actually have a better way to get structured outputs; I was thinking about this. We have llama-cpp-agent, which uses pydantic to get structured outputs and supports every kind of functionality. It is already installed inside VLM nodes and I use it in some nodes, like the Suggester and KeywordExtraction nodes. I might add another node that lets users specify what kind of output they want. Like this:

Type: List, Description: (furniture in the picture), etc.

https://github.com/Maximilian-Winter/llama-cpp-agent
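As a rough illustration (my own sketch, not code that is already in the repo), a spec like that could map to a pydantic model along these lines:

from typing import List

from pydantic import BaseModel, Field

class Analysis(BaseModel):
    # hypothetical field; the agent uses the field types and descriptions
    # to constrain the LLM's output to this structure
    furniture: List[str] = Field(..., description="furniture in the picture")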

I will think about this; I don't know if it will be used widely, though.

nomadoor commented 8 months ago

That’s interesting. I’ve been curious about the behavior of the Suggester node!

In ComfyUI, I feel that the LLM may be more useful as a somewhat luxurious classifier than for natural language dialogue.

I would be happy if you would consider adding it!

gokayfem commented 8 months ago

Also, in the meantime you can change the code inside suggest.py to get your desired outputs. You only need to change the field names and their descriptions, especially in the KeywordExtraction node. It's really easy to manipulate.

gokayfem commented 8 months ago

If you want to use the node as a classifier, you need to use the type Literal. Let's say you want to detect whether a person's eyes are open or closed. You just need to change the Analysis class at the top of the code to this; that way the KeywordExtraction node will give you only the eye condition of the person, as a JSON object.

from typing import Literal

from pydantic import BaseModel

class Analysis(BaseModel):
    eye_condition: Literal["closed", "open"]
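With that change, the node would return a JSON object along the lines of {"eye_condition": "open"} (an illustrative value).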

This is just a temporary quick fix; I need to create a more dynamic and usable node.

nomadoor commented 8 months ago

[screenshot: extraction result]

I tried the KeywordExtraction node for now, and it output a binary of 1 or 0, just as I expected! I will also try the Literal.

gokayfem commented 8 months ago

I need to add this directly to the LLavaSampler, like a LLavaKeywordExtraction; currently KeywordExtraction takes only text input.

nomadoor commented 8 months ago

Yes, that would greatly improve usability and versatility😍

gokayfem commented 8 months ago

When you think about this, it can do step-by-step thinking and call the LLM multiple times. I can say: extract keywords from the given description, create a different prompt for each keyword, divide this 1024x1024 canvas by the extracted keyword count, assign each region coordinates plus height and width, and give me back the coordinates and prompts in this structured format. We can use that output as a regional prompter, for example. That would be a great creative use case. I saw a similar process to this.
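A structured-output model for that idea could look roughly like this (just a sketch; the field names are my own assumptions, not something that already exists in VLM nodes):

from typing import List

from pydantic import BaseModel, Field

class Region(BaseModel):
    prompt: str = Field(..., description="prompt for this area")
    x: int = Field(..., description="left edge in pixels")
    y: int = Field(..., description="top edge in pixels")
    width: int
    height: int

class CanvasLayout(BaseModel):
    # one region per extracted keyword, together covering the 1024x1024 canvas
    regions: List[Region]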

nomadoor commented 8 months ago

RPG-DiffusionMaster is doing exactly that. No one in ComfyUI has been able to implement it yet, but it’s exciting to think that it could be realized so simply!

gokayfem commented 8 months ago

I will try it using VLM nodes, at least the part that gets the coordinates and prompts. The other parts can be adapted using other custom_nodes.

gokayfem commented 8 months ago

[screenshot of the experiment]

It doesn't look that hard. I made an experiment and it works; I just need to adapt it to create nicer prompts and better regions.

gokayfem commented 8 months ago

[generated image]

This is the result.

[screenshot of the region/prompt output]

Currently it creates random areas on the canvas and makes a prompt for each area.

nomadoor commented 8 months ago

It’s perfect! The problem is how to pass this to ComfyUI… ConditioningSetArea, attention-couple, and regional prompt all require one node per area, so they can’t be combined well.

gokayfem commented 8 months ago

[workflow screenshot]

https://openart.ai/workflows/nondaa/put-two-person-together-automatically---autoprompting-regional-prompt-ipadapter-controlnet/1chHG0gh6pEy1sZvVE1j

This is my workflow; maybe this kind of workflow can do it.

gokayfem commented 8 months ago

I also discovered this node: https://github.com/mirabarukaso/ComfyUI_Mira. I think we can use its methods.

gokayfem commented 8 months ago

Also this:

[image: ComfyUI_00515_]

[attachment: workflow-Area_Cimposition_Examples.json]

nomadoor commented 8 months ago

Being able to create a mask by specifying an area is good, but in the end it’s connected to the regional prompt of the Impact Pack, so the need for N nodes to set the conditioning remains unchanged… hmm… We need something like Fizz Nodes for area specification.

gokayfem commented 8 months ago

[screenshot of the coordinate output]

I think the x,y values are a little bit weird and I need to think about their positions, but otherwise this is pretty okay for me.
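One way to keep the x,y values sane would be to clamp each region to the canvas before handing it to the area nodes (a plain-Python sketch, assuming top-left-origin pixel coordinates and a 1024x1024 canvas):

def clamp_region(x, y, width, height, canvas=1024, step=8):
    # snap to an 8-pixel grid (the latent resolution) and keep the box inside the canvas
    x = max(0, min(x - x % step, canvas - step))
    y = max(0, min(y - y % step, canvas - step))
    width = max(step, min(width - width % step, canvas - x))
    height = max(step, min(height - height % step, canvas - y))
    return x, y, width, height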

gokayfem commented 8 months ago

I need a reasoning step in the middle for how it should divide the canvas, with something like the webui format, the 1,1,2;1,2,4,6 kind of formatting. That way it would be easy to add this to the Mira nodes. It should understand from the meaning of the keyword where to put it: sky above, sea below, person left or right.
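That reasoning step could itself be a structured output: first ask the model for a coarse placement per keyword, then convert those placements to the row/column format internally (again just a sketch with assumed names):

from typing import List, Literal

from pydantic import BaseModel

class Placement(BaseModel):
    keyword: str
    # coarse position the LLM can infer from meaning ("sky" -> top, "sea" -> bottom, ...)
    position: Literal["top", "bottom", "left", "right", "center"]

class LayoutPlan(BaseModel):
    placements: List[Placement]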

nomadoor commented 8 months ago

It took me some time to understand this format; it’s not very intuitive… Personally, I feel a format like CSS grid is simpler and also easier for an LLM to understand, but then it can’t use the Mira nodes.

gokayfem commented 8 months ago

Yes, it's very confusing; I didn't even understand it for like 20 minutes. But if it can be done automatically, people won't notice it, I think. It can be done internally.

nomadoor commented 8 months ago

As you say, as long as the LLM can understand it, it’s not a problem that needs to be prioritized. However, how should we instruct this format…

gokayfem commented 8 months ago

[screenshot of the new node]

[screenshot of its output]

I made this kind of node. Is this what you wanted in the first place?

[screenshot]

I also added an option to pick answers from selected categories.

[screenshot]

nomadoor commented 8 months ago

Wow! This is exactly what I’ve been looking for! It’s amazing!

gokayfem commented 8 months ago

I've added this to the repository; you can update the nodes.

[screenshot: Structured Output node]

The node is called Structured Output.

nomadoor commented 8 months ago

This is very easy to use! Thank you. Are you planning to support LLaVA? It would be more versatile to implement it as an option in the LLM/LLaVA Sampler, rather than creating a new node…

gokayfem commented 8 months ago

Currently it is not available for llava models. I will try to make it work, but llama-cpp-agent doesn't support multimodal structured outputs for now.

nomadoor commented 8 months ago

I see, that's how it is. For the time being, it should be fine to pass the output of LLaVA to StructuredOutput.