We actually have a better way to get structured outputs; I was thinking about this. We have llama-cpp-agent, which uses Pydantic to get structured outputs and supports all kinds of functionality. It is already installed inside VLM Nodes, and I use it in some nodes like the Suggester and KeywordExtraction nodes. I might add another node that lets users specify what kind of outputs they want. Like this:
Type: List, Description: (furniture in the picture), etc.
https://github.com/Maximilian-Winter/llama-cpp-agent
I will think about this; I don't know if it will be used widely, though.
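For reference, the basic pattern llama-cpp-agent uses looks roughly like the sketch below. This is loosely based on that repository's README and may not match newer versions; the model path and the RoomContents model are placeholders for illustration, not part of VLM Nodes:

from typing import List
from llama_cpp import Llama
from llama_cpp_agent.structured_output_agent import StructuredOutputAgent
from pydantic import BaseModel, Field

# Hypothetical output model matching the "Type: List, Description: ..." idea above.
class RoomContents(BaseModel):
    furniture: List[str] = Field(..., description="Furniture in the picture")

llm = Llama(model_path="model.gguf")  # placeholder path to a local GGUF model
agent = StructuredOutputAgent(llm)

# Returns a validated RoomContents instance built from the model's JSON output.
print(agent.create_object(RoomContents, "A sofa, two chairs and a floor lamp."))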
That’s interesting. I’ve been curious about the behavior of the Suggester node!
In ComfyUI, I feel that an LLM may be more useful as a somewhat extravagant classifier than for natural-language dialogue.
I would be happy if you would consider adding it!
Also, in the meantime you can change the code inside suggest.py to get your desired outputs. You only need to change the field names and their descriptions, especially in the KeywordExtraction node. It's really easy to manipulate.
If you want to use the node as a classifier, you need to use the Literal type. Let's say you want to detect whether a person's eyes are open or closed. You just need to change the Analysis class at the top of the code to this. This way the KeywordExtraction node will give you only the person's eye condition, as a JSON object.
from typing import Literal
from pydantic import BaseModel

class Analysis(BaseModel):
    eye_condition: Literal["closed", "open"]
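With that change, a run on a portrait with open eyes should come back as a single JSON object along these lines (illustrative output, not an exact transcript):

{"eye_condition": "open"}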
This is just a temporary quick fix; I need to create a more dynamic and usable node.
I tried the KeywordExtraction node for now, and it output a binary 1 or 0, just as I expected! I will also try Literal.
I need to add this directly to the LLavaSampler, something like an LLavaKeywordExtraction; currently KeywordExtraction takes only text input.
Yes, that would greatly improve usability and versatility😍
When you think about this, it can do step-by-step thinking and can call the LLM multiple times. I could say: extract keywords from the given description, create a different prompt for each keyword, divide a 1024x1024 canvas by the number of extracted keywords, assign each one coordinates plus a width and height, and give me back the coordinates and prompts in this structured format. We could use this output as a regional prompter, for example. That would be a great creative use case. I have seen a similar process to this.
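A minimal sketch of what a structured output schema for that idea could look like; the field names here are hypothetical, chosen for illustration rather than taken from the repository:

from typing import List
from pydantic import BaseModel

# One prompt per canvas region; coordinates are in pixels.
class Region(BaseModel):
    keyword: str
    prompt: str
    x: int
    y: int
    width: int
    height: int

class CanvasLayout(BaseModel):
    canvas_width: int
    canvas_height: int
    regions: List[Region]

Fed to something like the StructuredOutputAgent sketched above, this would constrain the LLM to return a region list that downstream nodes could consume directly.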
RPG-DiffusionMaster is doing exactly that. No one in ComfyUI has been able to implement it yet, but it’s exciting to think that it could be realized so simply!
I will try it using VLM Nodes, at least the part that gets coordinates and prompts. The other parts can be adapted using other custom_nodes.
It doesn't look that hard. I made an experiment and it works; I just need to adapt it to create nicer prompts and better regions.
This is the result.
Currently it creates random areas on the canvas and makes a prompt for each area.
It's perfect! The problem is how to pass this to ComfyUI… ConditioningSetArea, attention-couple, and regional prompt all require one node per area, so they can't be combined easily.
This is my workflow; maybe this kind of workflow can do it.
I also discovered this node: https://github.com/mirabarukaso/ComfyUI_Mira. I think we can use its methods.
workflow-Area_Cimposition_Examples.json
Also this:
Being able to create a mask by specifying an area is good, but in the end it connects to the Impact Pack's regional prompt, so the need for N nodes to set the conditioning remains unchanged… hmm. We need something like Fizz Nodes for area specification.
I think the x,y values are a little weird; I need to think about their positions, but otherwise this is pretty okay for me.
I need a reasoning step in the middle for how the canvas should be divided, using something like the webui format (1,1,2;1,2,4,6). That way it would be easy to feed this into the Mira nodes. It should understand from the meaning of each keyword where to put it: sky above, sea below, person to the left or right.
It took me some time to understand this format; it's not very intuitive… Personally, I feel a format like CSS grid is simpler and also easier for an LLM to understand, but then it couldn't use the Mira nodes.
Yes, it's very confusing; I didn't understand it for a good 20 minutes. But if it can be done automatically, I don't think people will notice it. It can be done internally.
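For what it's worth, converting such a ratio string into pixel boxes internally could look like the sketch below. It assumes one plausible reading of the format (rows separated by ';', with the first number in each row as the row's height weight and the rest as column width weights); check the Regional Prompter / Mira documentation before relying on this interpretation:

def ratio_string_to_boxes(spec: str, width: int = 1024, height: int = 1024):
    # "1,1,2;1,2,4,6" -> rows of weights, e.g. [[1, 1, 2], [1, 2, 4, 6]]
    rows = [[float(v) for v in row.split(",")] for row in spec.split(";")]
    total_h = sum(r[0] for r in rows)
    boxes, y = [], 0.0
    for r in rows:
        row_h = height * r[0] / total_h
        col_weights = r[1:] or [1.0]  # a row with no column weights spans the full width
        total_w = sum(col_weights)
        x = 0.0
        for w in col_weights:
            col_w = width * w / total_w
            boxes.append((round(x), round(y), round(col_w), round(row_h)))
            x += col_w
        y += row_h
    return boxes

print(ratio_string_to_boxes("1,1,2;1,2,4,6"))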
As you say, as long as the LLM can understand it, it's not a problem that needs to be prioritized. Still, how should we instruct it to use this format…
I made this kind of node. Is this what you wanted in the first place?
I added an option so that answers are picked from selected categories.
Wow! This is exactly what I’ve been looking for! It’s amazing!
I've added this to the repository; you can update the nodes.
The node is called Structured Output.
This is very easy to use! Thank you. Are you planning to support LLaVA? It would be more versatile to implement it as an option in the LLM/LLaVA Sampler, rather than creating a new node…
Currently it is not available for LLaVA models. I will try to make it work, but llama-cpp-agent doesn't support multimodal structured outputs for now.
I see, that's how it is. For the time being, it should be fine to pass the output of LLaVA to StructuredOutput.
I learned that llama-cpp has an option to specify a GBNF grammar format.
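As a sketch of what that option could look like, llama-cpp-python exposes GBNF through LlamaGrammar; the model path below is a placeholder and the grammar is a deliberately tiny example:

from llama_cpp import Llama, LlamaGrammar

llm = Llama(model_path="model.gguf")  # placeholder path to a local GGUF model

# A GBNF grammar that only allows the two answers "open" or "closed".
grammar = LlamaGrammar.from_string('root ::= "open" | "closed"')

out = llm(
    "Are the person's eyes open or closed? Answer with one word.",
    grammar=grammar,
    max_tokens=8,
)
print(out["choices"][0]["text"])  # constrained to "open" or "closed"

The appeal over prompting alone is that the sampler literally cannot emit tokens outside the grammar.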
The ability to specify formats precisely in this way, rather than through prompts, is very appealing, especially since I’m using VLM/LLM for conditional branching and as a tagger.
I would appreciate it if this could be added as an option to the LLaVA/LLM Sampler.
Thank you.