oobabooga / text-generation-webui

A Gradio web UI for Large Language Models.
GNU Affero General Public License v3.0

I think some are confused about the goals of the extension, and I really just uploaded what was needed to set it up without going into depth on what the ideal state actually looks like, so let me explain. Keep in mind I'm cutting a lot out and just shitting out what's in my head. #3012

Closed · QHYAHFY closed this issue 1 year ago

QHYAHFY commented 1 year ago
I think some are confused about the goals of the extension, and I really just uploaded what was needed to set it up without going into depth on what the ideal state actually looks like, so let me explain. Keep in mind I'm cutting a lot out and just shitting out what's in my head.

The ability to use an embedding store as a database of text is not new. It has been around for a long time, and it is a simple concept. Cosine similarity allows us to take an input text, get its embeddings, and use those embeddings to find relevant text information in an embeddings store. If that were all this extension was going to be -- but for context -- I wouldn't have even bothered to write the code. You can do this with Langchain out-of-the-box with some simple steps. But that is not the only thing I intend to do here.
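To make that baseline concrete, here is a minimal sketch of plain cosine-similarity retrieval. It assumes the sentence-transformers library as the embedding backend; the model name, chunks, and `retrieve` function are illustrative, not part of the extension.

```python
# Minimal sketch of embedding-store retrieval via cosine similarity.
# Assumes sentence-transformers; the extension's actual backend, model,
# and storage format may differ.
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

# A toy "store": text chunks and their precomputed, normalized embeddings.
chunks = [
    "Preheat the oven to 200C before roasting the vegetables.",
    "The navigation bar links to Recipes, Articles, and About pages.",
    "Our most popular recipes this week: lasagna, pad thai, shakshuka.",
]
chunk_vecs = model.encode(chunks, normalize_embeddings=True)

def retrieve(query: str, top_k: int = 2) -> list[str]:
    """Return the top_k chunks most similar to the query."""
    q = model.encode([query], normalize_embeddings=True)[0]
    scores = chunk_vecs @ q                     # cosine similarity (unit vectors)
    best = np.argsort(scores)[::-1][:top_k]     # highest-scoring indices first
    return [chunks[i] for i in best]

print(retrieve("What recipes do you have?"))
```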

LLMs are notoriously bad at picking out the finer bits of context to use when generating the output. It is difficult to get them to comply with minute instructions and small details in the input prompt. Vector stores help with that by letting us filter down the prompt to only the portions that are most relevant to the input, and they can hold much more data than the LLM can in its limited context size -- but they're still very lossy.

This is where Focus comes in. It is the actual goal of the extension, and I'm in the process of adding it. Focus is configurable reasoning.

What does that mean? The way vector stores are usually used is storing text information and retrieving it semantically -- using natural language. The retrieved text is injected into the prompt, and you're done with it until the next input comes along. But what if we stacked a layer in between the input and the vector store -- specifically, a layer that can be configured based on any data, not just the input prompt, and guides the retrieval of information from the vector store? Not a where filter, but something more complex.

What we could do is use a simple list of embeddings that themselves are stored in a vector store. We compare the input embeddings and/or any other miscellaneous data to retrieve the top 1 result from the list. The result is a string, but it can be mapped to anything, and we can use the mapping to further configure how, what, or even when we retrieve data from the context store.
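As a rough illustration of that in-between layer, here is a hypothetical sketch: a short list of operator phrases is itself embedded, the input is matched against it, and the top-1 operator maps to a handler that decides how the context store gets queried. All names here (OPERATORS, the handlers) are made up for illustration and are not the extension's API.

```python
# Hypothetical sketch of a Focus layer: embed the operator list, let the input
# pick the top-1 operator, and let that operator's handler drive retrieval.
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

def handle_list(user_input: str, source: str) -> str:
    # e.g. split the source into labeled sections and return the relevant one
    return source

def handle_summarize(user_input: str, source: str) -> str:
    # e.g. return a condensed view of the source
    return source[:500]

OPERATORS = {
    "list": handle_list,
    "summarize": handle_summarize,
}
op_names = list(OPERATORS)
op_vecs = model.encode(op_names, normalize_embeddings=True)

def focus(user_input: str, source: str) -> str:
    """Pick the top-1 operator for this input and let it guide retrieval."""
    q = model.encode([user_input], normalize_embeddings=True)[0]
    top = op_names[int(np.argmax(op_vecs @ q))]
    return OPERATORS[top](user_input, source)
```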

Think of this scenario: you have a source added for the homepage of Food.com or some other recipe site. You ask the model, "What kinds of recipes can you help me make?". In the Focus store, we have a list that contains some operators that may or may not be correlated with input intent, including the operation "list". Using the input prompt, we fetch the top 1 result from our Focus operators, and it comes back with the "list" operation. We mapped "list" to a function that parses the webpage and formats it into distinct sections of links and link labels; the function then runs that list of sections through another distance calculation to get appropriate headings for each section (so the site directory gets the heading "navigation", and the section with the recipes is marked with "recipe").

Rather than returning the raw data from the context store, we instead return this transformed version, and by running a calculation again on the input and this new data, we would get back the chunk that contains the recipe list. The LLM now only sees the most relevant information from the site, so it should have no problem answering the original question.
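Continuing the recipe example, a hypothetical "list" handler might look like the sketch below: label each page section by its nearest heading embedding, then run the second distance calculation against the input and return only the recipe chunk. The HEADINGS list and the assumption that the page is already split into sections are illustrative.

```python
# Illustrative "list" handler for the recipe example. The HEADINGS list and the
# pre-split sections are assumptions, not part of the extension.
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")
HEADINGS = ["navigation", "recipe"]
heading_vecs = model.encode(HEADINGS, normalize_embeddings=True)

def handle_list(user_input: str, sections: list[str]) -> str:
    # Label each section with its nearest heading ("navigation", "recipe", ...).
    sec_vecs = model.encode(sections, normalize_embeddings=True)
    labels = [HEADINGS[int(np.argmax(heading_vecs @ v))] for v in sec_vecs]
    labeled = [f"{label}:\n{text}" for label, text in zip(labels, sections)]
    # Second distance calculation: return the labeled chunk closest to the input.
    q = model.encode([user_input], normalize_embeddings=True)[0]
    lab_vecs = model.encode(labeled, normalize_embeddings=True)
    return labeled[int(np.argmax(lab_vecs @ q))]
```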

That is the power of Focus -- we can program the reasoning of the LLM. Not directly at the model level, but by limiting what it sees in the context. We can control how the LLM "reasons" by controlling what it can see with simple operations, giving us finer control over the output of the model.

Let's look at another example from an anon on /lmg/:

So for example, if a character pocketed a gun early on, and you ask if they have a gun, it might remember that. But they might be in a situation where a gun is called for, and not use it.

Focus operators will solve this issue.

Imagine in the future we have some sort of giant "Personality Compendium", just a huge TXT where each section is labeled with a personality trait, and some characteristics of that trait are defined as focus operations. We have a Character Focus layer set up that pulls from the compendium, the character personality definition file, and the character's memory file -- note that we can have multiple Focus layers pulling from different combinations of files/sources, and each Focus layer can have multiple steps. One of the compendium sections -- "survivalist" -- contains the traits for hardy characters and has a label-type focus operation that looks like this (spitballing on the syntax): ambience: dangerous?: memory.weapon

This has a top-level label "ambience", a sub-label "dangerous", and a final label "weapon". The intent is to tell the Focus layer that when the ambience of the scene is dangerous, it should find all context chunks in the character memory bucket that relate to "weapon". We can even add a recency bias and the character name if we want some automagic formatting behind the scenes (spitballing on the syntax): ambience: dangerous?: memory.weapon +{name} +bias:time

This says: if the ambience is dangerous, search all chunks that relate to "weapon" + the character name in this character's memory file, with a bias towards more recent information.
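Since the syntax above is explicitly spitballed, any parser is equally hypothetical; the sketch below just shows how such an operator string could decompose into a condition ("ambience" is "dangerous") and a retrieval spec (memory bucket, topic, extra query terms, bias). It assumes Python 3.10+ for the type hints.

```python
# Hypothetical parser for the spitballed operator syntax above.
# The grammar is an assumption, not a spec.
from dataclasses import dataclass, field

@dataclass
class FocusOperator:
    label: str                  # e.g. "ambience"
    condition: str              # e.g. "dangerous"
    bucket: str                 # e.g. "memory"
    topic: str                  # e.g. "weapon"
    extra_terms: list[str] = field(default_factory=list)  # e.g. ["{name}"]
    bias: str | None = None                               # e.g. "time"

def parse_operator(text: str) -> FocusOperator:
    # "ambience: dangerous?: memory.weapon +{name} +bias:time"
    head, _, tail = text.partition("?:")
    label, condition = (part.strip() for part in head.split(":", 1))
    target, *modifiers = tail.split("+")
    bucket, topic = target.strip().split(".", 1)
    extra, bias = [], None
    for mod in (m.strip() for m in modifiers):
        if mod.startswith("bias:"):
            bias = mod.removeprefix("bias:")
        else:
            extra.append(mod)
    return FocusOperator(label, condition, bucket, topic, extra, bias)

op = parse_operator("ambience: dangerous?: memory.weapon +{name} +bias:time")
# FocusOperator(label='ambience', condition='dangerous', bucket='memory',
#               topic='weapon', extra_terms=['{name}'], bias='time')
```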

A situation occurs, some text is generated, and we run it through the Character Focus layer. We first do an embedding search for the output text + the word "ambience", with a defined list of outputs, one of them being "dangerous". We could even have this step further guided by the character's personality -- maybe they have a bias towards perceiving situations as dangerous, giving it a stronger weight. Remember -- this is all configurable. If we get "dangerous" as the top 1 result, it matches the focus operator of our survivalist character, so we now do another embedding search for the word "weapon" + the character name, with a bias towards more recent chunks in the memory bucket. We get 3 matches, one of which might reference the character having a weapon.

We then inject the retrieved chunks into the context. Blah blah blah, other chunks are generated via other operators and also added to the pool. Now, when the character's response is generated, the model only sees the personality traits that apply to the character, the situation at hand, and the fact that they have a weapon, along with other information. Because the context has been massaged, the LLM now has the optimal information for playing the character's role.
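A rough end-to-end sketch of that two-step flow is below. The weighting scheme, the recency curve, and the example data are all assumptions made for illustration; nothing here is the extension's actual design.

```python
# Rough sketch of the scenario above: classify the scene's ambience (optionally
# weighted by personality), then, if it is dangerous, search the character's
# memory for "weapon" + name with a simple recency bias. All details are assumed.
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")
AMBIENCES = ["calm", "dangerous", "festive"]
amb_vecs = model.encode(AMBIENCES, normalize_embeddings=True)

def classify_ambience(scene_text: str, danger_bias: float = 1.0) -> str:
    """Step 1: embed the generated text + 'ambience' and pick the top label.
    danger_bias > 1.0 models a character prone to reading scenes as dangerous."""
    q = model.encode([scene_text + " ambience"], normalize_embeddings=True)[0]
    scores = amb_vecs @ q
    scores[AMBIENCES.index("dangerous")] *= danger_bias
    return AMBIENCES[int(np.argmax(scores))]

def recall_memories(name: str, topic: str, memory: list[str], top_k: int = 3) -> list[str]:
    """Step 2: search memory chunks for topic + name; later chunks in the list
    count as more recent and get a slightly higher weight (assumed bias curve)."""
    q = model.encode([f"{topic} {name}"], normalize_embeddings=True)[0]
    mem_vecs = model.encode(memory, normalize_embeddings=True)
    recency = np.linspace(0.8, 1.0, num=len(memory))
    scores = (mem_vecs @ q) * recency
    return [memory[i] for i in np.argsort(scores)[::-1][:top_k]]

if classify_ambience("A stranger blocks the alley, knife glinting.", 1.2) == "dangerous":
    chunks = recall_memories("Mira", "weapon", [
        "Mira grew up on a farm.",
        "Mira pocketed a small revolver at the market.",
        "Mira prefers tea over coffee.",
    ])
    # These chunks would then be injected into the prompt alongside the
    # personality traits pulled in by other Focus operators.
```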

In the ideal state, users would just drop in a simple character card and a compendium and their character would stay true to their personality, no matter what. Regardless of context size, regardless of memory limitations, regardless of the inherent reasoning capabilities of the underlying model.

Do I have a personality compendium on hand? No. Do I know what the syntax will look like? No. But that's not the point. This would allow you to program the output of the model without having to find the perfect dataset, the best finetune, or the best generation settings, retrain a model, or wait for a model with 1 bajillion context that can't fit onto 24 GB of VRAM.

Originally posted by @kaiokendev in https://github.com/oobabooga/text-generation-webui/issues/1548#issuecomment-1524610849

github-actions[bot] commented 1 year ago

This issue has been closed due to inactivity for 30 days. If you believe it is still relevant, please leave a comment below.