I think this is a nice idea. We can implement this as a type of resolve, actually. The current resolve operator takes in the following:
A semantic cluster-based resolve would require the following parameters:
So I played around with hierarchical clustering with llama-index before, using scikit-learn's AgglomerativeClustering as the clusterer. There, you can get all clusters at all levels as a tree, including the (pairwise) cluster distances at every level. I then ran a recursive resolution prompt through an LLM to generate labels for all clusters. In fact, I generated both labels and descriptions, and used the descriptions (not the labels) as input to the next level's resolution prompt to generate that level's description and label, since labels can be too short to uniquely identify the concept. It would be nice to represent all of that somehow.
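Roughly what that looked like, as a minimal sketch: the embeddings are assumed to already be computed, `label_cluster` and `build_labeled_tree` are hypothetical names standing in for the recursive resolution prompt and its driver, and the metric/linkage choices are just illustrative, not anything from the codebase:

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering


def label_cluster(child_entries):
    """Placeholder for the recursive resolution prompt: takes the child
    (title, description) pairs and returns a (title, description) pair for
    the merged cluster. In practice this would be an LLM call."""
    raise NotImplementedError


def build_labeled_tree(embeddings, leaf_titles, leaf_descriptions):
    # n_clusters=None + distance_threshold=0 makes sklearn build the full
    # merge tree instead of cutting it at a fixed k.
    # (Older sklearn versions call `metric` `affinity`.)
    clusterer = AgglomerativeClustering(
        n_clusters=None, distance_threshold=0,
        metric="cosine", linkage="average",
    )
    clusterer.fit(np.asarray(embeddings))

    n_leaves = len(embeddings)
    # Leaves are nodes 0..n_leaves-1; merge step i creates node n_leaves + i.
    titles = list(leaf_titles)
    descriptions = list(leaf_descriptions)

    # children_[i] holds the two nodes merged at step i, distances_[i] the
    # merge distance. Iterating in order is already bottom-up, so the child
    # descriptions are always available when we label the parent.
    for left, right in clusterer.children_:
        title, description = label_cluster(
            [(titles[left], descriptions[left]),
             (titles[right], descriptions[right])]
        )
        titles.append(title)
        descriptions.append(description)

    return titles, descriptions, clusterer.children_, clusterer.distances_
```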
I was thinking it would create something like:
```yaml
cluster_on: "{{title}} - {{description}}"  # input to the embedding model
output_key: categories
resolution_prompt: |
  Summarize the following descriptions of a concept into a single description and also provide a short title:
  {% for entry in inputs %}
  {{ entry.title }}: {{ entry.description }}
  {% endfor %}
```
Example output:

```json
{
  "title": "Zebra",
  "description": "African equines with distinctive black-and-white striped coats. There are three living species: Grévy's zebra (Equus grevyi), the plains zebra (E. quagga),...",
  "categories": [
    {
      "title": "Equus",
      "description": "A genus of mammals in the family Equidae, which includes horses, asses...",
      "distance": 0.01
    },
    {
      "title": "Equidae",
      "description": "The horse family is the taxonomic family of horses and related animals, including the extant horses...",
      "distance": 0.05
    },
    ...
    {
      "title": "Perissodactyla",
      "description": "An order of ungulates. The order includes about 17 living species divided into three families: Equidae, Rhinocerotidae, and Tapiridae...",
      "distance": 0.12
    },
    ...
    {
      "title": "Eukaryote",
      "description": "Organisms whose cells have a membrane-bound nucleus...",
      "distance": 0.98
    }
  ]
}
```
I think the comparison prompt is kind of unnecessary, by the way: once the clusters are big enough, it's not going to do anything really useful (how would the LLM know whether these concepts are too far from each other at this particular clustering level?).
A nice thing about agglomerative clustering is that you do not have to provide a k: you get pairs of nodes as the lowest-level clusters, and those pairs are then merged pairwise, all the way to the top.
A bit similar in aim to Resolve, but without actually reducing the data:
Extension: Do this using hierarchical clustering, and add the path of clusters from the entry all the way to the top cluster encompassing the entire dataset, as an array of cluster names in a new field.
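As a rough illustration of that extension (building on the hypothetical `build_labeled_tree` sketch above, so the node numbering and `titles` list are assumptions, not anything from the codebase), the per-entry path could be extracted like this:

```python
def cluster_paths(children, titles, n_leaves):
    """Return, for each entry, the list of cluster titles from its immediate
    cluster up to the root cluster covering the entire dataset."""
    # Map every node to the merge node that contains it. Merge step i
    # creates node n_leaves + i from the two nodes in children[i].
    parent = {}
    for i, (left, right) in enumerate(children):
        parent[left] = n_leaves + i
        parent[right] = n_leaves + i

    paths = []
    for leaf in range(n_leaves):
        path, node = [], leaf
        while node in parent:          # walk upward until we reach the root
            node = parent[node]
            path.append(titles[node])  # titles are indexed by node id, leaves first
        paths.append(path)
    return paths


# Hypothetical usage: attach the path as a new field on each entry.
# titles, descriptions, children, distances = build_labeled_tree(...)
# for entry, path in zip(entries, cluster_paths(children, titles, len(entries))):
#     entry["categories"] = path
```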