ucbepic / docetl

A system for agentic LLM-powered data processing and ETL
https://docetl.org
MIT License

Cluster operation #48

Closed by redhog 1 month ago

redhog commented 1 month ago

A bit similar to Resolve in aim, but without actually reducing the data:

Extension: Do this using hierarchical clustering, and add the path of clusters from the entry all the way to the top cluster encompassing the entire dataset, as an array of cluster names in a new field.
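A minimal sketch of the "path of clusters" idea, assuming the hierarchy is given as a merge list in the style of scikit-learn's `children_` attribute (leaves are numbered `0..n-1`, and the i-th merge creates node `n + i`). The function name `cluster_paths` and the use of node ids instead of cluster names are hypothetical, chosen only for illustration:

```python
def cluster_paths(children, n_samples):
    """For each leaf, list the ids of the clusters that contain it,
    from its first merge up to the root cluster."""
    # Map every node to the cluster it was merged into.
    parent = {}
    for i, (left, right) in enumerate(children):
        node_id = n_samples + i
        parent[left] = node_id
        parent[right] = node_id

    paths = []
    for leaf in range(n_samples):
        path = []
        node = leaf
        # Walk upwards until we reach the root, which has no parent.
        while node in parent:
            node = parent[node]
            path.append(node)
        paths.append(path)
    return paths


# Four entries: 0 and 1 merge into node 4, 2 and 3 into node 5,
# and finally 4 and 5 merge into node 6 (the whole dataset).
children = [(0, 1), (2, 3), (4, 5)]
records = [{"title": t} for t in ["zebra", "horse", "rhino", "tapir"]]

# Store the path in a new field, without dropping or merging any record.
for record, path in zip(records, cluster_paths(children, len(records))):
    record["categories"] = path
```

In a real operator the ids would be swapped for the generated cluster names, but the data itself stays unreduced: every input record comes out again, with one extra field.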

shreyashankar commented 1 month ago

I think this is a nice idea. We can implement this as a type of resolve, actually. The current resolve operator takes in the following:

A semantic cluster-based resolve would require the following parameters:

redhog commented 1 month ago

So I have played around with hierarchical clustering with llama-index before, using scikit-learn's AgglomerativeClustering as the clusterer. There, you can get all clusters, at all levels, as a tree, including the (pairwise) cluster distances at each level. I then ran an LLM with a resolution prompt recursively to generate labels for all clusters. In fact, I generated both labels and descriptions, and used the descriptions, not the labels, as input to the next level's resolution prompt, which again generates both a description and a label (labels can be too short to uniquely identify the concept). It would be nice to represent all of that somehow.
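A rough sketch of that labelling pass, under the same assumption about the tree format (a merge list in the style of scikit-learn's `children_`, where the i-th merge creates node `n + i`). Because merges are listed bottom-up, a single loop gives the same result as a recursive walk. `label_tree` and `fake_summarize` are hypothetical names; `fake_summarize` only stands in for the LLM call with the resolution prompt:

```python
def label_tree(children, leaf_descriptions, summarize):
    """Generate a title and description for every cluster, bottom-up.

    Only descriptions are passed up to the next level; titles are kept
    as output but never used as input.
    """
    n = len(leaf_descriptions)
    # descriptions[node_id] holds the text used as input for that node.
    descriptions = list(leaf_descriptions)
    labels = {}
    for i, (left, right) in enumerate(children):
        result = summarize([descriptions[left], descriptions[right]])
        labels[n + i] = result
        # The new cluster's description becomes input for later merges.
        descriptions.append(result["description"])
    return labels


def fake_summarize(descriptions):
    # Placeholder for an LLM call; it just joins its inputs.
    joined = " + ".join(descriptions)
    return {"title": joined[:12], "description": f"({joined})"}


labels = label_tree([(0, 1), (2, 3), (4, 5)], ["a", "b", "c", "d"], fake_summarize)
```

Here `labels[6]` is built from the descriptions of nodes 4 and 5, not from their titles, which mirrors the reason given above for propagating descriptions.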

I was thinking it would create something like:

cluster_on: "{{title}} - {{description}}" # Input to embedding model
output_key: categories
resolution_prompt: |
  Summarize the following descriptions of a concept into a single description and also provide a short title:

  {% for entry in inputs %}
    {{ entry.title }}: {{entry.description}}
  {% endfor %}

Example output


{
  "title": "Zebra",
  "description": "African equines with distinctive black-and-white striped coats. There are three living species: Grévy's zebra (Equus grevyi), the plains zebra (E. quagga),...",
  "categories": [
    {
      "title": "Equus",
      "description": "A genus of mammals in the family Equidae, which includes horses, asses...",
      "distance": 0.01
    },
    {
      "title": "Equidae",
      "description": "The horse family is the taxonomic family of horses and related animals, including the extant horses...",
      "distance": 0.05
    },
    ...
    {
      "title": "Perissodactyla",
      "description": "An order of ungulates. The order includes about 17 living species divided into three families: Equidae, Rhinocerotidae, and Tapiridae ...",
      "distance": 0.12
    },
    ...
    {
      "title": "Eukaryote",
      "description": "Organisms whose cells have a membrane-bound nucleus...",
      "distance": 0.98
    }
  ]
}

redhog commented 1 month ago

I think the comparison prompt is kinda unnecessary btw: once the clusters are big enough, it's not gonna do anything really useful (how would the LLM know whether these concepts are too far from each other at this particular clustering level?).

redhog commented 1 month ago

Nice thing with agglomerative clustering is that you do not have to provide a k: You get pairs of nodes as the lowest level clusters, and then the pairs are paired up, all the way to the top.
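To illustrate the "no k needed" point, here is a toy single-linkage implementation in plain Python. It is a stand-in written for this thread, not scikit-learn's algorithm and not anything from docetl, and the function name `agglomerate` is made up. It keeps merging the two closest clusters until one is left, so the only input is the data itself:

```python
import math
from itertools import combinations


def agglomerate(points):
    """Naive single-linkage agglomerative clustering.

    Returns (children, distances): the i-th merge joins the two node ids
    in children[i] at distance distances[i] and creates node len(points) + i.
    """
    n = len(points)
    members = {i: [i] for i in range(n)}  # active cluster id -> point indices

    def linkage(pair):
        a, b = pair
        # Single linkage: distance between the closest pair of members.
        return min(
            math.dist(points[i], points[j])
            for i in members[a]
            for j in members[b]
        )

    children, distances = [], []
    next_id = n
    while len(members) > 1:
        a, b = min(combinations(sorted(members), 2), key=linkage)
        children.append((a, b))
        distances.append(linkage((a, b)))
        members[next_id] = members.pop(a) + members.pop(b)
        next_id += 1
    return children, distances


points = [(0.0,), (0.1,), (5.0,), (5.3,)]
children, distances = agglomerate(points)
```

For these four points the two tight pairs are merged first and then joined at the top, giving `n - 1` merges in total. The merge distances are what a per-level `distance` value could be taken from.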