[ ] simonw/llm-cluster: LLM plugin for clustering embeddings

simonw/llm-cluster: LLM plugin for clustering embeddings

Snippet

"llm-cluster

LLM plugin for clustering embeddings

Background on this project: Clustering with llm-cluster.

Installation

Install this plugin in the same environment as LLM.

llm install llm-cluster
Usage

The plugin adds a new command, llm cluster. This command takes the name of an embedding collection and the number of clusters to return.

Content

LLM plugin for clustering embeddings

Background on this project: Clustering with llm-cluster.

Installation

Install this plugin in the same environment as LLM.

llm install llm-cluster

Usage

The plugin adds a new command, llm cluster. This command takes the name of an embedding collection and the number of clusters to return.

First, use paginate-json and jq to populate a collection. In this case we are embedding the title and body of every issue in the llm repository, and storing the result in a issues.db database:

paginate-json 'https://api.github.com/repos/simonw/llm/issues?state=all&filter=all' \
  | jq '[.[] | {id: .id, title: .title}]' \
  | llm embed-multi llm-issues - \
    --database issues.db --store

The --store flag causes the content to be stored in the database along with the embedding vectors.

Now we can cluster those embeddings into 10 groups:

llm cluster llm-issues 10 \
  -d issues.db

If you omit the -d option the default embeddings database will be used.

The output should look something like this (truncated):

[
  {
    "id": "2",
    "items": [
      {
        "id": "1650662628",
        "content": "Initial design"
      },
      {
        "id": "1650682379",
        "content": "Log prompts and responses to SQLite"
      }
    ]
  },
  {
    "id": "4",
    "items": [
      {
        "id": "1650760699",
        "content": "llm web command - launches a web server"
      },
      {
        "id": "1759659476",
        "content": "`llm models` command"
      },
      {
        "id": "1784156919",
        "content": "`llm.get_model(alias)` helper"
      }
    ]
  },
  {
    "id": "7",
    "items": [
      {
        "id": "1650765575",
        "content": "--code mode for outputting code"
      },
      {
        "id": "1659086298",
        "content": "Accept PROMPT from --stdin"
      },
      {
        "id": "1714651657",
        "content": "Accept input from standard in"
      }
    ]
  }
]

The content displayed is truncated to 100 characters. Pass --truncate 0 to disable truncation, or --truncate X to truncate to X characters.

Generating summaries for each cluster

The --summary flag will cause the plugin to generate a summary for each cluster, by passing the content of the items (truncated according to the --truncate option) through a prompt to a Large Language Model.

This feature is still experimental. You should experiment with custom prompts to improve the quality of your summaries.

Since this can run a large amount of text through a LLM this can be expensive, depending on which model you are using.

This feature only works for embeddings that have had their associated content stored in the database using the --store flag.

You can use it like this:

llm cluster llm-issues 10 \
  -d issues.db \
  --summary

This uses the default prompt and the default model.

Partial example output:

[
  {
    "id": "5",
    "items": [
      {
        "id": "1650682379",
        "content": "Log prompts and responses to SQLite"
      },
      {
        "id": "1650757081",
        "content": "Command for browsing captured logs"
      }
    ],
    "summary": "Log Management and Interactive Prompt Tracking"
  },
  {
    "id": "6",
    "items": [
      {
        "id": "1650771320",
        "content": "Mechanism for continuing an existing conversation"
      },
      {
        "id": "1740090291",
        "content": "-c option for continuing a chat (using new chat_id column)"
      },
      {
        "id": "1784122278",
        "content": "Figure out truncation strategy for continue conversation mode"
      }
    ],
    "summary": "Continuing Conversation Mechanism and Management"
  }
]

To use a different model, e.g. GPT-4, pass the --model option:

llm cluster llm-issues 10 \
  -d issues.db \
  --summary \
  --model gpt-4

The default prompt used is:

Short, concise title for this cluster of related documents.

To use a custom prompt, pass --prompt:

llm cluster llm-issues 10 \
  -d issues.db \
  --summary \
  --model gpt-4 \
  --prompt 'Summarize this in a short line in the style of a bored, angry panda'

A "summary" key will be added to each cluster, containing the generated summary.

irthomasthomas / undecidability