mivanit / tabGPT

use GPT to classify a bunch of your open tabs!

Thoughts on Classification Pipeline #9

Open rusheb opened 1 year ago

rusheb commented 1 year ago

Hi @mivanit,

I've been trying to understand how we could set up the classification pipeline using the prompted gpt2 model.

I am probably wrong about everything - my main reason for writing this is so that you can correct me.

What I think you are suggesting

I've tried to understand what you have in mind based on the roadmap you wrote up yesterday:

  • small set of manually classified data
  • generation of prompts from manual data + tag list
  • extraction of tags from a generated response

If I understand correctly, you are saying that we manually decide on the set of possible tags. Then we manually label some tab data according to those tags. Then we train GPT to generate the tags using normal next-token generation - i.e. each successive word generated would contain another tag. The tags would then be extracted algorithmically.

The main drawback of this approach, in my mind, is that we have to define the tags up front. I expect this would reduce flexibility. Is there a way of avoiding this?

I'm also not sure what we would end up doing with the resulting tags. You mentioned passing them to your PKM. I'd be keen to hear more about this option.

I hope I have understood you correctly but I doubt it! Please let me know what I'm missing here.

Unsupervised alternative

I was thinking about how we could do this in an unsupervised way. This is what I've come up with:

  1. Use the last hidden layer of GPT as features/embeddings and pass these to a clustering algorithm
  2. Pass the groups back to GPT to generate a group name that summarizes the content of each group
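To make step 1 concrete, here's a minimal sketch of the clustering stage, assuming the per-tab embeddings have already been extracted (here replaced by synthetic vectors; a real run would e.g. mean-pool GPT-2's last hidden states over each tab's metadata text):

```python
import numpy as np

def cluster_embeddings(embeddings: np.ndarray, k: int, n_iters: int = 50) -> np.ndarray:
    """Plain k-means over per-tab embedding vectors; returns a cluster id per row."""
    # farthest-point initialization: deterministic and robust for toy data
    centroids = [embeddings[0]]
    for _ in range(k - 1):
        dists = np.min([np.linalg.norm(embeddings - c, axis=1) for c in centroids], axis=0)
        centroids.append(embeddings[dists.argmax()])
    centroids = np.stack(centroids)
    for _ in range(n_iters):
        # assign every embedding to its nearest centroid, then recompute centroids
        dists = np.linalg.norm(embeddings[:, None, :] - centroids[None, :, :], axis=-1)
        labels = dists.argmin(axis=1)
        for i in range(k):
            if (labels == i).any():
                centroids[i] = embeddings[labels == i].mean(axis=0)
    return labels

# Synthetic stand-ins for GPT hidden-state embeddings: two well-separated groups of tabs.
rng = np.random.default_rng(0)
tab_embeddings = np.vstack([
    rng.normal(0.0, 0.1, size=(5, 8)),  # e.g. five ML-paper tabs
    rng.normal(5.0, 0.1, size=(5, 8)),  # e.g. five unrelated tabs
])
labels = cluster_embeddings(tab_embeddings, k=2)
```

Step 2 would then feed the titles/headings of each cluster's members back into GPT with a "summarize these in a short name" prompt.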

This seems much better because we don't have to predefine the labels. However it would change the behaviour (see next section).

A possible drawback is that it might not give me the same learning outcomes, especially if we don't end up doing fine-tuning. I'm very unsure about this point, so let me know your opinion.

Behaviour

In my suggested option, we group tabs rather than tagging them. We could use this output to group the tabs in the browser -- but I'm not sure if there's a tab grouping extension which can be automated in this way. I suppose we could build an extension as a last resort. Alternatively, we could send the groupings to a text file, but I'm not sure what the value of this would be.

Maybe there is another unsupervised approach where we keep the tag functionality but encourage/force the model to produce a minimal number of tags per input. I'm not really sure what this would look like.

Fine-tuning

We could see how well this works as a baseline and then attempt to fine-tune if needed.

One approach to fine-tuning would be next-token generation on an unlabelled dataset of tab/bookmark metadata. If I understand well enough, this would give the model a better "understanding" of how the tab metadata is structured, which would allow it to produce sensible continuations with fewer in-prompt examples.

Other Questions


Sorry, this ended up pretty long! Hope it's of some value. Looking forward to hearing your thoughts.

mivanit commented 1 year ago

classifying via continuation vs embeddings

Thanks for asking me to clarify, I'm realizing what I wrote was not very understandable.

classifying via continuation

a prompt might look like

# database of classified URLs
valid_tags: [LLMs, interpretability, AI_policy, conference_info]
classified_urls:
  - url: https://www.lesswrong.com/posts/PDLfpRwSynu73mxGw/basic-facts-about-language-model-internals-1
    tags: [LLMs, interpretability]
  - url: https://neurips.cc/Conferences/2022/Dates
    tags: [conference_info]
  - url: https://arxiv.org/pdf/2212.07677.pdf
    tags: [

So, we provide a few-shot example of classification, some information about which tags might be valid (this can be omitted, but we will have to try both ways), and then ask it to generate a continuation which would reasonably consist of a list of tags.
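Since the prompt ends mid-list at `tags: [`, the "extracted algorithmically" step is just a small parse of whatever the model generates next. A minimal sketch (the function name `parse_tags` is mine, not from the repo):

```python
def parse_tags(continuation: str) -> list[str]:
    """Parse a generated continuation like 'LLMs, interpretability]' into a tag list."""
    # keep only the text before the closing bracket, in case the model kept going
    body = continuation.split("]")[0]
    return [tag.strip() for tag in body.split(",") if tag.strip()]

print(parse_tags("LLMs, interpretability]"))  # → ['LLMs', 'interpretability']
```

Generation would stop at (or shortly after) the `]` token, so in practice the continuation is only a few tokens long.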

This is what I had meant to do initially. For reasons of simplicity of prototyping and extensibility to GPT-3, it may still be useful.

The utility of finetuning here is that we would likely not need to provide any examples in the prompt, lowering inference time/cost: the model would already "know" that it needs to generate a tag in response to the prompt.

other notes:

classification via embeddings

Clustering via embeddings is a really good idea! It will take a bit more work, but I'm happy to try it out. My main concerns are:

It's worth noting that it's probably still possible to use the embeddings while still binning the URLs among some predetermined list of tags.
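One way that binning could work is nearest-tag assignment by cosine similarity, sketched here with synthetic vectors (a real run would embed both the tag names and each URL's metadata with the same model):

```python
import numpy as np

def nearest_tag(url_embedding: np.ndarray, tag_embeddings: dict[str, np.ndarray]) -> str:
    """Assign the predetermined tag whose embedding is most cosine-similar to the URL's."""
    def cos(a: np.ndarray, b: np.ndarray) -> float:
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
    return max(tag_embeddings, key=lambda tag: cos(url_embedding, tag_embeddings[tag]))

# Synthetic stand-ins for embeddings of the tag names and one URL's metadata.
tags = {
    "LLMs": np.array([1.0, 0.0, 0.0]),
    "conference_info": np.array([0.0, 1.0, 0.0]),
}
url_vec = np.array([0.9, 0.1, 0.0])
print(nearest_tag(url_vec, tags))  # → LLMs
```

A similarity threshold (rather than always taking the argmax) would leave room for an "untagged" bucket.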

other parts of the pipeline

tabs vs bookmarks

The name of this project is probably a misnomer -- dealing with exported bookmarks from a static file seems a lot easier than talking to a browser and getting the active tabs (granted, I have never written a browser extension and have barely any javascript experience). We will probably be dealing with bookmarks for the foreseeable future.

output behavior

I use Dendron as a PKM extensively. A minimal pipeline for this project could simply look like

export bookmarks --> run script --> URLs added to particular notes in your dendron library
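The last arrow of that pipeline could be as simple as appending bullets to markdown files, assuming the Dendron vault is a directory of notes named with its dot hierarchy and that one note per tag (e.g. `bookmarks.LLMs.md`) is an acceptable layout -- both are assumptions on my part:

```python
from pathlib import Path

def append_url_to_note(vault: Path, tag: str, url: str) -> Path:
    """Append a URL as a bullet to the Dendron note for the given tag."""
    note = vault / f"bookmarks.{tag}.md"
    if not note.exists():
        note.write_text(f"# bookmarks.{tag}\n\n")  # minimal stub note
    with note.open("a") as f:
        f.write(f"- {url}\n")
    return note

vault = Path("vault")
vault.mkdir(exist_ok=True)
note = append_url_to_note(vault, "LLMs", "https://arxiv.org/pdf/2212.07677.pdf")
```

Dendron would pick the new bullets up on the next vault sync; frontmatter could be added later if needed.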

There are some possible ways to implement interaction with dendron:

rusheb commented 1 year ago

Thank you for the detailed clarification!

Based on this I have spent today working on a basic non-fine-tuned implementation.

I wrote a very rudimentary prompt by adding tags to the yaml output of process_urls:

- url: github.com/mivanit/tabGPT
  title: 'GitHub - mivanit/tabGPT: use GPT to classify a bunch of your open tabs!'
  headings:
  - mivanit/tabGPT
  - Name already in use
  - tabGPT
  - scripts
  - Roadmap
  - Development
  tags: [LLMs, GPT]
- url: arxiv.org/abs/2211.00593
  title: 'Interpretability in the Wild: a Circuit for Indirect Object Identification
    in GPT-2 small'
  subjects:
  - Machine Learning (cs.LG)
  - Artificial Intelligence (cs.AI)
  - Computation and Language (cs.CL)
  tags: [research papers, interpretability, LLMs]
- url: www.lesswrong.com/posts/PDLfpRwSynu73mxGw/basic-facts-about-language-model-internals-1
  title: 403 Forbidden
  headings:
  - 403 Forbidden
  tags: [LLMs, interpretability]
- url: neurips.cc/Conferences/2022/Dates
  title: NeurIPS 2022
  tags: [conference_info]

Then used the following code to generate tags:

from preprocess_urls import get_url_meta
from generate_continuation import generate_continuation
import yaml

from pathlib import Path

def main():
    base_prompt_file = Path("data/prompt.yaml")
    url = "https://arxiv.org/pdf/2212.07677.pdf"

    # note: exists/absolute are methods, so they must be called --
    # a bare `base_prompt_file.exists` is always truthy and the check never fires
    if not base_prompt_file.exists():
        raise ValueError(
            f"Base prompt file {base_prompt_file.absolute()} does not exist."
        )

    prompt = generate_prompt(url, base_prompt_file)
    print(prompt)
    continuation = generate_continuation(prompt, max_length=30, stop_token="]")

    tags = extract_tags(continuation)
    print(f"Tags for {url}: {tags}")

def generate_prompt(url: str, base_prompt_file: Path):
    # have to wrap in a list to make sure it's the same format as the base_prompt
    # TODO tidy this up
    metadata = [get_url_meta(url)]

    base_prompt = base_prompt_file.read_text()
    url_prompt = yaml.dump(metadata, sort_keys=False)

    return base_prompt.strip() + "\n" + url_prompt + "  tags: ["

def extract_tags(continuation: str):
    return continuation.split(", ")

if __name__ == "__main__":
    main()

You can also find this on my branch rusheb-pipeline.

It seems to work! This is the output:

Tags for https://arxiv.org/pdf/2212.07677.pdf: ['research papers', 'interpretability', 'LLMs']