mivanit / tabGPT

use GPT to classify a bunch of your open tabs!

Thoughts on Classification Pipeline #9

Open rusheb opened 1 year ago

rusheb commented 1 year ago

Hi @mivanit,

I've been trying to understand how we could set up the classification pipeline using the prompted gpt2 model.

I am probably wrong about everything - my main reason for writing this is so that you can correct me.

What I think you are suggesting

I've tried to understand what you have in mind based on the roadmap you wrote up yesterday:

  • small set of manually classified data
  • generation of prompts from manual data + tag list
  • extraction of tags from a generated response

If I understand correctly, you are saying that we manually decide on the set of possible tags. Then we manually label some tab data according to those tags. Then we train GPT to generate the tags using normal next-token generation - i.e. each successive word generated would contain another tag. The tags would then be extracted algorithmically.

The main drawback of this approach, in my mind, is that we have to define the tags up front. I expect this would reduce flexibility. Is there a way of avoiding this?

I'm also not sure what we would end up doing with the resulting tags. You mentioned passing them to your PKM. I'd be keen to hear more about this option.

I hope I have understood you correctly but I doubt it! Please let me know what I'm missing here.

Unsupervised alternative

I was thinking about how we could do this in an unsupervised way. This is what I've come up with:

  1. Use the last hidden layer of GPT as features/embeddings and pass these to a clustering algorithm
  2. Pass the groups back to GPT to generate a group name that summarizes the content of each group
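To make step 1 concrete, here's a minimal sketch of the clustering stage, assuming the per-tab embeddings have already been extracted (here replaced by synthetic vectors; a real run would e.g. mean-pool GPT-2's last hidden states over each tab's metadata text):

```python
import numpy as np

def cluster_embeddings(embeddings: np.ndarray, k: int, n_iters: int = 50) -> np.ndarray:
    """Plain k-means over per-tab embedding vectors; returns a cluster id per row."""
    # farthest-point initialization: deterministic and robust for toy data
    centroids = [embeddings[0]]
    for _ in range(k - 1):
        dists = np.min([np.linalg.norm(embeddings - c, axis=1) for c in centroids], axis=0)
        centroids.append(embeddings[dists.argmax()])
    centroids = np.stack(centroids)
    for _ in range(n_iters):
        # assign every embedding to its nearest centroid, then recompute centroids
        dists = np.linalg.norm(embeddings[:, None, :] - centroids[None, :, :], axis=-1)
        labels = dists.argmin(axis=1)
        for i in range(k):
            if (labels == i).any():
                centroids[i] = embeddings[labels == i].mean(axis=0)
    return labels

# Synthetic stand-ins for GPT hidden-state embeddings: two well-separated groups of tabs.
rng = np.random.default_rng(0)
tab_embeddings = np.vstack([
    rng.normal(0.0, 0.1, size=(5, 8)),  # e.g. five ML-paper tabs
    rng.normal(5.0, 0.1, size=(5, 8)),  # e.g. five unrelated tabs
])
labels = cluster_embeddings(tab_embeddings, k=2)
```

Step 2 would then feed the titles/headings of each cluster's members back into GPT with a "summarize these in a short name" prompt.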

This seems much better because we don't have to predefine the labels. However it would change the behaviour (see next section).

A possible drawback is that it might not give me the same learning outcomes, especially if we don't end up doing fine-tuning. I'm very unsure about this point, so let me know your opinion.

Behaviour

In my suggested option, we group tabs rather than tagging them. We could use this output to group the tabs in the browser -- but I'm not sure if there's a tab grouping extension which can be automated in this way. I suppose we could build an extension as a last resort. Alternatively, we could send the groupings to a text file, but I'm not sure what the value of this would be.

Maybe there is another unsupervised approach where we keep the tag functionality but encourage/force the model to produce a minimal number of tags per input. I'm not really sure what this would look like.

Fine-tuning

We could see how well this works as a baseline and then attempt to fine-tune if needed.

One approach to fine-tuning would be next-token generation on an unlabelled dataset of tab/bookmark metadata. If I understand well enough, this would give the model a better "understanding" of how the tab metadata is structured, which would allow it to produce sensible continuations with fewer in-prompt examples.

Other Questions


Sorry, this ended up pretty long! Hope it's of some value. Looking forward to hearing your thoughts.

mivanit commented 1 year ago

classifying via continuation vs embeddings

Thanks for asking me to clarify, I'm realizing what I wrote was not very understandable.

classifying via continuation

a prompt might look like

# database of classified URLs
valid_tags: [LLMs, interpretability, AI_policy, conference_info]
classified_urls:
  - url: https://www.lesswrong.com/posts/PDLfpRwSynu73mxGw/basic-facts-about-language-model-internals-1
    tags: [LLMs, interpretability]
  - url: https://neurips.cc/Conferences/2022/Dates
    tags: [conference_info]
  - url: https://arxiv.org/pdf/2212.07677.pdf
    tags: [

So, we provide a few-shot example of classification, some information about which tags might be valid (this can be omitted, but we will have to try both ways), and then ask it to generate a continuation which would reasonably consist of a list of tags.
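Since the prompt ends mid-list at `tags: [`, the "extracted algorithmically" step is just a small parse of whatever the model generates next. A minimal sketch (the function name `parse_tags` is mine, not from the repo):

```python
def parse_tags(continuation: str) -> list[str]:
    """Parse a generated continuation like 'LLMs, interpretability]' into a tag list."""
    # keep only the text before the closing bracket, in case the model kept going
    body = continuation.split("]")[0]
    return [tag.strip() for tag in body.split(",") if tag.strip()]

print(parse_tags("LLMs, interpretability]"))  # → ['LLMs', 'interpretability']
```

Generation would stop at (or shortly after) the `]` token, so in practice the continuation is only a few tokens long.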

This is what I had meant to do initially. For reasons of simplicity of prototyping and extensibility to GPT-3, it may still be useful.

The utility of finetuning here is that we would likely not need to provide any examples in the prompt, lowering inference time/cost: the model would already "know" that it needs to generate a tag in response to the prompt.

other notes:

classification via embeddings

Clustering via embeddings is a really good idea! It will take a bit more work, but I'm happy to try it out. My main concerns are:

It's worth noting that it's probably still possible to use the embeddings while still binning the URLs among some predetermined list of tags.
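One way that binning could work is nearest-tag assignment by cosine similarity, sketched here with synthetic vectors (a real run would embed both the tag names and each URL's metadata with the same model):

```python
import numpy as np

def nearest_tag(url_embedding: np.ndarray, tag_embeddings: dict[str, np.ndarray]) -> str:
    """Assign the predetermined tag whose embedding is most cosine-similar to the URL's."""
    def cos(a: np.ndarray, b: np.ndarray) -> float:
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
    return max(tag_embeddings, key=lambda tag: cos(url_embedding, tag_embeddings[tag]))

# Synthetic stand-ins for embeddings of the tag names and one URL's metadata.
tags = {
    "LLMs": np.array([1.0, 0.0, 0.0]),
    "conference_info": np.array([0.0, 1.0, 0.0]),
}
url_vec = np.array([0.9, 0.1, 0.0])
print(nearest_tag(url_vec, tags))  # → LLMs
```

A similarity threshold (rather than always taking the argmax) would leave room for an "untagged" bucket.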

other parts of the pipeline

tabs vs bookmarks

The name of this project is probably a misnomer -- dealing with exported bookmarks from a static file seems a lot easier than talking to a browser and getting the active tabs (granted, I have never written a browser extension and have barely any javascript experience). We will probably be dealing with bookmarks for the foreseeable future.

output behavior

I use Dendron as a PKM extensively. A minimal pipeline for this project could simply look like

export bookmarks --> run script --> URLs added to particular notes in your dendron library
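The last arrow of that pipeline could be as simple as appending bullets to markdown files, assuming the Dendron vault is a directory of notes named with its dot hierarchy and that one note per tag (e.g. `bookmarks.LLMs.md`) is an acceptable layout -- both are assumptions on my part:

```python
from pathlib import Path

def append_url_to_note(vault: Path, tag: str, url: str) -> Path:
    """Append a URL as a bullet to the Dendron note for the given tag."""
    note = vault / f"bookmarks.{tag}.md"
    if not note.exists():
        note.write_text(f"# bookmarks.{tag}\n\n")  # minimal stub note
    with note.open("a") as f:
        f.write(f"- {url}\n")
    return note

vault = Path("vault")
vault.mkdir(exist_ok=True)
note = append_url_to_note(vault, "LLMs", "https://arxiv.org/pdf/2212.07677.pdf")
```

Dendron would pick the new bullets up on the next vault sync; frontmatter could be added later if needed.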

There are some possible ways to implement interaction with dendron:

rusheb commented 1 year ago

Thank you for the detailed clarification!

Based on this I have spent today working on a basic non-fine-tuned implementation.

I wrote a very rudimentary prompt by adding tags to the yaml output of process_urls:

- url: github.com/mivanit/tabGPT
  title: 'GitHub - mivanit/tabGPT: use GPT to classify a bunch of your open tabs!'
  headings:
  - mivanit/tabGPT
  - Name already in use
  - tabGPT
  - scripts
  - Roadmap
  - Development
  tags: [LLMs, GPT]
- url: arxiv.org/abs/2211.00593
  title: 'Interpretability in the Wild: a Circuit for Indirect Object Identification
    in GPT-2 small'
  subjects:
  - Machine Learning (cs.LG)
  - Artificial Intelligence (cs.AI)
  - Computation and Language (cs.CL)
  tags: [research papers, interpretability, LLMs]
- url: www.lesswrong.com/posts/PDLfpRwSynu73mxGw/basic-facts-about-language-model-internals-1
  title: 403 Forbidden
  headings:
  - 403 Forbidden
  tags: [LLMs, interpretability]
- url: neurips.cc/Conferences/2022/Dates
  title: NeurIPS 2022
  tags: [conference_info]

Then used the following code to generate tags:

from preprocess_urls import get_url_meta
from generate_continuation import generate_continuation
import yaml

from pathlib import Path

def main():
    base_prompt_file = Path("data/prompt.yaml")
    url = "https://arxiv.org/pdf/2212.07677.pdf"

    # note: exists/absolute are methods, so they must be called --
    # a bare `base_prompt_file.exists` is always truthy and the check never fires
    if not base_prompt_file.exists():
        raise ValueError(
            f"Base prompt file {base_prompt_file.absolute()} does not exist."
        )

    prompt = generate_prompt(url, base_prompt_file)
    print(prompt)
    continuation = generate_continuation(prompt, max_length=30, stop_token="]")

    tags = extract_tags(continuation)
    print(f"Tags for {url}: {tags}")

def generate_prompt(url: str, base_prompt_file: Path):
    # have to wrap in a list to make sure it's the same format as the base_prompt
    # TODO tidy this up
    metadata = [get_url_meta(url)]

    base_prompt = base_prompt_file.read_text()
    url_prompt = yaml.dump(metadata, sort_keys=False)

    return base_prompt.strip() + "\n" + url_prompt + "  tags: ["

def extract_tags(continuation: str):
    return continuation.split(", ")

if __name__ == "__main__":
    main()

You can also find this on my branch rusheb-pipeline.

It seems to work! This is the output:

Tags for https://arxiv.org/pdf/2212.07677.pdf: ['research papers', 'interpretability', 'LLMs']