mikeizbicki / cmc-csci181-languages

One last tip for the `evaluator.py` hw #19

mikeizbicki opened this issue 1 month ago

mikeizbicki commented 1 month ago

There's one last trick that I forgot to mention in class for getting good results on these "multiple choice question" or "classifier-type" prompts: you should actually tell the model what the possible options are! I modified the code we wrote in class to add the options to the instructions, and the model then got the right answer on all of the simple test cases we had.

The specific change I made is adding the following line to the prompt:

Valid values include: {valid_labels}

where `valid_labels` is a local variable defined to be `['Harris', 'Trump']`. That works for our simple test cases, but to make it work with the real dataset, you'll need to load the valid labels from the data.

My final evaluator.py looks like:

'''
This file is for evaluating the quality of our RAG system
using the Hairy Trumpet tool/dataset.
'''

import ragnews

class RAGEvaluator:
    def predict(self, masked_text):
        ''' 
        >>> model = RAGEvaluator()
        >>> model.predict('There no mask token here.')
        []
        >>> model.predict('[MASK0] is the democratic nominee')
        ['Harris']
        >>> model.predict('[MASK0] is the democratic nominee and [MASK1] is the republican nominee')
        ['Harris', 'Trump']
        '''
        # You might think about calling the ragnews.run_llm function directly,
        # but the point is to evaluate the whole RAG system,
        # so we call the ragnews.rag function instead.

        valid_labels = ['Harris', 'Trump']

        db = ragnews.ArticleDB('ragnews.db')
        textprompt = f'''
This is a fancier question that is based on standard cloze style benchmarks.
I'm going to provide you a sentence, and that sentence will have a masked token inside of it that will look like [MASK0] or [MASK1] or [MASK2] and so on.
And your job is to tell me what the value of that masked token was.
Valid values include: {valid_labels}

The size of your output should just be a single word for each mask.
You should not provide any explanation or other extraneous words.
If there are multiple mask tokens, provide each token separately with a whitespace in between.

INPUT: [MASK0] is the democratic nominee
OUTPUT: Harris

INPUT: [MASK0] is the democratic nominee and [MASK1] is the republican nominee
OUTPUT: Harris Trump

INPUT: {masked_text}
OUTPUT: '''
        output = ragnews.rag(textprompt, db, keywords_text=masked_text)
        return output

        # Possible reasons for bad results:
        # 1. the code (especially the prompt) in this function is bad
        # 2. the rag function itself could be bad
        #
        # To improve (1) above:
        # 1. prompt engineering alone didn't work
        # 2. I had to change which model I was using

import logging
logging.basicConfig(
    format='%(asctime)s %(levelname)-8s %(message)s',
    datefmt='%Y-%m-%d %H:%M:%S',
    level=logging.INFO,
    )
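
If you want to sanity-check the doctests in predict, one option is to run them with the doctest module. This assumes the file above is saved as evaluator.py; note that the doctests make live calls through ragnews, so they need a working database and API key:

# run the doctests in evaluator.py; each one makes a live LLM call,
# so this needs a working ragnews setup
import doctest
import evaluator

doctest.testmod(evaluator, verbose=True)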

The version I created on my own before class seems to be just a little bit better (because I was more careful about making the wording in the prompt clearer), but I got 74% accuracy on the hw dataset with this prompt. I suspect that this problem should really be very simple, and with some better prompt engineering it should be possible to get up to >90% accuracy. (100% accuracy is probably not possible because I think some of the sentences are genuinely ambiguous, like `[MASK0] is running for president`.)
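
For reference, a simple way to compute that kind of accuracy number is a loop along these lines (a sketch only: it assumes the class is in scope, e.g. appended to evaluator.py, and that each line of the Hairy Trumpet file is JSON with 'masked_text' and 'masks' fields; adjust the key names to match your data):

import json

def accuracy(path):
    '''
    Run RAGEvaluator.predict over every datapoint in a Hairy Trumpet file
    and return the fraction of datapoints predicted exactly right.
    '''
    model = RAGEvaluator()
    correct = 0
    total = 0
    with open(path) as fin:
        for line in fin:
            dp = json.loads(line)
            # NOTE: 'masked_text' is a guessed key name; 'masks' is the list
            # of correct labels for the datapoint
            prediction = model.predict(dp['masked_text'])
            # predict should return one label per mask (see its docstring),
            # so an exact match means every mask was filled correctly
            if list(prediction) == list(dp['masks']):
                correct += 1
            total += 1
    return correct / total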

ains-arch commented 1 month ago

Since RAG is basically just a wrapper on a model (sparkling prompt engineering, so to speak), irrespective of the underlying architecture, can I try gluing my RAG code to a masked BERT model from HuggingFace or a non-Groq masked LLM API for this evaluation? Prompt engineering my long-suffering Llama model into outputting only a single-digit number of tokens feels like giving it a lobotomy.

mikeizbicki commented 1 month ago

That won't work. The point is to give the LLM a lobotomy. We are dissecting the LLM here and cutting out some small portion of its "brain" in order to test that small piece of functionality, because we can't test the larger functionality of text generation directly.

It is possible to do RAG on masked language models. And this is actually state of the art for essentially all problems where the goal is to actually solve multiple choice questions. But that is not our goal. Our goal is to have a conversation with a model.

Another way to think about it: if I'm a history professor, I could have you write an essay about history. But that's obnoxious and hard to grade, so I'm going to give you a multiple choice test about history. I don't actually think that solving multiple choice tests is a useful skill for you to have. I'm giving you the multiple choice test only because it's easy to grade, and I hope that scores on the multiple choice test correlate with the things I actually care about, like your ability to explain why WWII happened. I actually care about the latter, but evaluating it directly is much harder.

ains-arch commented 1 month ago

So our goal is to evaluate the specific Llama model & RAG pair (or more generally, our RAG on sequence-generation tasks) and not the RAG irrespective of model/task, and so we shouldn't/don't need to assume that accuracy on a mask-and-refill task with one model would generalize to accuracy on a different task with a different model?

finnless commented 1 month ago

What about the Hairy Trumpet cases that only have one mask? Should we still provide the answer and get the "free" pass?

mikeizbicki commented 1 month ago

@ains-arch Yes, I think? Maybe it would be clearer if the class were defined more generically as

class RAGEvaluator:
    def __init__(self, llmfunction):
        self.llmfunction = llmfunction

    def predict(self, text):
        prompttext = ...  # build the cloze-style prompt around text here
        return self.llmfunction(prompttext)

The goal is to write a class that can generically evaluate any LLM-type model for its domain-specific knowledge. Using the class above, we can compare (e.g.) the rag function we built with the run_llm function to verify that rag is in fact adding more capabilities. Alternatively, we can write a handful of different rag functions with different prompt/model/SQL combinations and see which one is the most powerful. The most powerful one is then what we want to use.
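
Concretely, the comparison might look something like the sketch below. The lambda wrappers are assumptions about the ragnews signatures (in particular, run_llm is assumed to accept a single prompt string, and in practice you may also want to pass keywords_text to rag like the evaluator above does):

import ragnews

db = ragnews.ArticleDB('ragnews.db')

# wrap each backend in a function that takes a single prompt string so that
# both fit the llmfunction interface expected by RAGEvaluator
rag_model = RAGEvaluator(lambda prompt: ragnews.rag(prompt, db))
llm_model = RAGEvaluator(lambda prompt: ragnews.run_llm(prompt))

# running both models over the same Hairy Trumpet file and comparing their
# accuracies tells us whether rag actually adds capabilities over the bare llm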

@finnless I don't understand your question. Also, it turns out that all the prompts in the file that you're required to work with for the assignment contain only a single mask. Other example datasets could contain more (and so it's worth writing the function more generically to be able to handle more).

finnless commented 1 month ago

I'll rephrase. All the prompts in the required file contain only a single mask. If we are allowed to provide the only correct answer in the prompt, wouldn't we easily be able to have the response be that provided correct answer and get 100% on the test, or am I missing something? Of course, the multi-option prompts would be more challenging.

mikeizbicki commented 1 month ago

@finnless You shouldn't be providing any of the correct answers. The examples I have in my prompt are fixed and don't depend on the input text. You're right that in class I happened to test it using the same prompt for convenience's sake, but you should not be modifying the examples based on the input answers.

finnless commented 1 month ago

Got it. I got confused by “you should actually tell the model what the possible options are” and “you'll need to load the valid labels from the data”. I assumed that {valid_labels} was meant to be dynamic since it was included as a variable instead of being hard coded into the string.

mikeizbicki commented 1 month ago

labels is supposed to be dynamic, but based on the full dataset instead of an individual problem. Here are the relevant lines of code from my solution where I construct the variable.

    # collect the set of all mask labels that appear anywhere in the dataset
    labels = set()
    with open(args.path) as fin:
        for i, line in enumerate(fin):
            dp = json.loads(line)
            labels.update(dp['masks'])
    logger.info(f'labels={labels}')

Note that I have this outside of the RAGEvaluator class and pass labels as an input to the __init__ method.
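
Putting those pieces together, the class ends up looking roughly like this (a sketch that shows only the parts that change; the rest of predict is the same as the code at the top of the thread):

class RAGEvaluator:
    def __init__(self, labels):
        # labels is the set of every mask value that appears in the dataset,
        # e.g. {'Harris', 'Trump'} for the simple test cases
        self.labels = labels

    def predict(self, masked_text):
        # same prompt as before, except the valid-values line now
        # interpolates self.labels instead of a hard-coded list:
        #   Valid values include: {self.labels}
        ...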

finnless commented 1 month ago

labels is supposed to be dynamic, but based on the full dataset instead of an individual problem.

Ohh, this is the part I was missing. Makes total sense now. Thank you.