openai / weak-to-strong


Observing eval accuracy considerably lower than reported? #3

Open knagrecha opened 10 months ago

knagrecha commented 10 months ago

Hi, thanks for open-sourcing this code. I'm noticing that my tests with GPT-2 variants show considerably lower eval accuracies than what's reported in the paper & charts. I'm using the command provided in the README. I do not think the eval code itself is incorrect --- testing it with LLaMA shows much higher eval accuracies (as I would expect). But I cannot replicate the GPT-2 results; any pointers on what the issue might be?

knagrecha commented 10 months ago

As an example:

On sciq, GPT-2-medium gets an accuracy of 0.43 (0.5 after I lowered the learning rate). LLaMA-7B trained on ground truth got 0.84. LLaMA-7B weak-to-strong transfer got 0.43 (0.81 after I lowered the learning rate).

WuTheFWasThat commented 10 months ago

0.43 is worse than random, so either something is wrong with the ML there or your eval set isn't big enough.
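For intuition on the eval-set-size point: on a small binary eval set, a measured accuracy of 0.43 is easily within sampling noise of 0.5. A minimal sketch of the binomial error bar (not from the repo; the n values are hypothetical, not the README defaults):

    # Rough sketch (not from the repo): how much accuracy can wobble on a small
    # binary eval set, treating each example as an independent Bernoulli trial.
    import math

    def accuracy_std_err(acc: float, n: int) -> float:
        """Binomial standard error of an accuracy estimate over n examples."""
        return math.sqrt(acc * (1 - acc) / n)

    for n in (50, 100, 500, 1000):
        half_width = 1.96 * accuracy_std_err(0.5, n)  # ~95% interval half-width
        print(f"n={n:5d}: 0.50 +/- {half_width:.3f}")
    # With n=100, anything between roughly 0.40 and 0.60 is consistent with chance.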

knagrecha commented 10 months ago

Yeah I figured the eval set size seemed small but assumed that the line in the README would work directly. Might test it out again later with a larger eval size.

knagrecha commented 10 months ago

10x'd the train/test sizes. New results on sciq with gpt2-med and llama-7b after a quick run:

GPT-2-Med ending acc: 0.661 +/- 0.006694460396477075

LLaMA-7B ending acc (gt): 0.866 +/- 0.015234434679370284

LLaMA-7B ending acc (transfer): 0.704 +/- 0.020414896521902825

Looks nice! Pretty closely aligned with the Qwen results, with slightly lower transfer efficacy. Hope others will add their OSS model eval results soon too.

Would suggest increasing the n_docs/n_test_docs values in the README command; the current values seem pretty low.

WuTheFWasThat commented 10 months ago

haha yeah, they are low! can update that

things to generally keep in mind:

knagrecha commented 10 months ago

Off-topic, but I am curious how you are thinking about labeling by a weak supervisor vs. criticism/scoring by a weak supervisor. I guess there is an argument in both directions as to whether labeling or criticism is easier for a weak model.

knagrecha commented 10 months ago

I guess criticism may introduce even more noise due to hallucinations, but if alignment is framed from the perspective of a “weaker human” supervising a strong model, criticism may intuitively be easier than labeling.

agokrani commented 10 months ago

I am having the same noise issue on my side. Could it be because of the way the classification head was initialized? The paper says the head is initialized with the embedding weights of the tokens "0" and "1", whereas in the code it seems like we initialize it differently.
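For reference, a minimal sketch of the paper-described initialization, i.e. copying the embedding rows for the tokens "0" and "1" into the two-way head. This is illustrative only, not the repo's code; the helper name is made up, and it assumes a Hugging Face GPT2ForSequenceClassification-style model (where `transformer.wte` is the tied token embedding matrix and `score` is the classification head):

    # Illustrative sketch only, not the repo's code. Assumes a Hugging Face
    # GPT2ForSequenceClassification-style model, where `transformer.wte` is the
    # (tied) token embedding matrix and `score` is the 2-way classification head.
    import torch
    from transformers import GPT2Tokenizer

    def init_head_from_label_tokens(model, tokenizer):
        # ids of the literal tokens "0" and "1" (each is a single token in the GPT-2 vocab)
        ids = [tokenizer.encode("0")[0], tokenizer.encode("1")[0]]
        assert model.score.weight.shape[0] == 2
        with torch.no_grad():
            model.score.weight.copy_(model.transformer.wte.weight[ids, :])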

WuTheFWasThat commented 10 months ago

I actually tried initializing using the unembeddings, and it didn't seem to help, but I didn't test very extensively. My hunch is it's not the issue.

By the way, there is substantial literature on the noisiness of fine-tuning, e.g. https://arxiv.org/pdf/2002.06305.pdf

agokrani commented 10 months ago

Will look at this, thanks a lot. It would be nice to know how you initialized with the unembedding weights.

WuTheFWasThat commented 10 months ago

Here's the code I used; I did it sort of hackily:

        # NOTE: this has to happen after the rest of the model is initialized
        # (GPT-2 ties its input embedding and unembedding, so wte doubles as the unembedding matrix)
        unemb = self.transformer.wte.weight.data
        assert self.num_labels == 2
        inds = [
            11491, # incorrect
            3376, # correct
        ]
        # copy the two unembedding rows into the 2-way classification head
        new_data = unemb[inds, :]
        self.score.weight.data.copy_(new_data)
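Note that the indices above are hard-coded GPT-2 BPE ids (per the inline comments, for the "incorrect" / "correct" tokens); with any other tokenizer or label wording they would need to be looked up via the tokenizer rather than reused.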

agokrani commented 10 months ago

Thank you so much @WuTheFWasThat, will test it