likenneth / honest_llama

Inference-Time Intervention: Eliciting Truthful Answers from a Language Model
MIT License

Potential Data Leakage in Probes Training #6

Closed jongjyh closed 1 year ago

jongjyh commented 1 year ago

Hello, I've been recently trying to reproduce the results from the paper, and while inspecting the code, I found a potentially incorrect implementation of cross-validation. Could you please help me verify if this issue indeed exists?

Replicate the problem

First, the split indices are generated at random:

        train_idxs = np.concatenate([fold_idxs[j] for j in range(args.num_fold) if j != i])
        test_idxs = fold_idxs[i]

        print(f"Running fold {i}")

        # pick a val set using numpy
        train_set_idxs = np.random.choice(train_idxs, size=int(len(train_idxs)*(1-args.val_ratio)), replace=False)
        val_set_idxs = np.array([x for x in train_idxs if x not in train_set_idxs])

        # save train and test splits
        df.iloc[train_set_idxs].to_csv(f"splits/fold_{i}_train_seed_{args.seed}.csv", index=False)
        df.iloc[val_set_idxs].to_csv(f"splits/fold_{i}_val_seed_{args.seed}.csv", index=False) # new index
        df.iloc[test_idxs].to_csv(f"splits/fold_{i}_test_seed_{args.seed}.csv", index=False)

and then you fetched the activation values from the saved activation file according to this newly generated index.

def get_com_directions(num_layers, num_heads, train_set_idxs, val_set_idxs, separated_head_wise_activations, separated_labels): 

    com_directions = []

    for layer in range(num_layers): 
        for head in range(num_heads): 
            usable_idxs = np.concatenate([train_set_idxs, val_set_idxs], axis=0)
            usable_head_wise_activations = np.concatenate([separated_head_wise_activations[i][:,layer,head,:] for i in usable_idxs], axis=0)
            usable_labels = np.concatenate([separated_labels[i] for i in usable_idxs], axis=0)
            true_mass_mean = np.mean(usable_head_wise_activations[usable_labels == 1], axis=0)
            false_mass_mean = np.mean(usable_head_wise_activations[usable_labels == 0], axis=0)
            com_directions.append(true_mass_mean - false_mass_mean)
    com_directions = np.array(com_directions)

    return com_directions

However, the fetched activations do not seem to correspond to these indices, so the lookup may pull in data from the test set you just split off. I believe this could lead to data leakage.

likenneth commented 1 year ago

Hi there, you should provide evidence and be responsible for what you say.

The exact usable_idxs is used to index activations, how come "do not seem to match this index"?

Thanks, KL

jongjyh commented 1 year ago

Let's take an example of two-fold cross-validation.

First, the activations are computed and saved following the TruthfulQA-mc2 order on Hugging Face; call this the first set of indices.

Then the data is shuffled before training. The same index values are reused, but they now point to different rows, so call this the second set of indices.

    df = df.sample(frac=1, random_state=args.seed).reset_index(drop=True)

The test set is then rows [420, 840], and the training and validation sets together are rows [0, 419]. The problem is that the training activations are read from the previously saved .npy file using these row positions as if they were the original indices.

For instance, suppose the 1st data point is shuffled to position 450 in the second set of indices. It should then be used only as a test data point. However, when we read activations, we still use index 1 to fetch them (even though that data point has moved to the test set) and train the probes on them. At test time we fetch the 450th question, which is exactly the data point stored at index 1 in the .npy file, so this could lead to leakage. This is my understanding of the code, which may differ from the actual execution. Please correct me if I am wrong, and I will delete this question immediately.
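The scenario above can be reproduced with a toy simulation. This is only a sketch of the mismatch as I understand it, not the repository's code: the function name, the 840-question count, and the plain two-fold split over row positions are all assumptions made for illustration.

```python
import random


def simulate_split(num_questions=840, seed=42):
    """Toy model of the index mismatch; not the repository's code."""
    rng = random.Random(seed)

    # Activations are saved in the original order: slot p holds question p.
    # After shuffling, row r of the dataframe holds question shuffled[r].
    shuffled = list(range(num_questions))
    rng.shuffle(shuffled)

    # Two-fold split over row positions of the shuffled dataframe.
    half = num_questions // 2
    train_rows = range(half)
    test_rows = range(half, num_questions)

    # Questions that are actually evaluated in the test fold.
    test_questions = {shuffled[r] for r in test_rows}

    # Mismatch: row positions used directly as activation-file indices.
    mismatched_train = set(train_rows)
    # Consistent lookup: map each row back to its original question index.
    consistent_train = {shuffled[r] for r in train_rows}

    return (len(test_questions & mismatched_train),
            len(test_questions & consistent_train))


leaked, leaked_after_fix = simulate_split()
print(f"test questions seen by probes: {leaked} (mismatched), "
      f"{leaked_after_fix} (consistent)")
```

Under these assumptions, a sizeable share of the test questions also ends up in the probe training data when row positions are used directly, and none does once every lookup goes through the same ordering.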


likenneth commented 1 year ago

Hi, thanks for detailing the problem! I just pushed an update to this repo that sorts the CSV file loaded from the TruthfulQA repo into the same order as Hugging Face, which is the order the features were saved in.
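The idea of that fix, roughly sketched below, is to reorder the CSV rows so that row i lines up with feature i. This is an illustrative sketch rather than the actual commit: `reorder_to_reference`, the `"Question"` key, and the sample data are hypothetical.

```python
def reorder_to_reference(rows, reference_questions, key="Question"):
    """Return rows reordered so that rows[i][key] == reference_questions[i]."""
    by_question = {row[key]: row for row in rows}
    missing = [q for q in reference_questions if q not in by_question]
    if missing:
        raise KeyError(f"{len(missing)} reference questions not found in rows")
    return [by_question[q] for q in reference_questions]


# Hypothetical data: CSV rows in one order, saved features in another.
csv_rows = [{"Question": "q2"}, {"Question": "q0"}, {"Question": "q1"}]
feature_order = ["q0", "q1", "q2"]
aligned = reorder_to_reference(csv_rows, feature_order)
print([row["Question"] for row in aligned])
```

Once the rows are aligned this way, an index means the same data point in both the CSV and the saved features, so any later shuffle has to be applied to both or to neither.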

I ran some experiments and the results don't change much, perhaps because there are too few learnable parameters (~6k when K is 48) to overfit.
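One way to arrive at the ~6k figure, assuming it counts a single direction vector per selected head and using LLaMA-7B's head dimension (hidden size 4096 split across 32 heads); the variable names are only for illustration:

```python
num_selected_heads = 48      # K, the number of intervened heads
head_dim = 4096 // 32        # LLaMA-7B: hidden size / attention heads per layer
num_direction_params = num_selected_heads * head_dim
print(num_direction_params)  # 6144
```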

jongjyh commented 1 year ago

Congratulations! :)

I have a presumptuous request and hope you can help. I tried to replicate the results from the paper but could not reproduce the ITI numbers. I basically followed the instructions in the repo.


Here is what I did:

# get activations.

# validations
CUDA_VISIBLE_DEVICES=0 python3 $model --num_heads  $head --alpha $alpha --device 0 --num_fold 2 --judge_name $true  --info_name $info


I got the following ITI and baseline (no intervention) results:

Field                Run 1                                        Run 2
Name                 llama_7B_seed_42_top_48_heads_alpha_15_fp32  llama_7B_seed_42_top_48_heads_alpha_15_com_fp32
State                finished                                     finished
Notes                -                                            -
User                 jongjyh                                      jongjyh
Tags                 (empty)                                      (empty)
Created              2023-07-03T07:05:57.000Z                     2023-07-02T09:23:13.000Z
Runtime              595                                          5
Sweep                (empty)                                      (empty)
activations_dataset  tqa_gen_end_q                                tqa_gen_end_q
alpha                15                                           15
dataset_name         tqa_mc2                                      tqa_mc2
device               0                                            0
eval                 TRUE                                         TRUE
fp16                 FALSE                                        FALSE
info_name            curie:ft-personal-2023-06-25-10-39-37        curie:ft-personal-2023-06-25-10-39-37
judge_name           curie:ft-personal-2023-06-25-11-44-57        curie:ft-personal-2023-06-25-11-44-57
model_name           llama_7B                                     llama_7B
num_fold             2                                            2
num_heads            48                                           48
offline              TRUE                                         TRUE
seed                 42                                           42
use_center_of_mass   FALSE                                        TRUE
use_coef             FALSE                                        FALSE
use_prefix           FALSE                                        FALSE
use_random_dir       FALSE                                        FALSE
val_ratio            0.2                                          0.2
CE Loss              2.13329798                                   2.400817971
Info Score           0.966953713                                  0.962048756
KL wrt Original      0                                            0.294551133
MC1 Score            0.25582782                                   0.272975694
MC2 Score            0.405372826                                  0.425765185
True Score           0.305992018                                  0.304835443
True*Info Score      0.295880118                                  0.293266558

Did I miss anything?

likenneth commented 1 year ago

Hi, here is what I get from running my code with the default hyper-parameters, averaged over seed 1 through 5.

          True        Info        MC1         MC2         CE          KL
w/ ITI    0.4482981   0.92875617  0.2883893   0.45113669  2.40703174  0.26517357
w/o ITI   0.31580193  0.96695072  0.25705031  0.40542086  2.16346875  0.

From the information you gave me, it's hard to guess what you have missed, isn't it? But anyways, hope you agree that the data leakage problem has been fixed.

jongjyh commented 1 year ago

Sure, thank you for the quick follow-up! It's interesting work indeed! :)

A-Raafat commented 1 year ago

> Hi, here is what I get from running my code with the default hyper-parameters, averaged over seed 1 through 5.
>
>           True        Info        MC1         MC2         CE          KL
> w/ ITI    0.4482981   0.92875617  0.2883893   0.45113669  2.40703174  0.26517357
> w/o ITI   0.31580193  0.96695072  0.25705031  0.40542086  2.16346875  0.
>
> From the information you gave me, it's hard to guess what you have missed, isn't it? But anyways, hope you agree that the data leakage problem has been fixed.

Hello, how do you get the results for w/o ITI? Do you manually set interventions = {} in the alt_tqa_evaluate function?

I also have another question: how do you save the new model after changing the activation directions?