likenneth / honest_llama

Inference-Time Intervention: Eliciting Truthful Answers from a Language Model

Potential Data Leakage in Probes Training #6

Closed jongjyh closed 1 year ago

jongjyh commented 1 year ago

Hello, I've been recently trying to reproduce the results from the paper, and while inspecting the code, I found a potentially incorrect implementation of cross-validation. Could you please help me verify if this issue indeed exists?

Replicate the problem

First, you generate random split indices in validate_2fold.py:

        train_idxs = np.concatenate([fold_idxs[j] for j in range(args.num_fold) if j != i])
        test_idxs = fold_idxs[i]

        print(f"Running fold {i}")

        # pick a val set using numpy
        train_set_idxs = np.random.choice(train_idxs, size=int(len(train_idxs)*(1-args.val_ratio)), replace=False)
        val_set_idxs = np.array([x for x in train_idxs if x not in train_set_idxs])

        # save train and test splits
        df.iloc[train_set_idxs].to_csv(f"splits/fold_{i}_train_seed_{args.seed}.csv", index=False)
        df.iloc[val_set_idxs].to_csv(f"splits/fold_{i}_val_seed_{args.seed}.csv", index=False) # new index
        df.iloc[test_idxs].to_csv(f"splits/fold_{i}_test_seed_{args.seed}.csv", index=False)

and then you fetch the activation values from the saved activation file according to these newly generated indices:

def get_com_directions(num_layers, num_heads, train_set_idxs, val_set_idxs, separated_head_wise_activations, separated_labels): 

    com_directions = []

    for layer in range(num_layers): 
        for head in range(num_heads): 
            usable_idxs = np.concatenate([train_set_idxs, val_set_idxs], axis=0)
            usable_head_wise_activations = np.concatenate([separated_head_wise_activations[i][:,layer,head,:] for i in usable_idxs], axis=0)
            usable_labels = np.concatenate([separated_labels[i] for i in usable_idxs], axis=0)
            true_mass_mean = np.mean(usable_head_wise_activations[usable_labels == 1], axis=0)
            false_mass_mean = np.mean(usable_head_wise_activations[usable_labels == 0], axis=0)
            com_directions.append(true_mass_mean - false_mass_mean)
    com_directions = np.array(com_directions)

    return com_directions

However, the fetched activation values do not seem to match these indices, and the code may end up fetching data from the test set you just split off. I believe this might lead to data leakage.

likenneth commented 1 year ago

Hi there, you should provide evidence and be responsible for what you say.

The exact usable_idxs is used to index the activations, so how can they "not seem to match this index"?

Thanks, KL

jongjyh commented 1 year ago

Let's take an example of two-fold cross-validation.

First, the activations are computed and saved in the TruthfulQA-mc2 order from Hugging Face; call this the first set of indices.

Then the data is shuffled during training. Although the same index values are used, call this the second set of indices, because each position now points to completely different data:

    df = df.sample(frac=1, random_state=args.seed).reset_index(drop=True)

The test set is then positions [420, 840] and the train/val sets are positions [0, 419]. The problem is that the training activations are read from the previously saved .npy file, which is stored in the first (Hugging Face) order, using these positions directly.

For instance, suppose the 1st data point is shuffled to the 450th position under the second set of indices, so it should now be a test point. When we read activations for probe training, however, we still fetch index 1 from the .npy file (even though that question has moved to the test set) and train probes on it; at test time we fetch the 450th question, which is exactly the same question as index 1 in the .npy file, so it could leak. This is my understanding of the code, which may differ from the actual execution. Please correct me if I am wrong, and I will delete this issue immediately.
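To make the concern concrete, here is a toy sketch (not the repo code, just an illustration with made-up sizes) of how indexing the Hugging-Face-ordered activations with post-shuffle positions can touch held-out questions:

```python
import numpy as np
import pandas as pd

# Toy stand-in: 10 questions whose activations were saved in the original Hugging Face order.
df = pd.DataFrame({"hf_idx": np.arange(10)})     # remembers each row's original position

# validate_2fold.py shuffles the dataframe and drops the old index.
df = df.sample(frac=1, random_state=42).reset_index(drop=True)

# A 2-fold split expressed as positions in the *shuffled* order.
train_idxs = np.arange(0, 5)
test_idxs = np.arange(5, 10)

held_out = set(df.iloc[test_idxs]["hf_idx"])     # questions actually reserved for testing
fetched = set(train_idxs)                        # rows pulled from the HF-ordered .npy file for training

# Any overlap means probes are trained on activations of held-out questions.
print(held_out & fetched)
```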

Thanks.

likenneth commented 1 year ago

Hi, thanks for detailing the problem! I just pushed an update to this repo that sorts the CSV file loaded from the TruthfulQA repo into the same order as the Hugging Face dataset, which is the order the features were saved in.
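For anyone reading along, one way such a realignment could look is sketched below. This is an illustration, not necessarily the exact change that was pushed, and it assumes the CSV's "Question" column matches the Hugging Face "question" field verbatim:

```python
import pandas as pd
from datasets import load_dataset

# Reorder the TruthfulQA CSV so its rows follow the Hugging Face order that the
# activations were saved in.
hf_questions = load_dataset("truthful_qa", "multiple_choice")["validation"]["question"]
position = {q: i for i, q in enumerate(hf_questions)}

df = pd.read_csv("TruthfulQA.csv")
df = df.sort_values(by="Question", key=lambda col: col.map(position)).reset_index(drop=True)
```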

I ran some experiments and the results don't change much, perhaps because there are too few learnable parameters (~6k when K is 48) to overfit.

jongjyh commented 1 year ago

Congratulations! :)

I have a presumptuous request and hope you can help me. I tried to replicate the results from the paper but failed to reproduce the ITI numbers, even though I basically followed the repo's instructions.

code

Here is what I did:

# get activations.
CUDA_VISIBLE_DEVICES=3 HF_DATASETS_OFFLINE=1 python3.8 get_activations.py llama_7B tqa_mc2

# validation
# (hyper-parameters per the runs below: head=48, alpha=15; $true/$info are the fine-tuned curie GPT-judge/GPT-info models)
model="llama_7B"
CUDA_VISIBLE_DEVICES=0 python3 validate_2fold.py $model --num_heads $head --alpha $alpha --device 0 --num_fold 2 --judge_name $true --info_name $info

results

I got the following ITI and baseline (no intervention) results:

| Field | llama_7B_seed_42_top_48_heads_alpha_15_fp32 | llama_7B_seed_42_top_48_heads_alpha_15_com_fp32 |
| --- | --- | --- |
| State | finished | finished |
| User | jongjyh | jongjyh |
| Created | 2023-07-03T07:05:57.000Z | 2023-07-02T09:23:13.000Z |
| Runtime | 595 | 5 |
| activations_dataset | tqa_gen_end_q | tqa_gen_end_q |
| alpha | 15 | 15 |
| dataset_name | tqa_mc2 | tqa_mc2 |
| device | 0 | 0 |
| eval | TRUE | TRUE |
| fp16 | FALSE | FALSE |
| info_name | curie:ft-personal-2023-06-25-10-39-37 | curie:ft-personal-2023-06-25-10-39-37 |
| judge_name | curie:ft-personal-2023-06-25-11-44-57 | curie:ft-personal-2023-06-25-11-44-57 |
| model_name | llama_7B | llama_7B |
| num_fold | 2 | 2 |
| num_heads | 48 | 48 |
| offline | TRUE | TRUE |
| seed | 42 | 42 |
| use_center_of_mass | FALSE | TRUE |
| use_coef | FALSE | FALSE |
| use_prefix | FALSE | FALSE |
| use_random_dir | FALSE | FALSE |
| val_ratio | 0.2 | 0.2 |
| CE Loss | 2.13329798 | 2.400817971 |
| Info Score | 0.966953713 | 0.962048756 |
| KL wrt Original | 0 | 0.294551133 |
| MC1 Score | 0.25582782 | 0.272975694 |
| MC2 Score | 0.405372826 | 0.425765185 |
| True Score | 0.305992018 | 0.304835443 |
| True*Info Score | 0.295880118 | 0.293266558 |

Did I miss anything?
Thanks!

likenneth commented 1 year ago

Hi, here is what I get from running my code with the default hyper-parameters, averaged over seed 1 through 5.

|         | True       | Info       | MC1        | MC2        | CE         | KL         |
|---------|------------|------------|------------|------------|------------|------------|
| w/ ITI  | 0.4482981  | 0.92875617 | 0.2883893  | 0.45113669 | 2.40703174 | 0.26517357 |
| w/o ITI | 0.31580193 | 0.96695072 | 0.25705031 | 0.40542086 | 2.16346875 | 0.         |

From the information you gave me, it's hard to guess what you have missed, isn't it? But anyway, I hope you agree that the data leakage problem has been fixed.

jongjyh commented 1 year ago

Sure, thank you for the quick follow-up! It's interesting work indeed! :)

A-Raafat commented 1 year ago

> Hi, here is what I get from running my code with the default hyper-parameters, averaged over seed 1 through 5.
>
> |         | True       | Info       | MC1        | MC2        | CE         | KL         |
> |---------|------------|------------|------------|------------|------------|------------|
> | w/ ITI  | 0.4482981  | 0.92875617 | 0.2883893  | 0.45113669 | 2.40703174 | 0.26517357 |
> | w/o ITI | 0.31580193 | 0.96695072 | 0.25705031 | 0.40542086 | 2.16346875 | 0.         |
>
> From the information you gave me, it's hard to guess what you have missed, isn't it? But anyway, I hope you agree that the data leakage problem has been fixed.

Hello, how do you get the results for w/o ITI? Do you manually pass interventions = {} to the alt_tqa_evaluate function?

Also, I have another question: how do you save the new model after changing the activation directions?