timoschick / pet

This repository contains the code for "Exploiting Cloze Questions for Few-Shot Text Classification and Natural Language Inference"
https://arxiv.org/abs/2001.07676
Apache License 2.0

Issues in personalized task #18

Closed · innovativeC closed this issue 3 years ago

innovativeC commented 3 years ago

Hi guys, PET is a great concept that you have introduced and I am really excited to be working on it. However, I am facing an error while running a personalized task:

python cli.py --method pet --pattern_ids 0 --data_dir D:\personal_projects --model_type albert --model_name_or_path albert-xxlarge-v2 --task_name pycode-generation --output_dir D:\personal_projects\output --do_train --do_eval

2020-12-17 17:22:45,587 - INFO - cli - Parameters: Namespace(adam_epsilon=1e-08, alpha=0.9999, cache_dir='', data_dir='D:\\personal_projects', decoding_strategy='default', do_eval=True, do_train=True, eval_set='dev', ipet_generations=3, ipet_logits_percentage=0.25, ipet_n_most_likely=-1, ipet_scale_factor=5, learning_rate=1e-05, lm_training=False, logging_steps=50, max_grad_norm=1.0, method='pet', model_name_or_path='albert-xxlarge-v2', model_type='albert', no_cuda=False, no_distillation=False, output_dir='D:\\personal_projects\\output', overwrite_output_dir=False, pattern_ids=[0], pet_gradient_accumulation_steps=1, pet_max_seq_length=256, pet_max_steps=-1, pet_num_train_epochs=3, pet_per_gpu_eval_batch_size=8, pet_per_gpu_train_batch_size=4, pet_per_gpu_unlabeled_batch_size=4, pet_repetitions=3, priming=False, reduction='wmean', sc_gradient_accumulation_steps=1, sc_max_seq_length=256, sc_max_steps=-1, sc_num_train_epochs=3, sc_per_gpu_eval_batch_size=8, sc_per_gpu_train_batch_size=4, sc_per_gpu_unlabeled_batch_size=4, sc_repetitions=1, seed=42, split_examples_evenly=False, task_name='pycode-generation', temperature=2, test_examples=-1, train_examples=-1, unlabeled_examples=-1, verbalizer_file=None, warmup_steps=0, weight_decay=0.01, wrapper_type='mlm')
2020-12-17 17:22:45,590 - INFO - tasks - Creating features from dataset file at D:\personal_projects (num_examples=-1, set_type=train)
2020-12-17 17:22:45,597 - INFO - tasks - Returning 26 train examples with label dist.: [('code', 1), ('1', 7), ('2', 7), ('3', 5), ('4', 6)]
2020-12-17 17:22:45,597 - INFO - tasks - Creating features from dataset file at D:\personal_projects (num_examples=-1, set_type=dev)
2020-12-17 17:22:45,600 - INFO - tasks - Returning 16 dev examples with label dist.: [('code', 1), ('1', 4), ('3', 3), ('2', 4), ('4', 4)]
2020-12-17 17:22:45,601 - INFO - tasks - Creating features from dataset file at D:\personal_projects (num_examples=-1, set_type=unlabeled)
2020-12-17 17:22:45,605 - INFO - tasks - Returning 6 unlabeled examples with label dist.: [('1', 6)]
2020-12-17 17:22:55,331 - INFO - wrapper - Writing example 0
Traceback (most recent call last):
  File "cli.py", line 455, in <module>
    main()
  File "cli.py", line 436, in main
    no_distillation=args.no_distillation, seed=args.seed)
  File "D:\personal_projects\pet\modeling.py", line 249, in train_pet
    save_unlabeled_logits=not no_distillation, seed=seed)
  File "D:\personal_projects\pet\modeling.py", line 355, in train_pet_ensemble
    unlabeled_data=unlabeled_data))
  File "D:\personal_projects\pet\modeling.py", line 434, in train_single_model
    results_dict['train_set_before_training'] = evaluate(model, train_data, eval_config)['scores']['acc']
  File "D:\personal_projects\pet\modeling.py", line 490, in evaluate
    n_gpu=config.n_gpu, decoding_strategy=config.decoding_strategy, priming=config.priming)
  File "D:\personal_projects\pet\wrapper.py", line 352, in eval
    eval_dataset = self._generate_dataset(eval_data, priming=priming)
  File "D:\personal_projects\pet\wrapper.py", line 399, in _generate_dataset
    features = self._convert_examples_to_features(data, labelled=labelled, priming=priming)
  File "D:\personal_projects\pet\wrapper.py", line 424, in _convert_examples_to_features
    input_features = self.preprocessor.get_input_features(example, labelled=labelled, priming=priming)
  File "D:\personal_projects\pet\preprocessor.py", line 83, in get_input_features
    label = self.label_map[example.label] if example.label is not None else -100
KeyError: 'code'

Could you explain why I am getting this KeyError?

timoschick commented 3 years ago

Hi @innovativeC, the label map (that throws the KeyError) is initialized as follows:

self.label_map = {label: i for i, label in enumerate(self.wrapper.config.label_list)}

where self.wrapper.config.label_list is the list of labels that your TaskProcessor's get_labels() method returns. You are getting this error because one of your training examples has the label 'code', but this label is not one of the labels defined in your TaskProcessor. See also here.
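For illustration only (this is not code from the repository, just a sketch of the behaviour described above, assuming get_labels() returns the four real labels from your "code" column):

label_list = ['1', '2', '3', '4']  # what your TaskProcessor's get_labels() should return
label_map = {label: i for i, label in enumerate(label_list)}

label_map['3']     # -> 2, a known label maps to its index
label_map['code']  # -> KeyError: 'code', because the header string is not in get_labels()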

innovativeC commented 3 years ago

@timoschick, so what is the difference between Label_Column and Labels, which have to be initialized in my task processor? I am asking this because my target label column is named "code" and the list of labels in the "code" column is ['1','2','3','4'].

timoschick commented 3 years ago

Hi @innovativeC, I guess you are referring to the variables in this example? This example assumes that your CSV file has no header, so it may well be that your file's header is processed as if it were a regular example, and thus the script assumes that 'code' is also a label (while in fact, it is the header of the column that contains the labels).

If this is indeed the issue, you may simply skip the first line when processing your input file, for example by adding

if idx == 0:
    continue

after line 101 in the example script.
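For context, here is a hedged sketch of how that skip might look inside a processor's example-creation loop. The file layout (label in the first column, text in the second), the helper name and the import path for InputExample are assumptions, so adapt them to your own script:

import csv

from pet.utils import InputExample  # assumed import path for InputExample


def _create_examples(path, set_type):
    examples = []
    with open(path, encoding='utf8') as f:
        for idx, row in enumerate(csv.reader(f, delimiter=',')):
            if idx == 0:
                continue  # skip the header row ('code', ...) so it is never treated as an example
            label, text = row
            examples.append(InputExample(guid=f'{set_type}-{idx}', text_a=text, label=label))
    return examples

With the header row skipped, the label distribution logged at startup should no longer contain the spurious ('code', 1) entry.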