nitaytech / DoCoGen


KeyError: 'Unknown' #6

Open amdongyang opened 2 years ago

amdongyang commented 2 years ago

When I use my own data, I simply set two domains ('news' and 'talk'): 'news' contains about 2,200,000 sentences and 'talk' contains about 450,000 sentences. The 'split' key is set to 'unlabeled', 'train', or 'validation', with proportions of roughly 9 : 0.5 : 0.5 (unlabeled : train : validation). The configs are set the same as "reviews.json". When I run the code to train a generator, the error below occurs, and I don't know what causes it.

100%|██████████| 508/508 [4:43:04<00:00, 33.43s/it]
Traceback (most recent call last):
  File "code/main.py", line 31, in <module>
    build_train_generate(args.configs_path)
  File "code/main.py", line 11, in build_train_generate
    configs = build_datasets(configs=configs,
  File "/home/lyfan/robust_nmt/DoCoGen/code/configs_and_pipelines/pipelines.py", line 101, in build_datasets
    language_masker = language_masker_runner(dataset=training_dataset, configs=configs)
  File "/home/lyfan/robust_nmt/DoCoGen/code/configs_and_pipelines/pipelines.py", line 68, in language_masker_runner
    language_masker.add_orientations_top_words(n_orientations=configs.n_orientations,
  File "/home/lyfan/robust_nmt/DoCoGen/code/modeling/masking/masker.py", line 377, in add_orientations_top_words
    top_words = self.top_values_words(n_orientations, with_value_name, top_occurrences_threshold, concepts)
  File "/home/lyfan/robust_nmt/DoCoGen/code/modeling/masking/masker.py", line 360, in top_values_words
    top_words[first_concept][first_value].append((word, score))
KeyError: 'Unknown'
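For context, the failing line appends into a nested dictionary of top words keyed by concept values. The following is only a minimal, hypothetical sketch of that pattern (not the actual masker.py code), showing how a value such as 'Unknown' that was never registered as a key triggers exactly this KeyError:

# Hypothetical sketch of the failing pattern, not DoCoGen's real code.
# The nested dict only has keys for the configured concept values.
top_words = {"domain": {"news": [], "talk": []}}

first_concept, first_value = "domain", "Unknown"  # a value that was never registered
word, score = "example", 0.9

# Appending for an unregistered value fails just like the traceback above:
top_words[first_concept][first_value].append((word, score))  # KeyError: 'Unknown'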

nitaytech commented 2 years ago

I will need your files in order to reproduce this error. Can you please try running the docogen_lite.py script?

amdongyang commented 2 years ago

OK, I can give you the Python file that is used to process the data (the data is for a translation task, and I only use the source side).

import json
from collections import defaultdict

idx_v7 = 0
idx_ted = 0
dic = defaultdict(list)

# Europarl (WMT14) sentences -> 'news' domain.
with open("data/wmt14/europarl-v7.en", 'r') as v7en_fl:
    for line in v7en_fl:
        dic['domain'].append('news')
        if idx_v7 < 1800000:
            dic['split'].append('unlabeled')
        elif idx_v7 < 2000000:  # inclusive boundary, so index 1800000 goes to 'train'
            dic['split'].append('train')
        else:
            dic['split'].append('validation')
        dic['text'].append(line.strip("\n"))
        idx_v7 += 1
        print("processing {}th wmt data".format(idx_v7))

# TED talk sentences -> 'talk' domain.
with open('data/tedtalk/mono_en.txt', 'r') as tedtalk_fl:
    for line in tedtalk_fl:
        dic['domain'].append('talk')
        if idx_ted < 370000:
            dic['split'].append('unlabeled')
        elif idx_ted < 410000:  # inclusive boundary, so index 370000 goes to 'train'
            dic['split'].append('train')
        else:
            dic['split'].append('validation')
        dic['text'].append(line.strip("\n"))
        idx_ted += 1
        print("processing {}th ted data".format(idx_ted))

with open("data/wmt_ted.json", 'w') as fl:
    fl.write(json.dumps(dic))

Both europarl-v7.en and mono_en.txt contain only monolingual English sentences.
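For completeness, here is a small sanity check (a hypothetical snippet, assuming the data/wmt_ted.json path above) that verifies the three lists are aligned and prints the domain and split counts:

import json
from collections import Counter

with open("data/wmt_ted.json", "r") as fl:
    data = json.load(fl)

# The three lists must stay aligned, one entry per sentence.
assert len(data["text"]) == len(data["domain"]) == len(data["split"])

# Inspect how many sentences each domain and (domain, split) pair contains.
print(Counter(data["domain"]))
print(Counter(zip(data["domain"], data["split"])))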

I will also try running the docogen_lite.py script.