Open amdongyang opened 2 years ago
I will need your files in order to reproduce this error. Can you please try running the docogen_lite.py script?
OK, I can give you the Python file used to process the data (the data is for a translation task, and I only use the source side).
import json
from collections import defaultdict

idx_v7 = 0
idx_ted = 0
dic = defaultdict(list)

with open("data/wmt14/europarl-v7.en", 'r') as v7en_fl:
    for line in v7en_fl.readlines():
        dic['domain'].append('news')
        if idx_v7 < 1800000:
            dic['split'].append('unlabeled')
        elif 1800000 < idx_v7 < 2000000:
            dic['split'].append('train')
        else:
            dic['split'].append('validation')
        dic['text'].append(line.strip("\n"))
        idx_v7 += 1
        print("processing {}th wmt data".format(idx_v7))

with open('data/tedtalk/mono_en.txt', 'r') as tedtalk_fl:
    for line in tedtalk_fl.readlines():
        dic['domain'].append('talk')
        if idx_ted < 370000:
            dic['split'].append('unlabeled')
        elif 370000 < idx_ted < 410000:
            dic['split'].append('train')
        else:
            dic['split'].append('validation')
        dic['text'].append(line.strip("\n"))
        idx_ted += 1
        print("processing {}th ted data".format(idx_ted))

with open("data/wmt_ted.json", 'w') as fl:
    fl.write(json.dumps(dic))
Both europarl-v7.en and mono_en.txt contain only monolingual sentences.
I will try running the docogen_lite.py script.
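A side note on the split thresholds in the script above: because the first test is `idx < 1800000` and the second is `1800000 < idx < 2000000`, the line at exactly index 1800000 (and likewise 370000 for the TED file) matches neither and falls through to 'validation'. This is probably harmless and may be unrelated to the error, but it is easy to check. The sketch below mirrors the posted comparisons next to a half-open variant; both function names are hypothetical and not part of DoCoGen.

```python
def assign_split_as_posted(idx, unlabeled_end, train_end):
    """Mirror the comparisons used in the preprocessing script."""
    if idx < unlabeled_end:
        return 'unlabeled'
    elif unlabeled_end < idx < train_end:
        return 'train'
    else:
        return 'validation'


def assign_split_half_open(idx, unlabeled_end, train_end):
    """Hypothetical variant using half-open ranges [0, unlabeled_end), [unlabeled_end, train_end)."""
    if idx < unlabeled_end:
        return 'unlabeled'
    elif idx < train_end:
        return 'train'
    return 'validation'


# The boundary index is the only place where the two versions disagree.
print(assign_split_as_posted(1800000, 1800000, 2000000))  # validation
print(assign_split_half_open(1800000, 1800000, 2000000))  # train
```

Only one sentence per file is affected, so the overall 9 : 0.5 : 0.5 proportion is essentially unchanged either way.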
With my own data, I simply set two domains ('news' and 'talk'): 'news' contains about 2,200,000 sentences and 'talk' about 450,000. The 'split' key takes the values 'unlabeled', 'train', and 'validation', in a proportion of roughly 9 : 0.5 : 0.5 (unlabeled : train : validation). The configs are set the same as in "reviews.json". When I run the code to train a generator, the error below occurs, and I don't know what causes it.
100%|██████████| 508/508 [4:43:04<00:00, 33.43s/it]
Traceback (most recent call last):
  File "code/main.py", line 31, in <module>
    build_train_generate(args.configs_path)
  File "code/main.py", line 11, in build_train_generate
    configs = build_datasets(configs=configs,
  File "/home/lyfan/robust_nmt/DoCoGen/code/configs_and_pipelines/pipelines.py", line 101, in build_datasets
    language_masker = language_masker_runner(dataset=training_dataset, configs=configs)
  File "/home/lyfan/robust_nmt/DoCoGen/code/configs_and_pipelines/pipelines.py", line 68, in language_masker_runner
    language_masker.add_orientations_top_words(n_orientations=configs.n_orientations,
  File "/home/lyfan/robust_nmt/DoCoGen/code/modeling/masking/masker.py", line 377, in add_orientations_top_words
    top_words = self.top_values_words(n_orientations, with_value_name, top_occurrences_threshold, concepts)
  File "/home/lyfan/robust_nmt/DoCoGen/code/modeling/masking/masker.py", line 360, in top_values_words
    top_words[first_concept][first_value].append((word, score))
KeyError: 'Unknown'
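For anyone reading the traceback: the failing statement is a nested dictionary lookup, `top_words[first_concept][first_value].append(...)`, and the missing key is the string 'Unknown'. The sketch below is not DoCoGen's code. It is a minimal illustration, with a made-up `top_words` table and a hypothetical `add_word` helper, of how this kind of lookup raises exactly this error when a value was never registered under its concept. Whether 'Unknown' originates in the data or in a default inside the masker cannot be determined from the traceback alone.

```python
# Hypothetical stand-in for the nested structure indexed in top_values_words;
# how DoCoGen actually initializes it is not shown in this thread.
top_words = {'domain': {'news': [], 'talk': []}}


def add_word(table, concept, value, word, score):
    """Append a (word, score) pair under table[concept][value]."""
    table[concept][value].append((word, score))


# A value that was registered up front works as expected.
add_word(top_words, 'domain', 'news', 'parliament', 0.9)

# A value that was never registered fails the same way as in the traceback.
try:
    add_word(top_words, 'domain', 'Unknown', 'the', 0.1)
except KeyError as err:
    print('KeyError:', err)  # KeyError: 'Unknown'
```

If the real structure is built from the labeled 'train' split only, comparing the set of domain values seen there with the values the masker iterates over would be a reasonable first check.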