nlpyang / PreSumm

code for EMNLP 2019 paper Text Summarization with Pretrained Encoders
MIT License

Step 4. Format to Simpler Json Files #43

Open fatmas1982 opened 5 years ago

fatmas1982 commented 5 years ago

Could you clarify what Step 4 (Format to Simpler Json Files) is supposed to do? My case: I have my own dataset and am trying to apply these steps to it. I have completed Step 3 (Sentence Splitting and Tokenization) and generated the JSON files, but for my own dataset Step 4 does nothing. After studying the code for Step 4, the function format_to_lines in data_builder.py, I see that it compares each of my JSON files by name against the mapping files in the urls directory. I think the issue is in this loop:

    for line in open(pjoin(args.map_path, 'mapping_' + corpus_type + '.txt')):
        temp.append(hashhex(line.strip()))
    corpus_mapping[corpus_type] = {key.strip(): 1 for key in temp}

The lengths of corpus_mapping[corpus_type] are: valid 13368, test 11490, train 287227.

    train_files, valid_files, test_files = [], [], []
    print("glob json", glob.glob(pjoin(args.raw_path, '*.json')))
    for f in glob.glob(pjoin(args.raw_path, '*.json')):
        print("f", f)
        real_name = f.split('/')[-1].split('.')[0]
        print("real_name", real_name)
        if (real_name in corpus_mapping['valid']):
            valid_files.append(f)
        elif (real_name in corpus_mapping['test']):
            test_files.append(f)
        elif (real_name in corpus_mapping['train']):
            train_files.append(f)
        # else:
        #     train_files.append(f)
    print("len train_files, valid_files, test_files ", len(train_files), len(valid_files), len(test_files))

Output: len train_files, valid_files, test_files 0 0 0. Could you help me?
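For context, hashhex in data_builder.py is a SHA-1 hexdigest of each URL, in the style of the CNN/DailyMail preprocessing scripts. The sketch below (the URL is hypothetical, chosen for illustration) shows why a custom file name can never match a mapping key, leaving all three lists empty:

```python
# Sketch of hashhex as used by the CNN/DailyMail preprocessing and
# PreSumm's data_builder.py: the SHA-1 hex digest of a story URL.
import hashlib

def hashhex(s):
    """Return the SHA-1 hex digest of string s."""
    h = hashlib.sha1()
    h.update(s.encode('utf-8'))
    return h.hexdigest()

# The mapping files contain story URLs; hashing one yields the base
# name of the corresponding .story file: a 40-character hex string.
# A custom base name like 'myfile' can never equal such a key, so
# every membership test against corpus_mapping fails.
key = hashhex('http://example.com/some-story')  # hypothetical URL
```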

cuthbertjohnkarawa commented 4 years ago

@fatmas1982 did you manage to resolve the issue

fatmas1982 commented 4 years ago

No

NandaKishoreJoshi commented 4 years ago

Even I'm facing the same issue. Let me know if anyone knows a solution, or any other way to preprocess the data.

fatmas1982 commented 4 years ago

I still don't know.

Ghani-25 commented 4 years ago

same problem

mmcmahon13 commented 4 years ago

I am also running into issues trying to preprocess my own data for fine-tuning. I'm not sure how I should format my mapping files for custom data.

mmcmahon13 commented 4 years ago

I was able to get mine working by removing the call to hashhex in temp.append(hashhex(line.strip())). The original code seems to hash the URLs in the mapping files to generate the file names that go into each set; I made it append the raw file names instead. Not sure if that helps.

AanchalA commented 4 years ago

I was able to get mine working by removing the call to hashhex in temp.append(hashhex(line.strip())); the original code seems to hash the URLs in the mapping files to generate the filenames to go into each set. I instead made it append the raw file names, not sure if that helps

I tried removing the call to hashhex in temp.append(hashhex(line.strip())), but there is no difference; I'm still getting nothing.

AanchalA commented 4 years ago

For me, the real_name variable was not getting set correctly because I'm working on a Windows machine, and Windows uses '\' instead of '/' in its paths. So in format_to_lines(args) I changed real_name = f.split('/')[-1].split('.')[0] to real_name = f.split('\\')[-1].split('.')[0], and it worked for me.
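A platform-independent variant of that fix (my own sketch, not from the repo) is to let os.path handle the separator. Since glob returns paths using the host OS's separator, os.path.basename splits on exactly the right character on either system:

```python
import os

def real_name_of(path):
    # glob.glob returns paths with the host OS's separator ('/' on
    # Linux, '\\' on Windows), and os.path.basename splits on exactly
    # that, so no hard-coded separator is needed. split('.')[0] keeps
    # the original code's handling of multi-dot names like 'x.story.json'.
    return os.path.basename(path).split('.')[0]
```

This could replace the hard-coded f.split('/') in format_to_lines and behave the same on Linux and Windows.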

imJiawen commented 4 years ago

Hi there, I faced the same issue and found that it is caused by the following code in data_builder.py:

    # build the corpus_mapping dict according to the files in urls
    corpus_mapping = {}
    for corpus_type in ['valid', 'test', 'train']:
        temp = []
        for line in open(pjoin(args.map_path, 'mapping_' + corpus_type + '.txt')):
            temp.append(hashhex(line.strip()))
        corpus_mapping[corpus_type] = {key.strip(): 1 for key in temp}

    train_files, valid_files, test_files = [], [], []
    for f in glob.glob(pjoin(args.raw_path, '*.json')):
        real_name = f.split('/')[-1].split('.')[0]
        #since the name of our datafile is not in the corpus_mapping dict, all the following conditions would not be satisfied
        if (real_name in corpus_mapping['valid']):
            valid_files.append(f)
        elif (real_name in corpus_mapping['test']):
            test_files.append(f)
        elif (real_name in corpus_mapping['train']):
            train_files.append(f)

Since we use our own dataset, its file names do not appear in the corpus_mapping dict, which exists only to split the CNN/DailyMail data into train/test/valid. Therefore, train_files, test_files, and valid_files all end up as [].

For me, I removed the if (real_name in corpus_mapping['XXX']): conditions and set a ratio for data splitting instead, e.g.:

    cur = 0
    valid_test_ratio = 0.01
    all_size = len(glob.glob(pjoin(args.raw_path, '*.json')))
    for f in glob.glob(pjoin(args.raw_path, '*.json')):
        real_name = f.split('/')[-1].split('.')[0]
        if (cur < valid_test_ratio*all_size):
            valid_files.append(f)
        elif (cur < valid_test_ratio*2*all_size):
            test_files.append(f)
        else:
            train_files.append(f)
        cur += 1

It works for me. :)

kush-2418 commented 2 years ago


The ratio-based split above worked for me. Thanks!!!

WSChange commented 6 months ago

For me, the 'real_name' variable was not getting set because I'm working on a windows machine and windows uses '\' instead of '/' in its path. So, in format_to_lines(args) when i changed real_name = f.split('/')[-1].split('.')[0] to real_name = f.split('\\')[-1].split('.')[0]. It worked for me.

Thanks! It works in my project!