nishkalavallabhi / OneStopEnglishCorpus

Creative Commons Attribution Share Alike 4.0 International

Texts-Together-OneCSVperFile are not in UTF-8 #4

Open gsarti opened 4 years ago

gsarti commented 4 years ago

I am experiencing some difficulties when loading the files with Python's pandas library, since they do not appear to be encoded in standard UTF-8.

I tried using the charade library to detect the original encoding, and the suggested one is ISO-8859-2, but even converting from ISO-8859-2 to UTF-8 produces wrong characters.

Is it possible to address this? It would just require converting the set of CSVs to UTF-8 and replacing the ones currently on Zenodo.
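
For reference, the kind of conversion I have in mind would be something like the following sketch, using chardet (the maintained successor of charade); the folder name is the one from the Zenodo archive:

from glob import glob
import chardet

# Detect each file's encoding and rewrite it in place as UTF-8
for fname in glob("Texts-Together-OneCSVperFile/*.csv"):
    raw = open(fname, "rb").read()
    guess = chardet.detect(raw)  # e.g. {'encoding': 'ISO-8859-2', 'confidence': ...}
    text = raw.decode(guess["encoding"], errors="replace")
    with open(fname, "w", encoding="utf-8") as f:
        f.write(text)
    print(fname, "->", guess["encoding"])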

brucewlee commented 2 years ago

Just another user passing by. UTF-8 and ASCII don't work for sure; cp1252 is your best bet. Delete the WNL Scarlett file before you run the code.

But be mindful that cp1252 also produces some wrong characters.
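
Roughly what I mean (a sketch; paths are illustrative, and you should still spot-check the output for the remaining bad characters):

import pandas as pd
from glob import glob

# Read each per-article CSV with cp1252 and re-save it as UTF-8
for fname in glob("Texts-Together-OneCSVperFile/*.csv"):
    df = pd.read_csv(fname, encoding="cp1252")
    df.to_csv(fname, index=False, encoding="utf-8")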

mnbucher commented 1 year ago

Ran into the same issue today. It hasn't been solved as far as I can see. Really nasty to see noise introduced into the dataset due to misaligned encodings :(

brucewlee commented 1 year ago

On TextsSeparatedByReadingLevel, try this script. It will give you the output files that you need. @mnbucher

from glob import glob
import csv
import os
import pandas as pd
import random
import shutil

path_list = [("Adv-Txt",3),
            ("Int-Txt",2),
            ("Ele-Txt",1)
            ]

# num_of_texts_per_grade = 5

def txt2csv(path, level):
    """Convert one reading level's .txt files into a single CSV."""
    processed_list = []
    index = 0
    output_file_name = "output/" + path + ".csv"
    # newline="" prevents csv.writer from inserting blank rows on Windows
    output = open(output_file_name, "w", newline="", encoding="utf-8")
    writemachine = csv.writer(output)
    for fname in glob("OneStop_Raw/" + path + "/*.txt"):
        index += 1
        print("Analyzing", fname)
        target_file = open(fname, "r", encoding="utf-8", errors="ignore")
        target_text = target_file.read()
        # Drop whatever non-ASCII characters survived the mixed encodings
        target_text = target_text.encode("ascii", "ignore").decode()
        print("Adding", fname)
        writemachine.writerow([index, target_text, str(level), path])
        processed_list.append([index, target_text, str(level), path])
        target_file.close()
    output.close()
    return index, processed_list

def read_level_texts(path):
    """Read every .txt file for one reading level, stripping non-ASCII bytes."""
    texts = []
    for fname in glob("OneStop_Raw/" + path + "/*.txt"):
        print("Analyzing", fname)
        target_file = open(fname, "r", encoding="utf-8", errors="ignore")
        target_text = target_file.read().encode("ascii", "ignore").decode()
        target_file.close()
        print("Adding", fname)
        # Collapse whitespace and keep the file name for later alignment
        texts.append((" ".join(target_text.split()), fname))
    return texts

def pairwise_txt2csv(path_list):
    adv_list = read_level_texts("Adv-Txt")
    int_list = read_level_texts("Int-Txt")
    ele_list = read_level_texts("Ele-Txt")
    if len(adv_list) == len(int_list) == len(ele_list):
        print("LENGTH SAME")

    # Sort by file name so the three versions of the same text are paired
    adv_list.sort(key=lambda tup: tup[1])
    int_list.sort(key=lambda tup: tup[1])
    ele_list.sort(key=lambda tup: tup[1])

    all_list = []
    for idx in range(len(adv_list)):
        all_list.append({"3": adv_list[idx][0], "2": int_list[idx][0],
                         "1": ele_list[idx][0], "f3": adv_list[idx][1],
                         "f2": int_list[idx][1], "f1": ele_list[idx][1]})
        print(adv_list[idx][1], int_list[idx][1], ele_list[idx][1])
    print("pairwise number of texts:" + str(len(all_list)))
    return len(all_list), all_list

def final_output(index, processed_list):
    # Shuffle the rows before appending them to the combined CSV
    balanced_list = random.sample(processed_list, len(processed_list))
    print("...final_output...")
    output = open("output/final_output.csv", "a", newline="", encoding="utf-8")
    writemachine = csv.writer(output)
    for row in balanced_list:
        writemachine.writerow(row)
        print("writing..." + str(row[0]))
    output.close()

def pairwise_final_output(index, processed_list):
    # Write the aligned Adv/Int/Ele versions side by side
    this_df = pd.DataFrame(processed_list)
    this_df.to_csv("output/pairwise_output.csv", index=False)

if __name__ == '__main__':
    # Start from a clean output directory
    shutil.rmtree("output", ignore_errors=True)
    os.mkdir("output")
    # Truncate the combined CSV before the per-level appends
    open("output/final_output.csv", "w").close()
    level_counts = {}
    for path, level in path_list:
        index, processed_list = txt2csv(path, level)
        level_counts[level] = index
        final_output(index, processed_list)
    pairwise_index, pairwise_processed_list = pairwise_txt2csv(path_list)
    pairwise_final_output(pairwise_index, pairwise_processed_list)
    print("l1:" + str(level_counts[1]) + "\n" +
          "l2:" + str(level_counts[2]) + "\n" +
          "l3:" + str(level_counts[3]))

mnbucher commented 1 year ago

Hi @brucewlee, thanks for the code. I managed to write my own version and hope I minimized the encoding noise in it. I'm currently confused by the dataset size, though: the paper states that "The corpus consists of 189 texts, each in three versions (567 in total)", but I actually get 7278 samples after parsing all the TXT files. Has the dataset been massively extended since the original publication?

brucewlee commented 1 year ago

@mnbucher No, in my experience the corpus size is the same as stated in the paper. In the script above, use txt2csv rather than pairwise_txt2csv; the latter creates pairwise instances.
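
A quick way to sanity-check the corpus size (a sketch, using the folder names from the script above):

from glob import glob

# Each level should contain 189 files, i.e. 567 texts in total
for path in ["Adv-Txt", "Int-Txt", "Ele-Txt"]:
    print(path, len(glob("OneStop_Raw/" + path + "/*.txt")))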