Open shengzhang90 opened 4 years ago
You can try my code below. Call it with the input and output file paths as parameters, for example: convert_cws_format('{your path}/icwb2-data/training/pku_training.utf8', '{your path}/pku_training.txt')
def convert_cws_format(ori_file, tsv_file):
    # Map a character's position inside a word to its B/I/E/S tag.
    tag_dict = {"begin": "B",
                "inside": "I",
                "end": "E",
                "single": "S"
                }
    with open(tsv_file, 'w', encoding="utf8") as wf:
        with open(ori_file, 'r', encoding="utf8") as rf:
            for line in rf:
                # Words are separated by spaces; empty tokens from repeated spaces are skipped below.
                token_list = line.strip('\n').split(' ')
                cut_space_sen = line.replace(" ", "")
                tag_list = []
                for token in token_list:
                    if len(token) == 1:
                        tag_list.append(tag_dict['single'])
                    elif len(token) > 1:
                        token_len = len(token)
                        # First character gets B, middle characters get I, last character gets E.
                        while token_len > 1:
                            if token_len == len(token):
                                tag_list.append(tag_dict['begin'])
                            else:
                                tag_list.append(tag_dict['inside'])
                            token_len -= 1
                        tag_list.append(tag_dict['end'])
                tag_str = "".join(tag_list)
                # Every character of the unsegmented sentence must have exactly one tag.
                assert len(cut_space_sen.strip()) == len(tag_str)
                wf.write(cut_space_sen.strip() + '\t' + tag_str + '\n')
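For anyone who only needs the tagging step, here is a compact sketch of the same B/I/E/S logic applied to a single line. The helper names `bies_tags` and `tag_line` are hypothetical and are not part of the project. The sketch also assumes that splitting on any whitespace is acceptable, which differs slightly from the `split(' ')` used above.

```python
def bies_tags(token):
    """Return the B/I/E/S tag string for one segmented word (hypothetical helper)."""
    if not token:
        return ""
    if len(token) == 1:
        return "S"
    # First character is B, last is E, everything in between is I.
    return "B" + "I" * (len(token) - 2) + "E"


def tag_line(line):
    """Convert one space-segmented line into (unsegmented text, tag string)."""
    # Assumption: split on any whitespace, which also drops empty tokens.
    tokens = line.split()
    text = "".join(tokens)
    tags = "".join(bies_tags(t) for t in tokens)
    return text, tags


print(tag_line("我  喜欢  自然语言  处理"))
# ('我喜欢自然语言处理', 'SBEBIIEBE')
```

Each character gets exactly one tag, so the two returned strings always have the same length, which is the property the assert in the function above checks.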
First of all, thanks a lot for your open-source contribution. Could you please provide some Python scripts for converting the original official dataset formats to the TSV format? For example, XML to TSV for the MSRA NER task, and so on. That way, we could use your project much more conveniently.
Thanks a lot again.