sinovation / ZEN

A BERT-based Chinese Text Encoder Enhanced by N-gram Representations
Apache License 2.0

Fine-tuning datasets preparation #17

Open shengzhang90 opened 4 years ago

shengzhang90 commented 4 years ago

First, thanks a lot for your open-source contribution. Could you please provide some Python scripts for converting the original official dataset formats to the TSV format? For example, XML to TSV for the MSRA NER task, and so on. That way, we could use your project much more conveniently.

Thanks a lot again.

ChristopheZhao commented 2 years ago

You can try my code below. Call it with the input and output file paths, for example: convert_cws_format('{your path}/icwb2-data/training/pku_training.utf8', '{your path}/pku_training.txt')

def convert_cws_format(ori_file, tsv_file):
    """Convert a space-segmented corpus into 'sentence<TAB>tags' lines (B/I/E/S tags)."""

    tag_dict = {"begin": "B",
                "inside": "I",
                "end": "E",
                "single": "S"
                }

    with open(tsv_file, 'w', encoding="utf8") as wf:
        with open(ori_file, 'r', encoding="utf8") as rf:
            for line in rf:
                # split() with no argument splits on any run of whitespace and
                # drops empty strings, so double spaces and a trailing '\r' are safe
                token_list = line.split()
                if not token_list:
                    continue  # skip blank lines instead of writing an empty record
                cut_space_sen = "".join(token_list)
                tag_list = []
                for token in token_list:
                    if len(token) == 1:
                        tag_list.append(tag_dict['single'])
                    else:
                        # first character, len(token) - 2 inside characters, last character
                        tag_list.append(tag_dict['begin'])
                        tag_list.extend([tag_dict['inside']] * (len(token) - 2))
                        tag_list.append(tag_dict['end'])

                tag_str = "".join(tag_list)
                # exactly one tag per character
                assert len(cut_space_sen) == len(tag_str)
                wf.write(cut_space_sen + '\t' + tag_str + '\n')
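
As a quick sanity check of the B/I/E/S scheme used in the function above, here is a minimal sketch that tags a single segmented line in memory, without touching any files. The helper name tag_line and the sample sentence are made up for illustration; they are not part of ZEN or icwb2-data.

```python
def tag_line(line):
    """Return (text, tags) for one whitespace-segmented line, using B/I/E/S tags."""
    tokens = line.split()  # any run of whitespace separates two words
    tags = []
    for token in tokens:
        if len(token) == 1:
            tags.append("S")  # single-character word
        else:
            # begin, then len(token) - 2 inside characters, then end
            tags.append("B" + "I" * (len(token) - 2) + "E")
    return "".join(tokens), "".join(tags)


# Hypothetical sample line, segmented with double spaces as in the PKU corpus
text, tags = tag_line("我们  在  北京大学  学习")
print(text + "\t" + tags)  # 我们在北京大学学习	BESBIIEBE
```

Each character gets exactly one tag, so the two strings always have the same length; that is the same invariant the assert in the function checks.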