ntunlp / daga

Data Augmentation with a Generation Approach for Low-resource Tagging Tasks

the data generated has too many <unk> #12

Open LNdoremi opened 2 years ago

LNdoremi commented 2 years ago

Hi, I am using your method to generate synthetic data for NER. The datasets I use are CoNLL++ and CoNLL03, but I found that the generated data contains over 10,000 `<unk>` tokens, and some of them are even assigned an NER tag. I hope you could give me some tips on solving this issue.

Bosheng2020 commented 2 years ago

Hi, you can filter the generated data with some rules, e.g., remove generated sentences that have invalid NER tag sequences. You can also use a trained NER model to filter the generated data. Please refer to Section 2.4 of this paper: https://aclanthology.org/2021.acl-long.453.pdf. To reduce the number of `<unk>` tokens, you can also adjust the criteria used to replace tokens with `<unk>`.
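
For reference, here is a minimal sketch of the rule-based filtering idea. This is not from the daga codebase; the function names and the `(token, tag)` input format are assumptions. It drops generated sentences that still contain `<unk>` or whose BIO tag sequence is invalid:

```python
# Minimal sketch of a rule-based filter for generated (token, tag) sequences.
# Assumes BIO-style tags (e.g. B-PER, I-PER, O); names are illustrative only.

def is_valid_bio(tags):
    """Check that every I-X tag follows a B-X or I-X of the same entity type."""
    prev = "O"
    for tag in tags:
        if tag.startswith("I-"):
            ent = tag[2:]
            if prev not in (f"B-{ent}", f"I-{ent}"):
                return False
        prev = tag
    return True

def keep_sentence(tokens, tags):
    """Drop sentences that still contain <unk> or have an invalid tag sequence."""
    if "<unk>" in tokens:
        return False
    return is_valid_bio(tags)

def filter_generated(sentences):
    """Filter a list of generated sentences, each a list of (token, tag) pairs."""
    kept = []
    for pairs in sentences:
        tokens = [tok for tok, _ in pairs]
        tags = [tag for _, tag in pairs]
        if keep_sentence(tokens, tags):
            kept.append(pairs)
    return kept
```

A stricter variant of `keep_sentence` could also re-tag the generated sentence with a NER model trained on the original data and keep only sentences whose tags agree, as suggested above.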