toriving / KoEDA

Korean Easy Data Augmentation
MIT License



Easy Data Augmentation for Korean

This project re-implements Easy Data Augmentation (EDA) and An Easier Data Augmentation (AEDA), both originally implemented for English, so that they work on Korean text.



This repository is tested on Python 3.7 - 3.9.

KoEDA can be installed using pip as follows:

$ pip install koeda

Quick Start

from koeda import EDA

eda = EDA(
    morpheme_analyzer="Okt", alpha_sr=0.3, alpha_ri=0.3, alpha_rs=0.3, prob_rd=0.3
)

text = "아버지가 방에 들어가신다"  # "Father enters the room"

result = eda(text)
print(result)
# 아버지가 정실에 들어가신다

result = eda(text, p=(0.9, 0.9, 0.9, 0.9), repetition=2)
print(result)
# ['아버지가 객실 아빠 안방 방에 정실 들어가신다', '아버지가 탈의실 방 휴게실 에 안방 탈의실 들어가신다']
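To give a feel for what two of the EDA operations do, here is a minimal standard-library sketch of random swap and random deletion. It assumes plain whitespace tokens rather than morphemes, and the function names `random_swap` and `random_deletion` are hypothetical. It is not KoEDA's implementation, which relies on a morpheme analyzer.

```python
import random
from typing import List


def random_swap(tokens: List[str], n_swaps: int, rng: random.Random) -> List[str]:
    """Swap two randomly chosen positions, n_swaps times."""
    tokens = list(tokens)  # copy so the caller's list is untouched
    if len(tokens) < 2:
        return tokens
    for _ in range(n_swaps):
        i, j = rng.sample(range(len(tokens)), 2)  # two distinct indices
        tokens[i], tokens[j] = tokens[j], tokens[i]
    return tokens


def random_deletion(tokens: List[str], prob: float, rng: random.Random) -> List[str]:
    """Drop each token independently with probability prob, keeping at least one."""
    if len(tokens) <= 1:
        return list(tokens)
    kept = [tok for tok in tokens if rng.random() >= prob]
    # Never return an empty sentence: fall back to one random token.
    return kept if kept else [rng.choice(tokens)]


rng = random.Random(0)  # fixed seed so repeated runs give the same result
tokens = "아버지가 방에 들어가신다".split()
swapped = random_swap(tokens, n_swaps=1, rng=rng)
deleted = random_deletion(tokens, prob=0.3, rng=rng)
print(" ".join(swapped))
print(" ".join(deleted))
```

Synonym replacement and random insertion need a synonym source such as a WordNet, so they are left out of this sketch.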

from koeda import AEDA

aeda = AEDA(
    morpheme_analyzer="Okt", punc_ratio=0.3, punctuations=[".", ",", "!", "?", ";", ":"]
)

text = "어머니가 집을 나가신다"  # "Mother leaves the house"

result = aeda(text)
# 어머니가 ! 집을 , 나가신다

result = aeda(text, p=0.9, repetition=2)
# ['! 어머니가 ! 집 ; 을 ? 나가신다', '. 어머니 ? 가 . 집 , 을 , 나가신다']
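The idea behind AEDA-style augmentation, inserting punctuation marks at random positions, can be sketched with the standard library alone. The sketch below assumes whitespace tokens and one plausible reading of `punc_ratio` (an upper bound on the number of inserted marks, relative to sentence length). The function name `insert_punctuation` is hypothetical, and KoEDA's own behaviour may differ.

```python
import random
from typing import List, Sequence

PUNCTUATIONS = (".", ",", "!", "?", ";", ":")


def insert_punctuation(
    tokens: Sequence[str],
    punc_ratio: float,
    rng: random.Random,
    punctuations: Sequence[str] = PUNCTUATIONS,
) -> List[str]:
    """Insert random punctuation marks at random positions among the tokens."""
    # Assumed rule: between 1 and punc_ratio * len(tokens) marks (at least 1).
    max_marks = max(1, int(punc_ratio * len(tokens)))
    n_marks = rng.randint(1, max_marks)
    result = list(tokens)
    for _ in range(n_marks):
        position = rng.randint(0, len(result))  # may land at either end
        result.insert(position, rng.choice(punctuations))
    return result


rng = random.Random(0)  # fixed seed so repeated runs give the same result
tokens = "어머니가 집을 나가신다".split()
augmented = insert_punctuation(tokens, punc_ratio=0.3, rng=rng)
print(" ".join(augmented))
```

Because the original tokens are never changed or reordered, stripping the inserted marks always recovers the input sentence.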


There are two ways to import an augmenter.

The first is to use its full name.

from koeda import EasyDataAugmentation

The second is to use its abbreviation.

from koeda import EDA


result = augmenter(
            data: Union[List[str], str],
            p: List[float] = None,  # Default = (0.1, 0.1, 0.1, 0.1)
            repetition: int = 1
         )

augmenter = AEDA(
              morpheme_analyzer: str = None,  # Default = "Okt"
              punc_ratio: float = 0.3,
              punctuations: List[str] = None  # Default = ('.', ',', '!', '?', ';', ':')
            )

result = augmenter(
            data: Union[List[str], str],
            p: float = None,  # Default = 0.3
            repetition: int = 1
         )

augmenter = RI(
              morpheme_analyzer: str = None,
              stopword: bool = False
            )

augmenter = SR(
              morpheme_analyzer: str = None,
              stopword: bool = False
            )

augmenter = RS(
              morpheme_analyzer: str = None,
            )

result = augmenter(
            data: Union[List[str], str],
            p: float = 0.1,
            repetition: int = 1
         )
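Since every augmenter is called the same way, a small helper can apply any of them across a corpus. The sketch below is an illustration only: `augment_corpus` and `fake_augmenter` are hypothetical names, and the stand-in augmenter exists so the example runs without KoEDA installed. The helper calls the augmenter one string at a time, because this README does not show the return shape for list input.

```python
from typing import Callable, List


def augment_corpus(augmenter: Callable, texts: List[str], repetition: int = 1) -> List[str]:
    """Return each original text followed by its augmented variants."""
    augmented: List[str] = []
    for text in texts:
        augmented.append(text)
        result = augmenter(text, repetition=repetition)
        # Quick Start shows a plain string for one repetition and a list otherwise.
        augmented.extend([result] if isinstance(result, str) else list(result))
    return augmented


# Hypothetical stand-in that mimics the call shape of a KoEDA augmenter.
def fake_augmenter(text: str, repetition: int = 1):
    variants = [f"{text} #{i}" for i in range(repetition)]
    return variants[0] if repetition == 1 else variants


corpus = augment_corpus(fake_augmenter, ["a", "b"], repetition=2)
print(corpus)
```

A real augmenter such as one built with `RS(...)` could be passed in place of `fake_augmenter`, assuming it accepts the `repetition` keyword as listed above.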

## Reference
Easy Data Augmentation Paper  
Easy Data Augmentation Repository  
An Easier Data Augmentation Paper  
An Easier Data Augmentation Repository  
Korean WordNet