Find/Create function for text tokenization

AndreyKondakovGW commented 3 years ago

Find or create a function that splits a text string into tokens (words). The function must take a text string and return a list of string tokens. It should not return tokens containing digits or punctuation (you can do this with regular expressions). The function must have these optional parameters:

to_lower - when set, all tokens in the returned list are lowercased
min_token_size - the minimum length of a returned token

AndreyKondakovGW commented 3 years ago

Examples of this function (you can write tests for the function based on them):

tokenize(text="Классификация текстов (документов) (англ. Document classification) — задача компьютерной лингвистики") => ["Классификация", "текстов", "документов", "англ", "Document", "classification", "задача", "компьютерной", "лингвистики"]
tokenize(text="Классификация настроения текста из базы ANEW[3], : счастливый - 8.21; хороший - 7.47; скучный - 2.95;", to_lower = true, min_token_size=4) => ["классификация", "настроения", "текста", "базы", "anew", "счастливый", "хороший", "скучный"]
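
A minimal Ruby sketch of what such a function could look like, assuming the keyword-argument names from the description above. It treats every run of Unicode letters as a token, so digits and punctuation act as separators. That is one reading of "should not return tokens containing digits": a mixed string such as "abc123" yields "abc" rather than being dropped entirely, and a stricter variant would need a different pattern.

```ruby
# Hypothetical sketch, not the project's actual implementation.
# Parameter names follow the issue text; the defaults are assumptions.
def tokenize(text, to_lower: false, min_token_size: 1)
  # \p{L}+ matches runs of Unicode letters (Latin, Cyrillic, ...),
  # so digits and punctuation never end up inside a token.
  tokens = text.scan(/\p{L}+/)

  # String#downcase handles non-ASCII letters in Ruby 2.4 and later.
  tokens = tokens.map(&:downcase) if to_lower

  # Drop tokens shorter than the requested minimum length.
  tokens.select { |token| token.length >= min_token_size }
end

p tokenize("Hello, World! 42 times", to_lower: true, min_token_size: 3)
# => ["hello", "world", "times"]
```

Run against the two examples above, this sketch should reproduce the expected lists, which makes them a natural starting point for the tests.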

CyberSniff commented 3 years ago

I'll take it.

AndreyKondakovGW commented 3 years ago

I also suggest adding a lemmatization option that lemmatizes the tokens, either as an independent function or as part of the tokenize function: tokenize(text="Классификация настроения текста из базы", to_lower = true, min_token_size=4, lemmatize = true) => ["классификация", "настроение", "текст", "база"]

Wolwer1nE commented 2 years ago

It looks like there is no ready-to-use lemmatization library written in Ruby, so we will focus on tokenization in this issue and move lemmatization to a separate issue.

mmcs-ruby / sentiment

Find/Create function for text tokenization #5