mmcs-ruby / sentiment

MIT License

Determining word frequencies #4

Open denis46-g opened 3 years ago

denis46-g commented 3 years ago

1) Create a list of word frequencies for detecting rare words in a text.
2) Create a function that deletes rare and short words, which are not needed for detecting emotions.

AndreyKondakovGW commented 3 years ago

For text preprocessing we need the following:

  1. Create a function for stemming words (reducing them to their root form). The function must take a tokenized text (a list of words) and return a list of root forms.
  2. Create a function for counting word frequency. The function must take a corpus, i.e. a list of tokenized texts (a list of lists of words), and return a dictionary in which each word is matched with its frequency in the corpus.
  3. Create a function for deleting words with very high and very low frequency. The function should take the dictionary from (2), the corpus of texts, and two parameters:
  4. min_freq: the lower bound on word frequency in the corpus (words with a lower frequency must be deleted).
  5. max_freq: the upper bound, given as a fraction of the most frequent words (for example, if max_freq = 0.9, we should delete the 10% most frequent words in our corpus).
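
Items 2-5 could be sketched roughly as follows. This is only an illustration, not code from the gem: the names `word_frequencies` and `filter_by_frequency` are hypothetical, `min_freq` is assumed to be an absolute occurrence count, and `max_freq` is assumed to be the fraction of distinct words to keep, counted from the least frequent. Stemming (item 1) is left out, and `Array#tally` requires Ruby 2.7 or newer.

```ruby
require "set"

# Hypothetical helper: counts how often each word occurs in the corpus.
# corpus is an array of tokenized texts (an array of arrays of strings).
# Returns a Hash mapping each word to its number of occurrences.
def word_frequencies(corpus)
  corpus.flatten.tally
end

# Hypothetical helper: removes too-rare and too-frequent words from the corpus.
# min_freq: words occurring fewer than min_freq times are dropped (assumption:
#           an absolute count).
# max_freq: fraction of distinct words to keep, counted from the least frequent
#           (assumption); 0.9 drops the top 10% most frequent distinct words.
def filter_by_frequency(corpus, frequencies, min_freq:, max_freq:)
  # Distinct words ordered from least to most frequent; ties are broken
  # alphabetically so the result is deterministic.
  ranked = frequencies.sort_by { |word, count| [count, word] }.map(&:first)
  keep_count = (ranked.size * max_freq).floor
  too_frequent = ranked.drop(keep_count)
  too_rare = frequencies.select { |_word, count| count < min_freq }.keys
  rejected = Set.new(too_frequent + too_rare)

  # Return a new corpus; the input arrays are not modified.
  corpus.map { |text| text.reject { |word| rejected.include?(word) } }
end

corpus = [%w[the cat sat on the mat], %w[the dog sat], %w[the end]]
frequencies = word_frequencies(corpus)
filtered = filter_by_frequency(corpus, frequencies, min_freq: 2, max_freq: 0.9)
# filtered => [["sat"], ["sat"], []]
```

In this example "the" is removed as the single most frequent word, and every word that occurs only once is removed as too rare. If a relative frequency is preferred for `min_freq`, the count can be divided by the total number of tokens before comparing.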

AndreyKondakovGW commented 3 years ago

I also suggest not implementing these functions until we decide which vectorization function we will use, because in the process we may find ready-made implementations of these functions in other gems.

denis46-g commented 3 years ago

I will take items 2-5 from Andrey's comment. (Goskov Denis)

Wolwer1nE commented 2 years ago

https://codeclimate.com/github/mmcs-ruby/sentiment/issues?category=complexity&engine_name%5B%5D=structure&engine_name%5B%5D=duplication I merged the PR, but there are some code quality issues.