uqqu / frequency_analysis

package for symbol/word and their bigrams frequency analysis
1 stars 0 forks source link
frequency frequency-analysis nlp python

# Frequency analysis

Python package for symbol/word and their bigrams frequency analysis with excel output.
What values can be counted: quantity, quantity in the first position, quantity in the last position, average position.
For which data values can be counted: symbols, symbol bigrams, words, word bigrams.
Additional possible data: ye-yo words table for Russian language (in excel output it can be cross-referenced with words quantity).

Usage

  1. pip install frequency-analysis;
  2. Download data set of your choice;
  3. Call Analysis class with context manager (take a look at the optional arguments);
  4. Parse your data set to word list (one sentence in list for properly word position counting) and send it to one of three methods of Analysis;
  5. Call Result class with context manager (with optional name argument);
  6. Call one or several of Result methods to create excel sheet(s) with appropriate data.

Methods and arguments

Analysis class arguments

All arguments are optional

Analysis class methods

count_symbols(word_list: list, [pos: bool, bigrams: bool])

Method for counting symbol and symbol_bigram frequency.
Counted values: quantity, quantity in the first position, quantity in the last position, average position in word.
Average position counted only with argument pos as True (default False).
Position for symbols, which matched with word_pattern counted as for "clear" word, for other – as for "raw".
Example: in single word "–Yes!" with default word_pattern positions will be counted as (– 1), (Y 1), (e 2), (s 3), (! 5).
Bigrams counting can be disabled with argument bigram as False (default True).

count_words(word_list: list, [pos: bool, bigrams: bool])

Method for counting word and word_bigrams frequency.
Counted values: quantity, quantity in the first position, quantity in the last position, average position in sentence.
Average position counted only with argument pos as True (default False).
Bigrams counting can be disabled with argument bigram as False (default True).

count_all(word_list: list, [pos: bool, symbol_bigrams: bool, word_bigrams: bool])

Combined call of previous two methods.

Result class arguments

The only argument is optional

Result class methods

First 6 methods can be called all it once with treat() method

Many methods accept arguments limit, chart_limit, min_quantity and ignore_case.

sheet_stats()

Main result info – number of unique entries, total count and average position (if exists) for each data type.

sheet_top_symbols([limit, chart_limit, min_quantity])

Top list of all analyzed symbols sorted by quantity. The next to it is also located the same one list, but with ignore-case. There is no need to create separate sheet, just use column of your choice.

sheet_top_symbol_bigrams([limit, chart_limit, min_quantity])

Top list of symbol bigrams sorted by quantity with additional case insensitive top-list.

sheet_top_words([limit, chart_limit, min_quantity])

Top list of analyzed words sorted by quantity. Word counting is always case insensitive, on the Analyze stage.

sheet_top_word_bigrams([limit, chart_limit, min_quantity])

Top list of analyzed word bigrams sorted by quantity.

sheet_all_symbol_bigrams([min_quantity, ignore_case])

2D sheet with all bigrams quantity. min_quantity argument works here for sum of row/column values instead of each separated bigram.

treat([limits: tuple(four int), chart_limits: tuple(four int), min_quantities: tuple(five int)])

Single call of all Result methods above. Order of the tuple arguments are the same as order of description above.
Please note – the last one (value for sheet_all_symbol_bigrams()) there is only in the min_quantities argument.
Default values as elsewhere: limits – (0,)*4; chart_limits – (20,)*4; min_quantities – (1,)*5.

sheet_custom_top_symbols(symbols: str, [chart_limit, name='Custom top symbols'])

Create symbols top-list as sheet_top_symbols(), but only with symbols of your choice.
namekeyword-only

sheet_en_top_symbols([ignore_case])

Create symbols top-list as sheet_top_symbols(), but only with base Latin symbols.

sheet_ru_top_symbols([ignore_case])

Create symbols top-list as sheet_top_symbols(), but only with Russian Cyrillic symbols.

sheet_custom_symbol_bigrams(symbols: str, [ignore_case, name='Custom symbol bigrams'])

Create symbol bigrmas 2D sheet as sheet_all_symbol_bigrams(), but only with symbols of your choice.
Order of symbols on the sheet will be the same as in the input argument.
namekeyword-only

sheet_en_symbol_bigrams([ignore_case])

Create symbol bigrams 2D sheet as sheet_all_symbol_bigrams(), but only with base Latin symbols.

sheet_ru_symbol_bigrams([ignore_case])

Create symbol bigrams 2D sheet as sheet_all_symbol_bigrams(), but only with Russian Cyrillic symbols.

sheet_yo_words([limit, min_quantity])

Create cross-referenced sheet for all counted ye-yo words with their quantity and total misspells counter. Works only with analysis created with yo argument as 1 or 2.

Performed analyses