Python module to do simple collocation analysis of a corpus.
Since we have implemented some tokenizer from nltk.tokenize
, one needs to install the corresponding packages first.
import nltk
nltk.download('punkt')
You can use PyCollocation
as via CLI. Example:
python3 analysis.py FOLDER SEARCH_TERM L_WINDOW R_WINDOW STATISTIC DOC_TYPE OUTPUT_FORMAT
python3 analysis.py /corpora "test" 3 3 freq folder print
The relative path to the folder including the files to be analyzed.
The search term. Can be a string or a regular expression.
Window size left of the search term.
Window size right of the search term.
The statistical measure for the analysis. Currently implemented:
The document type. Can be:
folder
)single
)iterable
)Can be:
print
)results.csv
(csv
)