Attachment contains models & corpus for "Jason" (some experimental data for a project of Fritz Breithaupt).
I prepped and trained the model as shown below, with a high-frequency threshold of 50. The prep step reports that "n't" will be excluded, but it still shows up in the top ten words of several topics, e.g. k=20.3 and k=20.6.
[140-182-72-59:~/corpora/Fritz] colin% vsm init Jason
Corpus Name [Default: Jason]
Building corpus from Jason
Detected 0 folders and 102 files in Jason
Constructing directory corpus, each file is a document
100%|##########################################################################|
Saving corpus as Jason/../models/Jason-freq5.npz
Config file Jason.ini exists. Overwrite? [Y/n] Y
Writing configuration file Jason.ini
TIP: Only initalizing corpus object and config file.
Next prepare the corpus using:
vsm prep Jason.ini
Or skip directly to training LDA models using:
vsm train Jason.ini
[140-182-72-59:~/corpora/Fritz] colin% vsp prep Jason --stopword-file Jason_stop
vsp: Command not found.
[140-182-72-59:~/corpora/Fritz] colin% vsm prep Jason --stopword-file Jason_stop
Jason is a directory, using the config file Jason.ini
Stoplist the following languages?
English? [Y/n]
Applying english stopwords
Applying custom stopword file to remove 117 words.
Filtering 20 small words with less than 3 characters.
************************* FILTER HIGH FREQUENCY WORDS *************************
This will remove all words occurring more than N times.
The histogram below shows how many words will be removed
by selecting each maximum frequency threshold.
Rate Top % of corpus # words Rate
53x 10.2% ███ 1 words > 53x
37x 17.4% ██████ 2 words > 37x
31x 23.4% ████████ 3 words > 31x
29x 29.1% ██████████ 4 words > 29x
28x 34.6% ████████████ 5 words > 28x
17x 41.4% ██████████████ 7 words > 17x
12x 51.8% ██████████████████ 11 words > 12x
5x 100.0% ████████████████████████████████████ 45 words > 5x
529 total occurrences 45 words total
Enter the maximum rate: 50
Filter will remove 54 occurrences of these 1 words:
n't
Filter will remove 54 occurrences of these 1 words.
Accept filter? [y/n/[different max number]] y
Filtering 1 high frequency word.
[Archive.zip](https://github.com/inpho/topic-explorer/files/179990/Archive.zip)
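
For reference, here is a minimal sketch of the behavior I would expect from the high-frequency filter described in the prompt above ("remove all words occurring more than N times"). This is not topic-explorer's actual implementation: the function name `filter_high_frequency`, the variable names, and the toy documents are all hypothetical, chosen only so that "n't" occurs 54 times as in the log. The point is the final check: once a word is reported as filtered, it should not appear anywhere in the data the model is trained on.

```python
from collections import Counter


def filter_high_frequency(docs, max_rate):
    """Drop every word whose total count across all docs exceeds max_rate.

    Hypothetical helper for illustration only; not the vsm/topic-explorer API.
    """
    counts = Counter(word for doc in docs for word in doc)
    removed = {word for word, n in counts.items() if n > max_rate}
    filtered = [[word for word in doc if word not in removed] for doc in docs]
    return filtered, removed


# Toy corpus (made up): "n't" occurs 30 + 24 = 54 times in total.
docs = [
    ["n't"] * 30 + ["story", "jason"],
    ["n't"] * 24 + ["story", "friend"],
]

filtered, removed = filter_high_frequency(docs, max_rate=50)

# Any removed word still present after filtering would indicate a leak.
leaked = [word for doc in filtered for word in doc if word in removed]
print("removed:", sorted(removed))
print("leaked:", leaked)
```

If the equivalent check against the trained model's vocabulary turns up "n't", that would suggest the filtered corpus is not the one being used for training, or that the filter is not being saved.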