inpho / topic-explorer

System for building, visualizing, and working with LDA topic models
https://www.hypershelf.org/

High freq threshold not working exactly? #112

Closed: colinallen closed this issue 8 years ago

colinallen commented 8 years ago

Attachment contains models & corpus for "Jason" (some experimental data for a project of Fritz Breithaupt).

Prepped and trained a model as shown below, with a high-frequency threshold of 50. The prep step reported that "n't" would be excluded, but it is showing up in the top ten words of several topics (e.g. k=20.3, k=20.6).
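
As a quick sanity check on the raw counts, here is a minimal sketch (not part of the original report) that tallies "n't" across the source documents. It assumes the 102 plain-text files sit directly in the `Jason` directory and that tokenization behaves like NLTK's Penn Treebank tokenizer, which is roughly what the prep step used at the time (see the last comment below); both the layout and the tokenizer choice are assumptions.

```python
from collections import Counter
from pathlib import Path

from nltk.tokenize import TreebankWordTokenizer

# Assumption: the corpus is the flat directory of plain-text files named "Jason".
corpus_dir = Path("Jason")
tokenizer = TreebankWordTokenizer()
counts = Counter()

for path in corpus_dir.glob("*"):
    if path.is_file():
        text = path.read_text(encoding="utf-8", errors="ignore")
        counts.update(tokenizer.tokenize(text))

# If this comes out above the maximum rate entered during prep (50 below),
# the high-frequency filter should drop "n't" from the corpus.
print("n't occurs", counts["n't"], "times")
```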

```
[140-182-72-59:~/corpora/Fritz] colin% vsm init Jason

Corpus Name [Default: Jason] 
Building corpus from Jason
Detected 0 folders and 102 files in Jason
Constructing directory corpus, each file is a document
100%|##########################################################################|
Saving corpus as Jason/../models/Jason-freq5.npz

Config file Jason.ini exists. Overwrite? [Y/n] Y
Writing configuration file Jason.ini

TIP: Only initalizing corpus object and config file.
     Next prepare the corpus using:
         vsm prep Jason.ini
     Or skip directly to training LDA models using:
         vsm train Jason.ini
[140-182-72-59:~/corpora/Fritz] colin% vsp prep Jason --stopword-file Jason_stop 
vsp: Command not found.
[140-182-72-59:~/corpora/Fritz] colin% vsm prep Jason --stopword-file Jason_stop 
Jason is a directory, using the config file Jason.ini
Stoplist the following languages? 
English? [Y/n] 

Applying english stopwords
Applying custom stopword file to remove 117 words.
Filtering 20 small words with less than 3 characters.

************************* FILTER HIGH FREQUENCY WORDS *************************
    This will remove all words occurring more than N times.
    The histogram below shows how many words will be removed
    by selecting each maximum frequency threshold.

    Rate      Top % of corpus                                 # words     Rate
     53x    10.2% ███                                         1 words >    53x
     37x    17.4% ██████                                      2 words >    37x
     31x    23.4% ████████                                    3 words >    31x
     29x    29.1% ██████████                                  4 words >    29x
     28x    34.6% ████████████                                5 words >    28x
     17x    41.4% ██████████████                              7 words >    17x
     12x    51.8% ██████████████████                         11 words >    12x
      5x   100.0% ████████████████████████████████████       45 words >     5x
                  529 total occurrences                      45 words total

Enter the maximum rate: 50
Filter will remove 54 occurrences of these 1 words:
n't

Filter will remove 54 occurrences of these 1 words. 
Accept filter? [y/n/[different max number]] y
Filtering 1 high frequency word.

```
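
To spell out the rule described in the FILTER HIGH FREQUENCY WORDS step above: every word whose count exceeds the entered maximum rate is slated for removal, so "n't" (54 occurrences per the output) is reported as excluded at a maximum of 50. Below is a minimal sketch of that selection, not the actual topic-explorer code; every count except the 54 for "n't" is a placeholder.

```python
from collections import Counter

# Placeholder vocabulary counts; only the "n't" figure comes from the output above.
counts = Counter({"n't": 54, "word_a": 37, "word_b": 31, "word_c": 12})
max_rate = 50  # the value typed at the "Enter the maximum rate:" prompt

# Select every word occurring more than max_rate times.
high_freq = {word: n for word, n in counts.items() if n > max_rate}

print("Filter will remove {} occurrences of these {} words:".format(
    sum(high_freq.values()), len(high_freq)))
print("\n".join(sorted(high_freq)))
# Filter will remove 54 occurrences of these 1 words:
# n't
```

The histogram shown above is essentially this same selection evaluated at each candidate rate, reporting how many words and what share of the 529 total occurrences would be dropped.
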
[Archive.zip](https://github.com/inpho/topic-explorer/files/179990/Archive.zip)
JaimieMurdock commented 8 years ago

No longer reproducible since we moved away from the Penn Treebank tokenizer.
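
For readers wondering where a standalone "n't" token comes from in the first place: Penn Treebank-style tokenization splits English contractions, so every "don't", "doesn't", "isn't", and so on contributes a separate "n't" token, which is why it accumulates such a high relative frequency. A quick illustration using NLTK's implementation (NLTK is a stand-in here, not necessarily the exact tokenizer the old pipeline called):

```python
from nltk.tokenize import TreebankWordTokenizer

tokens = TreebankWordTokenizer().tokenize("Jason doesn't know, and I don't either.")
print(tokens)
# ['Jason', 'does', "n't", 'know', ',', 'and', 'I', 'do', "n't", 'either', '.']
```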