obdurodon / dh_course

Digital Humanities course site
GNU General Public License v3.0

Adding Additional Stopwords: MALLET #474

Closed: nmcdowell00 closed this issue 4 years ago

nmcdowell00 commented 4 years ago

I am currently using MALLET to analyze a collection of letters Vincent van Gogh sent to his brother. I have successfully generated results, but my topics keep including words such as "i'm", "it's", "he's", "i've", "there's", "i'd", and "that's". I do not find these words helpful for analyzing the letters, so I am trying to exclude them by creating a list of additional stopwords. To do this I created a txt file in oxygen and entered all of my desired words (each word on a new line), then saved it as addConj.txt to the stop-lists sub-directory in the MALLET directory on my desktop. I then entered the following into the command line (Terminal):

./bin/mallet import-dir --input vgCorp/ --output t13.mallet --keep-sequence --remove-stopwords --extra-stopwords stoplists/addConj.txt

followed by:

./bin/mallet train-topics --input t13.mallet --num-topics 5 --num-iterations 100 --output-state t13_5i100.gz --output-topic-keys t13_5ai100.txt --output-doc-topics t13_5bi100.txt

I am a Mac user.

At first this seemed to work, but when I went over the generated topics I realized that all of the words I wished to exclude still appeared. To try to fix this I moved the file containing the stopwords to the main MALLET directory and ran:

./bin/mallet import-dir --input vgCorp/ --output t14.mallet --keep-sequence --remove-stopwords --extra-stopwords addConj.txt

followed by:

./bin/mallet train-topics --input t14.mallet --num-topics 5 --num-iterations 100 --output-state t14_5i100.gz --output-topic-keys t14_5ai100.txt --output-doc-topics t14_5bi100.txt

The topics generated from this still did not exclude the desired words. Excluding these words is not essential to my analysis; I believe I can generate topics that provide a meaningful analysis of the corpus even with these words included. If I could exclude them, though, my topics would be more concise, which I believe would improve my ability to analyze the results. Any information on how to exclude certain words would be very much appreciated.

Additional question: Is there a number of iterations beyond which increasing it becomes futile? 100? 500? 10,000?

djbpitt commented 4 years ago

@nmcdowell00

Stopwords: Try specifying a full path to the extra stopword list, instead of a relative path. See https://github.com/mimno/Mallet/issues/48 for the suggestion.

Iterations: The number of iterations is hit or miss, sort of like the number of topics. The following is a bit of a guess, but try 500, 600, 700, 800, 900, 1000. If the results get steadily better, keep going. If they don’t, pick the best one and stop there.
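One way to try these values without retyping the command is to generate one train-topics command per iteration count and review them before running anything. The following is only a sketch: it reuses the flags from the commands earlier in this thread, assumes the imported file is t14.mallet, and invents the output file names (keys_i500.txt and so on) and the script name train_runs.sh for illustration.

```shell
#!/bin/sh
# Sketch: write one train-topics command per iteration count into a script
# for review. Nothing is trained until you run the generated script yourself.
# Assumptions (not from MALLET): the output names and RUNS_FILE are made up.
MALLET="./bin/mallet"        # run from the MALLET directory, as above
INPUT="t14.mallet"           # the file produced by import-dir
RUNS_FILE="train_runs.sh"    # hypothetical name for the generated script

: > "$RUNS_FILE"             # start with an empty file
for n in 500 600 700 800 900 1000; do
  echo "$MALLET train-topics --input $INPUT --num-topics 5" \
       "--num-iterations $n" \
       "--output-topic-keys keys_i$n.txt" \
       "--output-doc-topics doctopics_i$n.txt" >> "$RUNS_FILE"
done

cat "$RUNS_FILE"             # show the commands that would be run
```

Running `sh train_runs.sh` from the MALLET directory would then execute the six runs in order, after which the keys_i*.txt files can be compared side by side.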

nmcdowell00 commented 4 years ago

I tried these variations and none of them managed to exclude the desired words. I had the text file in the proper place for each command. Is there something I am still doing wrong?

./bin/mallet import-dir --input vgCorp/ --output t_17.mallet --keep-sequence --remove-stopwords --extra-stopwords /Users/natemcdowell/Desktop/mallet-2.0.8/stoplists/genConj.txt/

./bin/mallet import-dir --input vgCorp/ --output t_16.mallet --keep-sequence --remove-stopwords --extra-stopwords /Users/natemcdowell/Desktop/mallet-2.0.8/stoplists/genConj.txt

./bin/mallet import-dir --input vgCorp/ --output t_15.mallet --keep-sequence --remove-stopwords --extra-stopwords /Users/natemcdowell/Desktop/mallet-2.0.8/genConj.txt

djbpitt commented 4 years ago

@nmcdowell00

Putting a slash after the filename (the first version) shouldn't work, because a trailing slash tells the system to treat the name as a directory rather than a file.

You write (first message) that "I created a txt file in oxygen and entered all of my desired words (each word on a new line), then saved it as addConj.txt to the stop-lists sub-directory in the MALLET directory on my desktop." You write "stop-lists sub-directory" there, but the examples you provide (second message) write "stoplists" (first two versions) or don't specify a subdirectory at all (third version). Check the full directory path carefully and be sure that you copy it exactly. You can check whether the path is correct by typing the following at the command prompt:

cat /Users/natemcdowell/Desktop/mallet-2.0.8/stop-lists/genConj.txt

(assuming the genConj.txt file is in a stop-lists subdirectory under the MALLET directory on your Desktop). If the path is correct, this will print the contents of the file to your screen. If you get an error message or no output, the path to the file is not correct.
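If you would rather get a yes-or-no answer than scroll through the file, the same check can be sketched as a small shell function. This is a minimal sketch, not part of MALLET: the function name check_stoplist and the STOPLIST variable are made up here, and the default path is just the example path from above.

```shell
#!/bin/sh
# Sketch: report whether a stopword list exists at the given path.
# STOPLIST and check_stoplist are hypothetical names; the default path is
# only the example from this thread, so substitute your own.
STOPLIST="${STOPLIST:-$HOME/Desktop/mallet-2.0.8/stop-lists/genConj.txt}"

check_stoplist() {
  # -f is true only for an existing regular file, -r only if it is readable
  if [ -f "$1" ] && [ -r "$1" ]; then
    echo "OK: $1 ($(wc -l < "$1" | tr -d ' ') lines)"
  else
    echo "NOT FOUND: $1" >&2
    return 1
  fi
}

check_stoplist "$STOPLIST" || echo "Fix the path before running import-dir."
```

A path with a slash after the filename, or with "stoplists" where the directory is really "stop-lists", fails this test in the same way that `cat` would report an error.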

djbpitt commented 4 years ago

@nmcdowell00 For the record: we identified the issue as the use of the curly apostrophe in the documents, while the stopword list had straight apostrophes. When we edited the stopword list to use curly apostrophes instead, it worked as expected. Closing this issue.
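For anyone who hits the same mismatch, the edit can be sketched in the shell instead of done by hand. This is a minimal sketch under a few assumptions: the stopword list and the corpus are saved as UTF-8, the curly apostrophe in question is U+2019 (’), and the helper names to_curly and count_curly and the sample file names are invented for illustration.

```shell
#!/bin/sh
# Sketch: convert straight apostrophes in a stopword list to curly ones
# (U+2019) so the entries match a corpus that uses curly apostrophes.
# Assumes UTF-8 files; to_curly and count_curly are hypothetical helpers.

to_curly() {
  # $1 = existing stopword list, $2 = new list with curly apostrophes
  sed "s/'/’/g" "$1" > "$2"
}

count_curly() {
  # Print the number of lines in $1 that contain a curly apostrophe.
  # grep exits non-zero when the count is 0, so ignore its exit status.
  grep -c "’" "$1" || true
}

# Small demonstration on a throwaway sample list
demo_dir=$(mktemp -d)
printf "i'm\nit's\nhe's\n" > "$demo_dir/addConj.txt"
to_curly "$demo_dir/addConj.txt" "$demo_dir/addConj_curly.txt"
cat "$demo_dir/addConj_curly.txt"
```

Running count_curly on a few of the corpus files first shows whether they use curly apostrophes at all; the converted list is then the one to pass to --extra-stopwords.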