Open sonali-sr opened 2 years ago
Are different count results acceptable here, Professor?
Yep, differences are fine since they depend on preprocessing etc. Thanks!
UPDATE: Resolved in office hours.
I initially got this to work, but in double-checking my code I realized I had accidentally put the same string in every row of my processed_text. Once I fixed that error (and got an accurate dictionary matrix), this no longer works. Any tips?
2.2 Create a document-term matrix from the preprocessed press releases and explore top words (5 points)
A. Use the create_dtm function I provide (alternately, feel free to write your own!) and create a document-term matrix using the preprocessed press releases; make sure metadata contains the following columns: id, the compound sentiment column you added, and the topics_clean column.
B. Print the top 10 words for press releases with compound sentiment in the top 5% (so the most positive sentiment)
C. Print the top 10 words for press releases with compound sentiment in the bottom 5% (so the most negative sentiment). Hint: for these, remember the pandas quantile function from pset two.
D. Print the top 10 words for press releases in each of the three topics_clean
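Steps B through D all reduce to the same operation: subset the rows of the document-term matrix, sum each term column, and keep the largest sums. The sketch below illustrates that pattern on a toy table; it is not the course solution. It assumes the matrix is a pandas DataFrame with the metadata columns sitting alongside one count column per term, and that the sentiment column is named compound. The actual layout returned by create_dtm and the actual column names may differ, and the toy counts are made up.

```python
import pandas as pd


def get_topwords(dtm_subset, term_cols, n=10):
    """Return the n most frequent terms in a subset of a document-term matrix."""
    # Sum each term column over the rows in the subset, then keep the n largest.
    return dtm_subset[term_cols].sum(axis=0).nlargest(n)


# Toy document-term matrix (assumed layout): metadata columns plus term counts.
dtm = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "compound": [0.9, -0.8, 0.1, 0.5],   # assumed name of the sentiment column
    "topics_clean": ["a", "b", "a", "b"],
    "fraud": [0, 3, 1, 0],
    "award": [4, 0, 0, 1],
    "court": [1, 2, 2, 0],
})
meta_cols = ["id", "compound", "topics_clean"]
term_cols = [c for c in dtm.columns if c not in meta_cols]

# B: rows at or above the 95th percentile of compound sentiment.
top_cut = dtm["compound"].quantile(0.95)
top_pos = get_topwords(dtm[dtm["compound"] >= top_cut], term_cols, n=2)

# C: rows at or below the 5th percentile of compound sentiment.
bottom_cut = dtm["compound"].quantile(0.05)
top_neg = get_topwords(dtm[dtm["compound"] <= bottom_cut], term_cols, n=2)

# D: one call per topic, reusing the same helper.
by_topic = {
    topic: get_topwords(group, term_cols, n=2)
    for topic, group in dtm.groupby("topics_clean")
}

print(top_pos, top_neg, by_topic, sep="\n\n")
```

Passing the list of term columns explicitly keeps the metadata columns out of the sums; if the same string ends up in every row of the processed text, every subset will return identical top words, which is a quick sanity check on the input.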
For steps B - D, to receive full credit, write a function get_topwords that helps you avoid duplicated code when you find top words for the different subsets of the data. There are different ways to structure it, but one way is to feed it subsetted data (so data subsetted to one topic etc.) and for it to get the top words for that subset.
Resources: