2.2 - Githubissues

sonali-sr commented 2 years ago

2.2 Create a document-term matrix from the preprocessed press releases and explore top words (5 points)

A. Use the create_dtm function I provide (alternatively, feel free to write your own!) to create a document-term matrix from the preprocessed press releases; make sure the metadata contains the following columns: id, the compound sentiment column you added, and the topics_clean column.

B. Print the top 10 words for press releases with compound sentiment in the top 5% (so the most positive sentiment)

C. Print the top 10 words for press releases with compound sentiment in the bottom 5% (so the most negative sentiment). Hint: for these, remember the pandas quantile function from pset two.

D. Print the top 10 words for press releases in each of the three topics_clean categories

For steps B - D, to receive full credit, write a function get_topwords that helps you avoid duplicated code when you find top words for the different subsets of the data. There are different ways to structure it but one way is to feed it subsetted data (so data subsetted to one topic etc.) and for it to get the top words for that subset.
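One possible shape for get_topwords is sketched below. This is only an illustration, not the expected solution: the toy DTM, the "metadata_" column prefix, and the parameter names n and metadata_prefix are assumptions, so adjust them to match whatever your own DTM looks like.

```python
import pandas as pd


def get_topwords(dtm_subset, n=10, metadata_prefix="metadata_"):
    """Return the n terms with the highest total count in a subset of DTM rows."""
    # Drop metadata columns so that only term counts are summed
    term_cols = [c for c in dtm_subset.columns if not c.startswith(metadata_prefix)]
    return dtm_subset[term_cols].sum(axis=0).sort_values(ascending=False).head(n)


# Toy DTM (assumed layout: metadata columns prefixed "metadata_", then term counts)
dtm = pd.DataFrame({
    "metadata_id": [1, 2, 3, 4],
    "metadata_compound": [0.95, -0.90, 0.10, 0.20],
    "metadata_topics_clean": ["topic_a", "topic_b", "topic_a", "topic_c"],
    "justice": [3, 0, 1, 0],
    "victim": [0, 4, 0, 1],
    "award": [2, 0, 0, 0],
})

# B and C: sentiment cutoffs from the pandas quantile function
hi_cut = dtm["metadata_compound"].quantile(0.95)
lo_cut = dtm["metadata_compound"].quantile(0.05)
top_pos = get_topwords(dtm[dtm["metadata_compound"] >= hi_cut])
top_neg = get_topwords(dtm[dtm["metadata_compound"] <= lo_cut])

# D: one call per topic, reusing the same function
top_by_topic = {
    topic: get_topwords(group)
    for topic, group in dtm.groupby("metadata_topics_clean")
}

print(top_pos, top_neg, top_by_topic, sep="\n\n")
```

Feeding the function an already-subsetted DTM keeps the filtering logic (quantile cutoffs, topic groups) outside the function, so the same few lines serve B, C, and D.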

Resources:

This notebook contains an example of applying the create_dtm function: https://github.com/rebeccajohnson88/PPOL564_slides_activities/blob/main/activities/fall_22/solutions/09_textasdata_partII_topicmodeling_solution.ipynb
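If you would rather write your own function for step A, a rough pandas-only sketch might look like the following. The name build_dtm, the toy data, the column names compound and processed_text, and the "metadata_" prefix are all assumptions for illustration; this is not the provided create_dtm, and its tokenization is just a whitespace split.

```python
from collections import Counter

import pandas as pd


def build_dtm(df, text_col, metadata_cols):
    """Hypothetical stand-in for create_dtm: one row per document, one column per term."""
    # Count whitespace-separated tokens in each preprocessed document
    counts = [dict(Counter(text.split())) for text in df[text_col]]
    dtm = pd.DataFrame(counts, index=df.index).fillna(0).astype(int)
    # Prefix metadata columns so they cannot collide with term columns
    meta = df[metadata_cols].add_prefix("metadata_")
    return pd.concat([meta, dtm], axis=1)


# Toy data; the real press releases come from the earlier preprocessing step
releases = pd.DataFrame({
    "id": [1, 2, 3],
    "compound": [0.9, -0.8, 0.1],
    "topics_clean": ["topic_a", "topic_b", "topic_c"],
    "processed_text": ["court rights court", "crime hate victim", "child safe child"],
})

dtm = build_dtm(releases, "processed_text", ["id", "compound", "topics_clean"])
print(dtm)
```

Because the counts depend entirely on how the text was preprocessed and tokenized, results from a hand-rolled version like this will not necessarily match those from create_dtm.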

jswsean commented 2 years ago

Are different count results acceptable here, Professor?

rebeccajohnson88 commented 2 years ago

> Are different count results acceptable here, Professor?

Yep, differences are fine since they depend on preprocessing etc. Thanks!

Mag-Sul commented 2 years ago

UPDATE: Resolved in office hours.

I initially got this to work, but in double-checking my code I realized I had accidentally put the same string in every row of my processed_text. Once I fixed that error (and got an accurate dictionary matrix), this no longer works. Any tips?

rebeccajohnson88 / PPOL564_slides_activities

2.2 #57
