rebeccajohnson88 / PPOL564_slides_activities

Repo for Georgetown McCourt's School of Public Policy's Data Science I (PPOL 564)
Creative Commons Zero v1.0 Universal
9 stars 13 forks

2.2 #57

Open sonali-sr opened 1 year ago

sonali-sr commented 1 year ago

2.2 Create a document-term matrix from the preprocessed press releases and explore top words (5 points)

A. Use the create_dtm function I provide (alternatively, feel free to write your own!) to create a document-term matrix from the preprocessed press releases; make sure metadata contains the following columns: id, the compound sentiment column you added, and the topics_clean column.

B. Print the top 10 words for press releases with compound sentiment in the top 5% (so the most positive sentiment)

[Image: 2.2B]

C. Print the top 10 words for press releases with compound sentiment in the bottom 5% (so the most negative sentiment). Hint: for these, remember the pandas quantile function from pset two.

[Image: 2.2C]

D. Print the top 10 words for press releases in each of the three topics_clean values.

[Image: 2.2D]

For steps B-D, to receive full credit, write a function get_topwords that helps you avoid duplicated code when you find top words for the different subsets of the data. There are different ways to structure it, but one way is to feed it subsetted data (e.g., data subsetted to one topic) and have it return the top words for that subset.
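One way the get_topwords idea for steps B-D might look is sketched below. This is only an illustration under stated assumptions, not the course solution: it assumes the document-term matrix is a pandas DataFrame whose metadata columns (id, compound, topics_clean) sit alongside one count column per term, which may not match what create_dtm actually returns. The toy data, the topic labels topic_a/topic_b/topic_c, and the metadata_cols parameter are all made up for the example.

```python
import pandas as pd

# Hypothetical toy stand-in for a document-term matrix: three metadata
# columns plus one count column per term. Real output will differ.
dtm = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "compound": [0.9, -0.8, 0.1, 0.95],
    "topics_clean": ["topic_a", "topic_b", "topic_a", "topic_c"],
    "victim": [0, 3, 1, 0],
    "rights": [2, 0, 4, 1],
    "settlement": [3, 0, 1, 5],
})

METADATA_COLS = ["id", "compound", "topics_clean"]


def get_topwords(dtm_subset, n=10, metadata_cols=METADATA_COLS):
    """Return the n most frequent terms in a subset of the DTM as a Series."""
    # Drop metadata so only term-count columns remain, then sum over documents.
    term_counts = dtm_subset.drop(columns=list(metadata_cols)).sum(axis=0)
    return term_counts.sort_values(ascending=False, kind="mergesort").head(n)


# B and C: subset on the 95th and 5th percentiles of compound sentiment.
top_cut = dtm["compound"].quantile(0.95)
bottom_cut = dtm["compound"].quantile(0.05)
most_positive = dtm[dtm["compound"] >= top_cut]
most_negative = dtm[dtm["compound"] <= bottom_cut]

print(get_topwords(most_positive))
print(get_topwords(most_negative))

# D: one call per topic, reusing the same function.
for topic, subset in dtm.groupby("topics_clean"):
    print(topic, get_topwords(subset).to_dict())
```

Because the function only sees whatever rows it is handed, the subsetting logic (quantile cutoffs, topic groups) stays outside it, which is what keeps B, C, and D from repeating the counting code.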


jswsean commented 1 year ago

Are different count results acceptable here, Professor?

rebeccajohnson88 commented 1 year ago

> Are different count results acceptable here, Professor?

Yep, differences are fine since they depend on preprocessing, etc. Thanks!
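To see why counts can differ with preprocessing, here is a small standard-library sketch. The two sentences and the one-word stopword set are invented for illustration and are not taken from the press release data; the top_words helper is likewise hypothetical.

```python
from collections import Counter

# Made-up example documents, not from the assignment data.
docs = [
    "The department announced the settlement",
    "The settlement resolves the claims",
]
STOPWORDS = {"the"}  # hypothetical stopword list


def top_words(docs, n, stopwords=frozenset()):
    """Count lowercase whitespace tokens, skipping any in stopwords."""
    counts = Counter(
        tok
        for doc in docs
        for tok in doc.lower().split()
        if tok not in stopwords
    )
    return counts.most_common(n)


print(top_words(docs, 1))             # no stopword removal
print(top_words(docs, 1, STOPWORDS))  # with stopword removal
```

The same documents give a different top word depending on whether stopwords are removed, so two people with different preprocessing choices can reasonably end up with different counts.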

Mag-Sul commented 1 year ago

UPDATE: Resolved in office hours.

I initially got this to work, but in double-checking my code, I realized I accidentally had the same string in every row of my processed_text. Once I fixed that error (and got an accurate dictionary matrix), this no longer works. Any tips?

[Image: Screen Shot 2022-11-06 at 10 49 52 PM]
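For the kind of bug described above (the same string ending up in every row of processed_text), a quick sanity check before building the matrix could look like the sketch below. The helper name check_distinct_texts is hypothetical, and it takes a plain list of strings; with a pandas column, something like list(df["processed_text"]) would be the assumed input.

```python
def check_distinct_texts(texts):
    """Return the number of distinct strings; raise if all rows are identical."""
    n_distinct = len(set(texts))
    if len(texts) > 1 and n_distinct == 1:
        raise ValueError("every row of processed_text holds the same string")
    return n_distinct


# Example with made-up rows: two of the three strings are distinct.
print(check_distinct_texts(["press release one", "press release two", "press release one"]))
```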