rebeccajohnson88 / PPOL564_slides_activities

Repo for Georgetown McCourt's School of Public Policy's Data Science I (PPOL 564)
Creative Commons Zero v1.0 Universal

1.1 #49

Open sonali-sr opened 2 years ago

sonali-sr commented 2 years ago

1.1 part of speech tagging (3 points)

A. Preprocess the pharma press release to remove all punctuation/digits (so you can use .isalpha() to subset)

B. With the preprocessed press release from part A, use the part of speech tagger within nltk to tag all the words in that one press release with their part of speech.

C. Using the output from B, extract the adjectives and sort them from most to fewest occurrences. Print a dataframe with the 5 most frequent adjectives and their counts in the pharma release. See here for a list of the adjective tag names within nltk (a quick lookup is also sketched after this list): https://pythonprogramming.net/natural-language-toolkit-nltk-part-speech-tagging/


Resources:
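Not part of the original prompt, but one quick way to see those adjective tag names directly within nltk (this assumes the 'tagsets' help data has been downloaded):

import nltk

nltk.download('tagsets')           # one-time download of the tag documentation
nltk.help.upenn_tagset('JJ.*')     # prints descriptions of JJ, JJR, and JJS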

jswsean commented 2 years ago

Just confirming -- the adjective tags include JJ, JJR and JJS, right?

The following code gets me the same list of adjectives, but a different count for "opioid":

import pandas as pd
from nltk import pos_tag

# restricting to isalpha(), storing in pharma_word
pharma_word = [word for word in pharma.split(" ") if word.isalpha()]

# part of speech tagging
pharma_word_tokens = pos_tag(pharma_word)

# extracting only the adjectives (JJ, JJR, JJS)
pharma_word_adj = [tup[0] for tup in pharma_word_tokens if tup[1] in ["JJ", "JJR", "JJS"]]

# counting the occurrences of each adjective, storing them in the adj_count dataframe
adj_count = {adj: pharma_word_adj.count(adj) for adj in pharma_word_adj}
adj_count = pd.DataFrame(list(adj_count.items()), columns=['adj', 'count'])

adj_count.sort_values("count", ascending=False).head(n=5)

(screenshot of the resulting five most frequent adjectives and their counts)
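As a possible simplification (not part of the original comment): collections.Counter tallies the adjectives in one pass instead of calling .count() once per word. This sketch assumes pharma_word_adj from the snippet above.

from collections import Counter
import pandas as pd

# Counter counts each adjective once over the list; .most_common(5) returns the top five
adj_count = pd.DataFrame(Counter(pharma_word_adj).most_common(5),
                         columns=['adj', 'count'])
adj_count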

jswsean commented 2 years ago

Sorry, just an update--I think I should've used word_tokenize() instead of manually splitting the string as in .split(" ").

Not sure why, but there's a minor output difference with the split command.

rebeccajohnson88 commented 2 years ago

> Sorry, just an update--I think I should've used word_tokenize() instead of manually splitting the string as in .split(" ").
>
> Not sure why, but there's a minor output difference with the split command.

Yep, to respond: that's right!
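For anyone curious about where the difference comes from, here is a small illustration with a made-up sentence (word_tokenize may need nltk.download('punkt') first). With .split(" "), trailing punctuation stays attached to the word, so tokens like "addictive," fail .isalpha() and get dropped; word_tokenize() splits the punctuation into its own tokens, so the word survives the filter.

from nltk import word_tokenize

text = "Prescription opioids are highly addictive, experts warn."  # hypothetical example sentence

# split(" ") keeps punctuation glued to words: 'addictive,' and 'warn.' fail .isalpha()
[w for w in text.split(" ") if w.isalpha()]
# -> ['Prescription', 'opioids', 'are', 'highly', 'experts']

# word_tokenize() separates the punctuation, so 'addictive' and 'warn' pass the filter
[w for w in word_tokenize(text) if w.isalpha()]
# -> ['Prescription', 'opioids', 'are', 'highly', 'addictive', 'experts', 'warn']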

xinruizhao commented 2 years ago

Hi, for clarification, we are only removing punctuation and digits in this step, but not removing stopwords, right?

rebeccajohnson88 commented 2 years ago

> Hi, for clarification, we are only removing punctuation and digits in this step, but not removing stopwords, right?

Yep, correct, since stopword removal can affect the tagging.
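To illustrate the point with a made-up sentence (the exact tags depend on the tagger): pos_tag uses neighboring words as context, so removing stopwords before tagging can change the tag a word receives.

from nltk import pos_tag, word_tokenize

sent = "The trial results were very promising."   # hypothetical example sentence

# tagging the full sentence: 'promising' follows 'very', which points toward an adjective reading
pos_tag(word_tokenize(sent))

# tagging after stopword removal: with 'were' and 'very' gone, the tagger has less
# context and may assign 'promising' a different tag (e.g. a verb form)
pos_tag(['trial', 'results', 'promising'])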