rebeccajohnson88 / PPOL564_slides_activities

Repo for Georgetown McCourt School of Public Policy's Data Science I (PPOL 564)
Creative Commons Zero v1.0 Universal

1.1 #49

Open sonali-sr opened 1 year ago

sonali-sr commented 1 year ago

1.1 part of speech tagging (3 points)

A. Preprocess the pharma press release to remove all punctuation / digits (so you can use .isalpha() to subset)

B. With the preprocessed press release from part A, use the part of speech tagger within nltk to tag all the words in that one press release with their part of speech.

C. Using the output from B, extract the adjectives and sort them from most occurrences to fewest occurrences. Print a dataframe with the 5 most frequent adjectives and their counts in the pharma release (a setup sketch follows below). See here for a list of the adjective tags within nltk: https://pythonprogramming.net/natural-language-toolkit-nltk-part-speech-tagging/
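
For orientation, here is a minimal sketch of the setup for parts A and B. It assumes the raw press-release text is already loaded into a string called pharma (matching the variable name in the code further down) and that the nltk tokenizer/tagger data have not been downloaded yet:

# one-time imports and nltk data downloads (tokenizer + POS tagger models)
import nltk
from nltk import word_tokenize, pos_tag
nltk.download("punkt")
nltk.download("averaged_perceptron_tagger")

# Part A: tokenize and keep only alphabetic tokens (drops punctuation and digits)
pharma_tokens = [tok for tok in word_tokenize(pharma) if tok.isalpha()]

# Part B: tag each remaining token with its part of speech
pharma_tagged = pos_tag(pharma_tokens)
pharma_tagged[:10]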


Resources:

jswsean commented 1 year ago

Just confirming -- the adjective tags include JJ, JJR and JJS, right?

The following code gets me the same list of adjectives, but with a different count for "opioid":

# imports this snippet relies on
import pandas as pd
from nltk import pos_tag

# restricting to isalpha(), storing in pharma_word
pharma_word = [word for word in pharma.split(" ") if word.isalpha()]

# part of speech tagging
pharma_word_tokens = pos_tag(pharma_word)

# extracting only the adjectives (JJ, JJR, JJS)
pharma_word_adj = [tup[0] for tup in pharma_word_tokens if tup[1] in ["JJ", "JJR", "JJS"]]

# counting the occurrences of each adjective, storing them in the adj_count dataframe
adj_count = {adj: pharma_word_adj.count(adj) for adj in pharma_word_adj}
adj_count = pd.DataFrame(list(adj_count.items()), columns=['adj', 'count'])

# top 5 adjectives by count
adj_count.sort_values("count", ascending=False).head(n=5)

[attached screenshot: resulting top-5 adjective/count dataframe]

jswsean commented 1 year ago

Sorry, just an update--I think I should've used word_tokenize() instead of manually splitting the string as in .split(" ").

Not sure why, but there's a minor output difference with the split command.
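
The difference most likely comes from punctuation handling: with .split(" "), punctuation stays attached to the neighboring word (e.g. "opioids,"), so that token fails .isalpha() and gets dropped, whereas word_tokenize() splits punctuation off into its own token. A quick illustration with a made-up sentence:

from nltk import word_tokenize

sentence = "Opioids, including prescription opioids, can be addictive."

# with split(" "), "Opioids,", "opioids," and "addictive." keep their punctuation,
# fail isalpha(), and get dropped
split_tokens = [w for w in sentence.split(" ") if w.isalpha()]

# with word_tokenize(), the punctuation becomes separate tokens, so those words survive
nltk_tokens = [w for w in word_tokenize(sentence) if w.isalpha()]

print(split_tokens)
print(nltk_tokens)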

rebeccajohnson88 commented 1 year ago

Sorry, just an update--I think I should've used word_tokenize() instead of manually splitting the string as in .split(" ").

Not sure why, but there's a minor output difference with the split command.

Yep, to respond: that's correct!

xinruizhao commented 1 year ago

Hi, for clarification, we are only removing punctuation and digits in this step, but not removing stopwords, right?

rebeccajohnson88 commented 1 year ago

Hi, for clarification, we are only removing punctuation and digits in this step, but not removing stopwords, right?

Yep, correct, since stopword removal can affect the tagging.
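
For anyone curious why the ordering matters: nltk's default tagger uses neighboring words as context, so tagging before versus after stopword removal is not guaranteed to give the same tags. A small (hypothetical) comparison:

from nltk import pos_tag

# same content words, with and without the stopword "the";
# the tags are not guaranteed to match because the tagger looks at surrounding words
print(pos_tag(["the", "more", "effective", "treatment"]))
print(pos_tag(["more", "effective", "treatment"]))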