Open sonali-sr opened 2 years ago
Just confirming -- the adjective tags include JJ, JJR and JJS, right?
The following code gets me the same list of adjectives, but with a different count for opioid:
import pandas as pd
from nltk import pos_tag

# restricting to isalpha(), storing in pharma_word
pharma_word = [word for word in pharma.split(" ") if word.isalpha()]
# part of speech tagging
pharma_word_tokens = pos_tag(pharma_word)
# extracting only the adjectives (JJ, JJR, JJS)
pharma_word_adj = [tup[0] for tup in pharma_word_tokens if tup[1] in ["JJ", "JJR", "JJS"]]
# counting the occurrences of each adjective, storing them in adj_count dataframe
adj_count = {adj: pharma_word_adj.count(adj) for adj in pharma_word_adj}
adj_count = pd.DataFrame(list(adj_count.items()), columns=['adj', 'count'])
adj_count.sort_values("count", ascending=False).head(n=5)
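As an aside on the counting step: calling .count() inside a dict comprehension rescans the whole list once per word. A minimal sketch of the same tally using collections.Counter is below. The tagged list is made-up sample data standing in for pos_tag output (it is not the actual pharma release), and the names tagged, ADJ_TAGS and top5 are hypothetical.

```python
from collections import Counter

# Hypothetical (word, tag) pairs standing in for pos_tag output
tagged = [
    ("new", "JJ"), ("opioid", "JJ"), ("company", "NN"),
    ("larger", "JJR"), ("opioid", "JJ"), ("largest", "JJS"),
    ("new", "JJ"), ("opioid", "JJ"), ("sold", "VBD"),
]

# Penn Treebank adjective tags: base, comparative, superlative
ADJ_TAGS = {"JJ", "JJR", "JJS"}

# Keep only the adjectives and tally them in a single pass
adj_counts = Counter(word for word, tag in tagged if tag in ADJ_TAGS)

# most_common(n) returns (word, count) pairs, highest count first
top5 = adj_counts.most_common(5)
print(top5)
```

If a dataframe is needed, pd.DataFrame(top5, columns=['adj', 'count']) should give the same shape as the adj_count table above.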
Sorry, just an update--I think I should've used word_tokenize() instead of manually splitting the string as in .split(" ").
Not sure why, but there's a minor output difference with the split command.
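One likely reason for the difference: .split(" ") leaves punctuation attached to the neighbouring word, so a token like "opioid," fails .isalpha() and is dropped, whereas a tokenizer that separates punctuation keeps the word. The sketch below shows this on a made-up sentence. The regex tokenizer is only a simplified stand-in for word_tokenize() (an assumption for illustration, not NLTK's actual algorithm), and the variable names are hypothetical.

```python
import re

# Made-up sentence, not taken from the pharma release
text = "The opioid, a new opioid product, was sold in 2019."

# Approach 1: split on single spaces; punctuation stays glued to words,
# so "opioid," and "product," fail the isalpha() check and are dropped
split_tokens = [w for w in text.split(" ") if w.isalpha()]

# Approach 2: simplified stand-in for word_tokenize(); runs of word
# characters and single punctuation marks become separate tokens
regex_tokens = [w for w in re.findall(r"\w+|[^\w\s]", text) if w.isalpha()]

# Compare how many times "opioid" survives each preprocessing route
print(split_tokens.count("opioid"), regex_tokens.count("opioid"))
```

On this sample the split-based route keeps fewer copies of "opioid" than the tokenizer-style route, which is the kind of small count difference described above.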
Yep, to respond: split or word_tokenize is fine; the preprocessing produces slightly different outputs.
Hi, for clarification, we are only removing punctuation and digits in this step, but not removing stopwords, right?
Yep, correct, since stopword removal can affect the tagging.
1.1 part of speech tagging (3 points)
A. Preprocess the pharma press release to remove all punctuation / digits (so can use .isalpha() to subset)
B. With the preprocessed press release from part A, use the part of speech tagger within nltk to tag all the words in that one press release with their part of speech.
C. Using the output from B, extract the adjectives and sort those adjectives from most occurrences to fewest occurrences. Print a dataframe with the 5 most frequent adjectives and their counts in the pharma release. See here for a list of the names of adjectives within nltk: https://pythonprogramming.net/natural-language-toolkit-nltk-part-speech-tagging/
Resources:
Documentation for .isalpha(): https://www.w3schools.com/python/ref_string_isalpha.asp
process_step1 function here has an example of tokenizing and filtering to words where .isalpha() is true: https://github.com/rebeccajohnson88/PPOL564_slides_activities/blob/main/activities/fall_22/solutions/09_textasdata_partII_topicmodeling_solution.ipynb
Part of speech tagging section of this code: https://github.com/rebeccajohnson88/PPOL564_slides_activities/blob/main/activities/fall_22/solutions/08_textasdata_partI_textmining_solutions.ipynb