PS4 2.1 - Githubissues

rebeccajohnson88 commented 2 years ago

2.1 Preprocess the data by removing stopwords, punctuation, and non-alpha words (5 points)

A. Write a function that:

- Takes in a single raw string in the contents column from that dataframe
- Does the following preprocessing steps:
  - Converts the words to lowercase
  - Removes stopwords, adding the custom stopwords in your code cell below to the default stopwords list
  - Only retains alpha words (so removes digits and punctuation)
  - Only retains words 4 characters or longer
  - Uses the snowball stemmer from nltk to stem
- Returns a joined preprocessed string
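
For reference, the steps in part A could be sketched roughly as below. This is one possible layout, not the course solution: the names `preprocess` and `make_nltk_preprocessor`, the `min_len` parameter, and the regex tokenization are all assumptions (the prompt doesn't name a tokenizer). The stopword list and stemmer are passed in as arguments so the core logic can be checked without nltk; the helper then wires in nltk's English stopwords and `SnowballStemmer`, and assumes the stopwords corpus has already been downloaded.

```python
import re
from typing import Callable, Iterable


def preprocess(raw_text: str,
               stop_words: Iterable[str],
               stem: Callable[[str], str],
               min_len: int = 4) -> str:
    """Lowercase, drop stopwords / non-alpha / short words, stem, and re-join."""
    stop_set = set(stop_words)
    # Assumption: split on runs of word characters; the assignment does not
    # specify a tokenizer, so nltk's word_tokenize would be another option.
    tokens = re.findall(r"\w+", raw_text.lower())
    kept = [
        tok for tok in tokens
        if tok not in stop_set      # remove default + custom stopwords
        and tok.isalpha()           # drops digits and mixed tokens like "3rd"
        and len(tok) >= min_len     # length is checked BEFORE stemming
    ]
    return " ".join(stem(tok) for tok in kept)


def make_nltk_preprocessor(custom_stopwords: Iterable[str]) -> Callable[[str], str]:
    """Hypothetical helper: bind nltk's stopwords and snowball stemmer.

    Requires nltk plus the stopwords corpus (nltk.download("stopwords")).
    """
    from nltk.corpus import stopwords
    from nltk.stem.snowball import SnowballStemmer

    all_stops = set(stopwords.words("english")) | {w.lower() for w in custom_stopwords}
    stemmer = SnowballStemmer("english")
    return lambda raw_text: preprocess(raw_text, all_stops, stemmer.stem)
```

With nltk set up, something like `preprocess_fn = make_nltk_preprocessor(custom_stopwords)` would give a one-argument function to use in part B, where `custom_stopwords` stands for whatever list you define in your code cell.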

B. Use apply or list comprehension to execute that function and create a new column in the data called processed_text
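
As a rough illustration of part B, both routes give the same column. The dataframe below is toy data (the two ids come from part C, but the contents are placeholders, not the real press releases), and `preprocess_stub` is a hypothetical stand-in for your own function from part A.

```python
import pandas as pd


def preprocess_stub(raw_text: str) -> str:
    """Hypothetical stand-in for the real preprocessing function from part A."""
    words = raw_text.lower().split()
    return " ".join(w for w in words if w.isalpha() and len(w) >= 4)


# Toy data: placeholder contents, not the actual press releases.
df = pd.DataFrame({
    "id": ["16-718", "16-217"],
    "contents": ["Placeholder text one", "Placeholder text two"],
})

# Option 1: apply the function to each string in the column.
df["processed_text"] = df["contents"].apply(preprocess_stub)

# Option 2: list comprehension over the same column (same result).
processed_alt = [preprocess_stub(text) for text in df["contents"]]
```

Either version works; `apply` keeps the original index attached, while the list comprehension is just a plain list you would assign back to the column.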

C. Print the id, contents, and processed_text columns for the following press releases:

id = 16-718 (this case: https://www.seattletimes.com/nation-world/doj-miami-police-reach-settlement-in-civil-rights-case/)

id = 16-217 (this case: https://www.wlbt.com/story/32275512/three-mississippi-correctional-officers-indicted-for-inmate-assault-and-cover-up/)
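
For part C, one way to pull those two rows is to filter on the id column with `isin`. This sketch again uses toy data: the row with id `16-000` and all of the text values are made up for illustration, and it assumes the ids are stored as strings.

```python
import pandas as pd

# Toy data: "16-000" and every text value here are placeholders.
df = pd.DataFrame({
    "id": ["16-718", "16-000", "16-217"],
    "contents": ["raw text A", "raw text B", "raw text C"],
    "processed_text": ["text", "text", "text"],
})

target_ids = ["16-718", "16-217"]

# Keep only the requested press releases and the three requested columns.
subset = df.loc[df["id"].isin(target_ids), ["id", "contents", "processed_text"]]

# Widen the column display so long text is not truncated when printing.
with pd.option_context("display.max_colwidth", None):
    print(subset)
```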

Resources:

- Here are code examples for the snowball stemmer: https://www.geeksforgeeks.org/snowball-stemmer-nlp/
- Here's more condensed code with topic modeling steps: https://github.com/rebeccajohnson88/qss20_slides_activities/blob/main/activities/w22_activities/solutions/06_textasdata_partII_topicmodeling_solution.ipynb
- Here's longer code with more broken-out topic modeling steps: https://github.com/rebeccajohnson88/qss20_slides_activities/blob/main/activities/w22_activities/06_textasdata_partII_topicmodeling.ipynb

Output for those two press releases (cut off at the end, but you can get the gist):

MarcoAllen1 commented 2 years ago

Any idea why mine might be getting rid of 'nine' in the first one? I assume it is fine but was curious as to why the difference might occur

rebeccajohnson88 commented 2 years ago

> Any idea why mine might be getting rid of 'nine' in the first one? I assume it is fine but was curious as to why the difference might occur

hm it depends on your preprocessing function; maybe you're requiring at least 5 characters in a word?

emmaw219 commented 2 years ago

Should all of the words in the preprocessed text be at least 4 characters before stemming (and may possibly end up less than 4 characters after stemming) or should they be at least 4 characters after stemming is completed?

rebeccajohnson88 commented 2 years ago

> Should all of the words in the preprocessed text be at least 4 characters before stemming (and may possibly end up less than 4 characters after stemming) or should they be at least 4 characters after stemming is completed?

at least 4 characters before stemming
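
To make the difference concrete, here is a small sketch comparing the two orders. `toy_stem` is a made-up stand-in that only strips a trailing "s", so the snippet runs without nltk; in the assignment you would pass the `stem` method of nltk's `SnowballStemmer("english")` instead, and its stems may differ. The function names are hypothetical too.

```python
from typing import Callable, List


def filter_then_stem(words: List[str], stem: Callable[[str], str],
                     min_len: int = 4) -> List[str]:
    """Length check on the ORIGINAL word, then stem (what the assignment asks for)."""
    return [stem(w) for w in words if len(w) >= min_len]


def stem_then_filter(words: List[str], stem: Callable[[str], str],
                     min_len: int = 4) -> List[str]:
    """Stem first, then length check on the STEM (not what is asked for)."""
    stems = [stem(w) for w in words]
    return [s for s in stems if len(s) >= min_len]


def toy_stem(word: str) -> str:
    """Toy stemmer for illustration only: strips one trailing 's'."""
    return word[:-1] if word.endswith("s") else word


words = ["uses", "cases", "law"]
before = filter_then_stem(words, toy_stem)  # keeps the 3-character stem "use"
after = stem_then_filter(words, toy_stem)   # drops "use" because the stem is short
```

So with the before-stemming rule, seeing a few stems shorter than 4 characters in `processed_text` is expected.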

rebeccajohnson88 / qss20_slides_activities

PS4 2.1 #32
