rebeccajohnson88 / qss20_slides_activities

Repo for slides and activities for qss 20

PS4 2.1 #32

Closed by rebeccajohnson88 2 years ago

rebeccajohnson88 commented 2 years ago

2.1 Preprocess the data by removing stopwords, punctuation, and non-alpha words (5 points)

A. Write a function that:

B. Use apply or list comprehension to execute that function and create a new column in the data called processed_text

C. Print the id, contents, and processed_text columns for the following press releases:

id = 16-718 (this case: https://www.seattletimes.com/nation-world/doj-miami-police-reach-settlement-in-civil-rights-case/)

id = 16-217 (this case: https://www.wlbt.com/story/32275512/three-mississippi-correctional-officers-indicted-for-inmate-assault-and-cover-up/)
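As a rough illustration of parts A and B, here is a minimal sketch using only the standard library. The stopword set, the `preprocess` name, and the example rows are all made up for illustration; the assignment presumably uses a fuller stopword list and the real press-release data, and the full list of steps for the function is not reproduced here.

```python
import string

# Hypothetical, tiny stopword set for illustration only; a real run would
# presumably use a fuller list (for example, one from an NLP library).
STOPWORDS = {"the", "a", "an", "of", "and", "in", "to", "for", "with", "was", "were"}


def preprocess(text, stopwords=STOPWORDS):
    """Lowercase, split on whitespace, strip surrounding punctuation, and
    keep only alphabetic tokens that are not stopwords."""
    tokens = text.lower().split()
    # Remove leading/trailing punctuation from each token
    stripped = [tok.strip(string.punctuation) for tok in tokens]
    # str.isalpha() is False for digits, empty strings, and tokens with
    # internal punctuation such as "city's"
    kept = [tok for tok in stripped if tok.isalpha() and tok not in stopwords]
    return " ".join(kept)


# Made-up rows standing in for the press-release data
rows = [
    {"id": "ex-1", "contents": "The officers were indicted in 2016, for assault."},
    {"id": "ex-2", "contents": "A settlement was reached with the city's police."},
]

# List comprehension that applies the function to every row
processed = [preprocess(row["contents"]) for row in rows]
for row, text in zip(rows, processed):
    row["processed_text"] = text
```

With a pandas DataFrame named `df` (an assumed name), the same list comprehension could be assigned directly, as in `df["processed_text"] = [preprocess(t) for t in df["contents"]]`.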

Resources:

Output for those two press releases (cut off at the end, but you can get the gist):

[Two images showing the output]

MarcoAllen1 commented 2 years ago

Any idea why mine might be getting rid of 'nine' in the first one? I assume it is fine but was curious as to why the difference might occur

rebeccajohnson88 commented 2 years ago

hm it depends on your preprocessing function; maybe you're requiring at least 5 characters in a word?
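To make that guess concrete: 'nine' has exactly 4 characters, so a cutoff of 5 would drop it while a cutoff of 4 would keep it. A quick sketch, where `filter_by_length` and the token list are hypothetical and not taken from anyone's actual solution:

```python
def filter_by_length(tokens, min_chars):
    """Keep only tokens with at least min_chars characters."""
    return [tok for tok in tokens if len(tok) >= min_chars]


tokens = ["nine", "officers", "city"]  # made-up tokens

at_least_4 = filter_by_length(tokens, 4)  # 'nine' survives
at_least_5 = filter_by_length(tokens, 5)  # 'nine' and 'city' are dropped
```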

emmaw219 commented 2 years ago

Should all of the words in the preprocessed text be at least 4 characters before stemming (and may possibly end up less than 4 characters after stemming) or should they be at least 4 characters after stemming is completed?

rebeccajohnson88 commented 2 years ago

at least 4 characters before stemming
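A small sketch of why the order matters. `toy_stem` is a deliberately crude, made-up stand-in for a real stemmer, and the tokens are invented; only the ordering of the two steps is the point.

```python
def toy_stem(word):
    """Toy stand-in for a real stemmer: strips one common suffix."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 1:
            return word[: -len(suffix)]
    return word


def filter_then_stem(tokens, min_chars=4):
    """Length check on the original word, then stem (the order asked for)."""
    return [toy_stem(tok) for tok in tokens if len(tok) >= min_chars]


def stem_then_filter(tokens, min_chars=4):
    """Stem first, then length check on the stem (shown only for contrast)."""
    stems = [toy_stem(tok) for tok in tokens]
    return [s for s in stems if len(s) >= min_chars]


tokens = ["used", "officers", "was"]  # made-up tokens

before = filter_then_stem(tokens)  # short stems like "us" can remain
after = stem_then_filter(tokens)   # short stems are dropped
```

So with the length check applied before stemming, the processed text can still contain stems shorter than 4 characters.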