Closed by rebeccajohnson88 2 years ago
Any idea why mine might be getting rid of 'nine' in the first one? I assume it is fine but was curious as to why the difference might occur
hm it depends on your preprocessing function; maybe you're requiring at least 5 characters in a word?
Should all of the words in the preprocessed text be at least 4 characters before stemming (and may possibly end up less than 4 characters after stemming) or should they be at least 4 characters after stemming is completed?
at least 4 characters before stemming
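To make the ordering concrete, here is a minimal sketch of the two options, plus what happens if the threshold is 5 instead of 4. `toy_stem` is a made-up stand-in for a real stemmer (it only strips one trailing "s"), and the function names and word list are hypothetical, not taken from the assignment.

```python
def toy_stem(word):
    # Hypothetical stand-in for a real stemmer: strips one trailing "s".
    return word[:-1] if word.endswith("s") else word


def filter_then_stem(words, min_len=4):
    # Length check runs on the original word, so a stem may end up shorter.
    return [toy_stem(w) for w in words if len(w) >= min_len]


def stem_then_filter(words, min_len=4):
    # Length check runs on the stem instead.
    stems = [toy_stem(w) for w in words]
    return [s for s in stems if len(s) >= min_len]


words = ["nine", "cars", "was"]
print(filter_then_stem(words))             # ['nine', 'car']
print(stem_then_filter(words))             # ['nine']
print(filter_then_stem(words, min_len=5))  # [] ('nine' has only 4 characters)
```

With the check applied before stemming, "cars" survives and becomes the 3-character stem "car". A threshold of 5 would explain "nine" disappearing.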
2.1 Preprocess the data by removing stopwords, punctuation, and non-alpha words (5 points)
A. Write a function that:
- Takes in a single raw string in the `contents` column from that dataframe
- Does the following preprocessing steps:
- Returns a joined preprocessed string
B. Use `apply` or list comprehension to execute that function and create a new column in the data called `processed_text`
C. Print the `id`, `contents`, and `processed_text` columns for the following press releases:
- id = 16-718 (this case: https://www.seattletimes.com/nation-world/doj-miami-police-reach-settlement-in-civil-rights-case/)
- id = 16-217 (this case: https://www.wlbt.com/story/32275512/three-mississippi-correctional-officers-indicted-for-inmate-assault-and-cover-up/)
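A rough sketch of how parts A to C could fit together, using only the standard library. Everything specific here is an assumption rather than the assignment's reference solution: the tiny `STOPWORDS` set, the `toy_stem` stand-in for a real stemmer, lowercasing, the order of the steps, and the example record (its id and text are made up, not a real press release). The 4-character minimum is applied before stemming, as clarified in the thread.

```python
import string

# Tiny illustrative stopword list (assumption); a real run would use a fuller one.
STOPWORDS = {"the", "and", "with", "for", "were", "that", "this", "from"}


def toy_stem(word):
    # Hypothetical stand-in for a real stemmer: strips one trailing "s".
    return word[:-1] if word.endswith("s") else word


def preprocess(raw, min_len=4):
    # Lowercase, split on whitespace, and trim punctuation from token edges.
    tokens = [t.strip(string.punctuation) for t in raw.lower().split()]
    # Keep alphabetic, non-stopword tokens that are long enough before stemming.
    kept = [
        t for t in tokens
        if t.isalpha() and t not in STOPWORDS and len(t) >= min_len
    ]
    # Stem last, then join back into a single string.
    return " ".join(toy_stem(t) for t in kept)


# Made-up record standing in for one row of the dataframe.
records = [
    {"id": "example-1",
     "contents": "Three officers were indicted, and 9 charges followed."},
]

# List-comprehension version of part B.
for r in records:
    r["processed_text"] = preprocess(r["contents"])

# Part C: print the three columns for the rows of interest.
for r in records:
    print(r["id"], "|", r["contents"], "|", r["processed_text"])
# example-1 | Three officers were indicted, and 9 charges followed. | three officer indicted charge followed
```

On a pandas dataframe named `df` (an assumed name), the equivalent of the list comprehension would be `df["processed_text"] = df["contents"].apply(preprocess)`.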
Resources:
Output for those two press releases (cut off at the end, but you can get the gist):