rebeccajohnson88 / PPOL564_slides_activities

Repo for Georgetown's McCourt School of Public Policy Data Science I course (PPOL 564)
Creative Commons Zero v1.0 Universal

1.3 - A and B #53

Open sonali-sr opened 1 year ago

sonali-sr commented 1 year ago

1.3 sentiment analysis (10 points)

Sentiment analysis section of this script: https://github.com/rebeccajohnson88/PPOL564_slides_activities/blob/main/activities/fall_22/solutions/08_textasdata_partI_textmining_solutions.ipynb

A. Subset the press releases to those labeled with one of three topics via topics_clean: Civil Rights, Hate Crimes, and Project Safe Childhood. We'll call this doj_subset going forward and it should have 717 rows.

B. Write a function that takes one press release string as an input and:

Hints:

I used a function plus a list comprehension to execute this, and it takes about 30 seconds on my local machine; if it's taking a very long time, you may want to check your code for inefficiencies. If you can't fix those, then for partial credit on this part (and full credit on the remainder) you can take a small random sample of the 717.
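For the sampling fallback, a minimal pandas sketch (the toy DataFrame and sample size here are illustrative stand-ins, not the assignment's data):

```python
import pandas as pd

# toy stand-in for doj_subset (the real one has 717 rows of press releases)
doj_subset = pd.DataFrame({"contents": [f"press release {i}" for i in range(717)]})

# take a reproducible random sample of, say, 100 press releases
small_sample = doj_subset.sample(n=100, random_state=42)
print(small_sample.shape)  # (100, 1)
```

Setting `random_state` keeps the sample reproducible across reruns, which matters if the sentiment results are discussed in the write-up.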

jswsean commented 1 year ago

Hi Professor @rebeccajohnson88, so I ran the following code:

# storing the list of topics to be matched 
topics_match = ['Civil Rights', 'Hate Crimes', 'Project Safe Childhood']
pattern = "|".join(topics_match)

# filtering based on the pattern
doj[doj.topics_clean.str.contains(pattern, regex=True)].shape

and got (856, 6) as my output, instead of having 717 rows.

Am I misunderstanding the instructions? Are we supposed to include rows with more than one matched topic (e.g., Hate Crimes; Project Safe Childhood), or only rows with exactly one matched topic (i.e., exactly one of the three topics)?

rebeccajohnson88 commented 1 year ago


either is fine, but the 717 rows correspond to this approach: only including rows with exactly one matched topic (i.e., the label is exactly one of the three topics) - so use isin rather than the str.contains method

we filter to press releases with a single topical category for easier interpretation, but we're fine with either that approach or keeping the multiple-topic ones in
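As a sketch of the difference on toy data (the column name `topics_clean` is from the assignment; the rows themselves are made up):

```python
import pandas as pd

# toy stand-in for the doj data
doj = pd.DataFrame({"topics_clean": [
    "Civil Rights",
    "Hate Crimes",
    "Hate Crimes; Project Safe Childhood",  # multi-topic row
    "Tax",
]})

topics_match = ["Civil Rights", "Hate Crimes", "Project Safe Childhood"]

# str.contains: keeps any row whose label *contains* one of the topics,
# so the multi-topic row survives -> 3 rows here
contains_subset = doj[doj.topics_clean.str.contains("|".join(topics_match), regex=True)]

# isin: exact match against the three labels, so the multi-topic row drops -> 2 rows
isin_subset = doj[doj.topics_clean.isin(topics_match)]

print(contains_subset.shape[0], isin_subset.shape[0])  # 3 2
```

On the real data this gap is the 856 vs. 717 row difference discussed above.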

sanhatahir commented 1 year ago

Hi Professor! When you ask, "Apply that function to each of the press releases in doj_subset":

What kind of output are we expecting? Would it be 717 lines of content and their corresponding dictionaries?

rebeccajohnson88 commented 1 year ago


[image attached]
rebeccajohnson88 commented 1 year ago

From student:

I would like to know how to apply the function to every entry of contents in doj_subset. The idea of my function is to conduct sentiment analysis for every row/content by using the index (I don't know if I'm calling this correctly). But I don't know how to calculate the scores for every row efficiently. I am thinking about something like for...in...if..., but I am not sure exactly how to use it for this question.

[image attached]

Response:

I would check out the code in this activity--- https://github.com/rebeccajohnson88/PPOL564_slides_activities/blob/main/activities/fall_22/solutions/08_textasdata_partI_textmining_solutions.ipynb

it's doing something other than sentiment scoring but the general setup should work

## define function that takes in one airbnb string
## (assumes spaCy is already loaded, e.g.: import spacy; nlp = spacy.load("en_core_web_sm"))
def get_gpe(one_string):

    tagged_str = nlp(one_string) # tag the string

    ## this line (1) iterates over entities (for one_tok in tagged_str.ents),
    ## (2) checks if the label is GPE (place), and (3) if so, returns the
    ## original entity text
    all_gpe = [one_tok.text for one_tok in tagged_str.ents 
                  if one_tok.label_ == "GPE"]
    return all_gpe

## executing it over our examples --- see that
## upper west side is extracted from the first; two are blank (missed central manhattan,
## possibly due to the /); then it gets manhattan and west village
all_gpe = [get_gpe(one_string) for one_string in ab_name_examples]
all_gpe

i think it's best for (1) the function to take in a single press release rather than the entire dataframe, and (2) you to then execute the function by iterating over doj_subset.contents with a list comprehension - feel free to follow up w/ other questions!
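A minimal sketch of that setup on toy data (the scoring function below is a hypothetical placeholder, not the assignment's sentiment scorer):

```python
import pandas as pd

# toy stand-in for doj_subset; the real data has 717 press releases in .contents
doj_subset = pd.DataFrame({"contents": [
    "The department secured a favorable settlement.",
    "Prosecutors condemned the horrific attack.",
]})

# hypothetical per-release function -- the real one would compute sentiment
# scores; here we just count words so the sketch runs anywhere
def score_one_release(one_string):
    return len(one_string.split())

# iterate over the contents column, NOT the whole DataFrame
all_scores = [score_one_release(one_release) for one_release in doj_subset.contents]
print(all_scores)  # [6, 5]
```

The key point is the iterable: `doj_subset.contents` hands the function one string at a time, whereas iterating over `doj_subset` itself yields column names.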

minahkim-dspp commented 1 year ago

Hello! To run the re.sub function, I merged all the entities into a string with the or operator:

spacy_press = nlp(press)
spacy_list = [str(entity) for entity in spacy_press.ents]
pattern = "r" + '"' + "|".join(spacy_list) + '"'

Then, I realized that re.sub removes only "Louisiana", not other long proper nouns, such as "Louisiana State Penitentiary" or "Louisiana Corey Amundson".

It also raises errors when the named entities contain "(" or other characters that have special meaning in regex. Currently, I am just making my function skip the dropping step for these press releases... which is something I'm not sure I'm supposed to do.

How did you solve this issue?

Mag-Sul commented 1 year ago

I'm also stuck on the re.sub part of the function. I'm testing out the process for one test string before I create the function itself. This is what I tried, but it returns an empty string. Any tips?

[screenshot attached: Screen Shot 2022-11-02 at 10.11.07 PM]
rebeccajohnson88 commented 1 year ago

tagging for @sonali-sr and @YfLiu-61F - the quick advice is:

  • spacy_test.ents is a special type of tagged object; calling str(spacy_test.ents) does not produce what you want it to -- it does produce a string, but it's an odd string that concatenates the entire list of entities
  • instead, and relating to the Slack discussion between Ethan and Keya, for each entity in spacy_test.ents you'll likely want to use a list comprehension (or something like it) to iterate over the distinct entities and then, for each entity, use the ent.text attribute to extract the string form of that entity

Mag-Sul commented 1 year ago

UPDATE - RESOLVED OVER SLACK

This is helpful, thank you! I was able to get my function to run, but it's not returning what I'd hoped and results in an error. Any idea where I might be going wrong?

[screenshot attached: Screen Shot 2022-11-03 at 11.29.49 AM]
xinruizhao commented 1 year ago

[image attached]

Hi, my code for defining and executing the function is in the attached screenshot. The results are very unexpected. I was wondering if there is anything wrong with my code. Thank you in advance!

rebeccajohnson88 commented 1 year ago

@xinruizhao i think the issue is that you're feeding the function the entire doj_subset data frame rather than doj_subset.contents in your list comprehension, so you probably want to edit it so that you're iterating over doj_subset.contents rather than the whole df (and maybe add a print statement to double check)

xinruizhao commented 1 year ago


[image attached]

Hi Professor, when I changed from doj_subset to doj_subset.contents, the error became the one in the attached screenshot. I was wondering if there is anything wrong with the definition of the function.

sanhatahir commented 1 year ago

Hi @xinruizhao !

I faced the same issue; as far as I could make out, it's because there's an open bracket "(" in one of the named entities, and Python thinks this is a special character instead of a regular character that you need to match exactly.

This page might be useful: https://stackoverflow.com/questions/41425945/python-pandas-error-missing-unterminated-subpattern-at-position-2

Essentially, you'll want to wrap the named entities in the re.escape function before joining them - this puts a backslash in front of special characters so the regex engine treats them as regular, literal characters.

(Though again, my understanding might be extremely extremely wrong)
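A small sketch of the failure mode and the re.escape fix (the entity strings below are made up; note that a truly unbalanced "(" would instead raise re.error with the "missing ), unterminated subpattern" message):

```python
import re

entities = ["Department of Justice (DOJ)"]
text = "Officials at the Department of Justice (DOJ) issued a statement."

# raw join: "(" opens a regex group, so the parentheses are NOT matched
# literally and nothing in the text matches -- the text comes back unchanged
raw_pattern = "|".join(entities)
print(re.sub(raw_pattern, "", text))

# re.escape backslash-escapes metacharacters so the parens match literally,
# and the entity is actually removed
safe_pattern = "|".join(re.escape(ent) for ent in entities)
print(re.sub(safe_pattern, "", text))
```

The same `re.escape(...)` call inside the join is all the function needs; no skipping of problem press releases required.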

rebeccajohnson88 commented 1 year ago


that's 100% correct @sanhatahir! thanks for helping out and hopefully that helps answer your q @xinruizhao
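Putting the thread's advice together, a hedged sketch (the entity list is hard-coded here so it runs without a spaCy model; in the real function it would come from `[ent.text for ent in nlp(press).ents]`). Sorting longest-first also addresses the earlier issue where only "Louisiana" was removed but longer proper nouns were not:

```python
import re

# in the real code these strings would come from spaCy:
#   spacy_press = nlp(press)
#   entity_strings = [ent.text for ent in spacy_press.ents]
entity_strings = ["Louisiana", "Louisiana State Penitentiary", "Corey Amundson"]

press = ("Corey Amundson announced charges related to events at the "
         "Louisiana State Penitentiary in Louisiana.")

# sort longest-first so "Louisiana State Penitentiary" is removed before the
# shorter "Louisiana" can eat only part of it, and escape each entity so
# regex metacharacters like "(" are matched literally
pattern = "|".join(re.escape(ent) for ent in
                   sorted(entity_strings, key=len, reverse=True))

stripped = re.sub(pattern, "", press)
print(stripped)
```

Alternation in Python's re module tries branches left to right, which is why the sort order matters when one entity is a prefix of another.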