rebeccajohnson88 / PPOL564_slides_activities

Repo for Georgetown McCourt's School of Public Policy's Data Science I (PPOL 564)
Creative Commons Zero v1.0 Universal

1.3 - A and B #53

Open sonali-sr opened 2 years ago

sonali-sr commented 2 years ago

1.3 sentiment analysis (10 points)

Sentiment analysis section of this script: https://github.com/rebeccajohnson88/PPOL564_slides_activities/blob/main/activities/fall_22/solutions/08_textasdata_partI_textmining_solutions.ipynb

A. Subset the press releases to those labeled with one of three topics via topics_clean: Civil Rights, Hate Crimes, and Project Safe Childhood. We'll call this doj_subset going forward and it should have 717 rows.

B. Write a function that takes one press release string as an input and:

Hints:

I used a function + list comprehension to execute and it takes about 30 seconds on my local machine; if it's taking a very long time, you may want to check your code for inefficiencies. If you can't fix those, for partial credit on this part/full credit on remainder, you can take a small random sample of the 717

jswsean commented 2 years ago

Hi Professor @rebeccajohnson88, so I ran the following code:

# storing the list of topics to be matched 
topics_match = ['Civil Rights', 'Hate Crimes', 'Project Safe Childhood']
pattern = "|".join(topics_match)

# filtering based on the pattern
doj[doj.topics_clean.str.contains(pattern, regex=True)].shape

and got (856, 6) as my output, instead of having 717 rows.

Am I misunderstanding the instructions? Should we include rows with more than one matched topic (e.g. Hate Crimes; Project Safe Childhood), or only rows whose label is exactly one of the three topics?

rebeccajohnson88 commented 2 years ago

either is fine, but the 717 rows correspond to the second approach: keeping only rows whose topics_clean value is exactly one of the three topics. so that means isin rather than str.contains with an "or" pattern

we filter to press releases with a single topical category for easier interpretation but are fine w/ that or you keeping the multiple topic ones in
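To make the difference between the two filters concrete, here is a minimal sketch on a toy data frame. The four rows are made up for illustration and are not the real press releases; only the column name topics_clean and the three topic labels come from the assignment.

```python
import pandas as pd

# Toy stand-in for the doj data (hypothetical rows, not the real press releases)
doj = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "topics_clean": [
        "Civil Rights",
        "Hate Crimes; Project Safe Childhood",
        "Tax",
        "Project Safe Childhood",
    ],
})

topics_match = ["Civil Rights", "Hate Crimes", "Project Safe Childhood"]

# str.contains with an "or" pattern keeps any row that mentions a topic,
# including rows labeled with several topics
contains_subset = doj[doj.topics_clean.str.contains("|".join(topics_match), regex=True)]

# isin keeps only rows whose label is exactly one of the three topics
doj_subset = doj[doj.topics_clean.isin(topics_match)].copy()

print(contains_subset.shape, doj_subset.shape)
```

On the toy data, the multi-topic row survives the str.contains filter but not the isin filter, which is the same reason the real data gives 856 rows one way and 717 the other.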

sanhatahir commented 2 years ago

Hi Professor! A question about the instruction "Apply that function to each of the press releases in doj_subset."

What kind of input are we expecting? Would it be 717 lines of content and their corresponding dictionaries?

rebeccajohnson88 commented 2 years ago

image
rebeccajohnson88 commented 2 years ago

From student:

I would like to know how to apply the function to every content in doj_subset. The idea of my function is to conduct sentiment analysis for every row/content by using the (index? I don't know if I call this correctly). But I don't know how to calculate the scores for every row efficiently. I am thinking about something like for...in...if.., however, I am not sure how exactly to use it for this question.

image

Response:

I would check out the code in this activity--- https://github.com/rebeccajohnson88/PPOL564_slides_activities/blob/main/activities/fall_22/solutions/08_textasdata_partI_textmining_solutions.ipynb

it's doing something other than sentiment scoring but the general setup should work

## load spaCy and an English pipeline (the model name here is an assumption;
## use whichever pipeline the activity loads)
import spacy
nlp = spacy.load("en_core_web_sm")

## define function that takes in one airbnb string
def get_gpe(one_string):

    tagged_str = nlp(one_string) # tag

    ## this line (1) iterates over entities (for one_tok in tagged_str.ents),
    ## (2) checks if the label is gpe (place), and (3) if so, returns the 
    ## original entity
    all_gpe = [one_tok.text for one_tok in tagged_str.ents 
                  if one_tok.label_ == "GPE"]
    return all_gpe

## executing it over our examples (ab_name_examples is defined earlier in the activity) --- see that
## upper west side extracted from first; two blank (missed the central manhattan, possibly
## due to the /), then gets manhattan and west village
all_gpe = [get_gpe(one_string) for one_string in ab_name_examples]
all_gpe

i think it's best for (1) the function to take in a single press release rather than the entire dataframe and (2) you can then execute the function by iterating using list comprehension over doj_subset.contents - feel free to follow up w/ other questions!
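A minimal sketch of that setup for scoring, under loud assumptions: the tiny word lists and the score_one_release function below are hypothetical placeholders so the example runs on its own, and the list contents stands in for doj_subset.contents. Whatever sentiment scorer the assignment specifies would replace the body of the function.

```python
# Hypothetical toy lexicon; a real sentiment scorer would replace this
POSITIVE = {"justice", "protect"}
NEGATIVE = {"hate", "assault"}

def score_one_release(one_string):
    """Take ONE press release string and return a dictionary of scores."""
    tokens = one_string.lower().split()
    pos = sum(tok in POSITIVE for tok in tokens)
    neg = sum(tok in NEGATIVE for tok in tokens)
    return {"pos": pos, "neg": neg, "net": pos - neg}

# Stand-in for doj_subset.contents (made-up sentences)
contents = [
    "The division works to protect voters",
    "He was convicted of a hate crime assault",
]

# One function call per press release; the result is one dictionary per row
all_scores = [score_one_release(one_string) for one_string in contents]
print(all_scores)
```

With this shape, running over the full subset gives a list with one dictionary per press release, in the same order as doj_subset.contents.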

minahkim-dspp commented 2 years ago

Hello! To run the re.sub function, I merged all the entities into a string with the or operator:

spacy_press = nlp(press)
spacy_list = [str(entity) for entity in spacy_press.ents]
pattern = "r"+'"'+ "|".join(spacy_list)+'"'

Then, I realized that re.sub removes only "Louisiana", not the longer proper nouns, such as "Louisiana State Penitentiary" or "Louisiana Corey Amundson".

It also raises errors when the named entities contain ( or other characters that have a special meaning in regex. Currently, I am just making my function skip the dropping step for these press releases, which is something I'm not sure I'm supposed to do.

How did you solve this issue?
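One possible cause of the "only Louisiana is removed" behavior, sketched with made-up strings: in a regex alternation the alternatives are tried left to right, so a short entity listed before a longer one that starts with it will match first. Sorting the entities longest first is one way around that. Separately, note that "r" + '"' puts a literal r and a quote character inside the pattern; the r prefix only has meaning on a string literal in source code. The entity list and sentence below are hypothetical, not taken from the data.

```python
import re

# Hypothetical entity strings standing in for the entities spaCy would return
entities = ["Louisiana", "Louisiana State Penitentiary"]
press = "An inmate at Louisiana State Penitentiary in Louisiana was sentenced."

# Shorter alternative first: "Louisiana" matches and the rest of the name stays
naive = re.sub("|".join(entities), "", press)

# Longest first, so the full name is tried before its prefix
ordered = sorted(entities, key=len, reverse=True)
longest_first = re.sub("|".join(ordered), "", press)

print(naive)
print(longest_first)
```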

Mag-Sul commented 2 years ago

I'm also stuck on the re.sub part of the function. I'm testing out the process for one test string before I create the function itself. This is what I tried, but it returns an empty string. Any tips?

Screen Shot 2022-11-02 at 10 11 07 PM
rebeccajohnson88 commented 2 years ago

tagging for @sonali-sr and @YfLiu-61F - the quick advice is:

Mag-Sul commented 2 years ago

tagging for @sonali-sr and @YfLiu-61F - the quick advice is:

  • spacy_test.ents is a special type of tagged object; calling str(spacy_test.ents) does not produce what you want it to-- it does produce a string but it's a weird string that puts together the entire list of entities
  • instead, and relating to slack discussion b/t ethan and keya, for each entity in spacy_test.ents, you'll likely want to use list comprehension or something like that to iterate over the distinct entities and then, for each entity, we want to use the ent.text attribute to extract the string form of that entity
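The two bullets above can be sketched as follows. So that it runs without spaCy installed, the sketch uses a hypothetical FakeEnt class that mimics only two things about a spaCy entity: it has a .text attribute and it prints as its text. With real spaCy, ents would be the .ents of a tagged document instead.

```python
class FakeEnt:
    """Hypothetical stand-in for a spaCy entity; only .text and its printed form are mimicked."""
    def __init__(self, text):
        self.text = text

    def __repr__(self):
        return self.text

# spaCy exposes the entities of a document as a tuple; this tuple mimics that shape
ents = (FakeEnt("Louisiana"), FakeEnt("Corey Amundson"))

# str() on the whole tuple gives ONE string, parentheses and commas included
as_one_string = str(ents)

# Iterating and taking .text gives one clean string per entity
ent_strings = [ent.text for ent in ents]

print(as_one_string)
print(ent_strings)
```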

UPDATE - RESOLVED OVER SLACK This is helpful thank you! I was able to get my function to run but it's not returning what I'd hoped and results in an error. Any idea where I might be going wrong?

Screen Shot 2022-11-03 at 11 29 49 AM
xinruizhao commented 2 years ago

image Hi, my code for defining and executing the function looks like this. The results are very unexpected. I was wondering if there is anything wrong with my code. Thank you in advance!

rebeccajohnson88 commented 2 years ago

@xinruizhao i think the issue is that you're feeding the function the entire doj_subset data frame rather than doj_subset.contents in your list comprehension- so you probably want to edit it so that you're iterating over doj_subset.contents rather than the whole df (and maybe add print statement to double check)

xinruizhao commented 2 years ago

image Hi Professor, when I changed from doj_subset to doj_subset.contents, the error becomes this. I was wondering if there is anything wrong with the definition of the function.

sanhatahir commented 2 years ago

Hi @xinruizhao !

I faced the same issue; as far as I could make out it's because there's an open bracket "(" in one of the named entities and Python thinks this is a special character instead of a regular character that you need to match exactly.

This page might be useful: https://stackoverflow.com/questions/41425945/python-pandas-error-missing-unterminated-subpattern-at-position-2

Essentially, you'll want to wrap up the named entities in the re.escape function before joining them - this would wrap special characters in escape characters so Python recognizes them as regular characters

(Though again, my understanding might be extremely extremely wrong)

rebeccajohnson88 commented 2 years ago

that's 100% correct @sanhatahir! thanks for helping out and hopefully that helps answer your q @xinruizhao
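For anyone hitting the same error, a minimal sketch of the re.escape fix described above. The entity strings and sentence are made up for illustration; the point is only that an entity containing an unmatched "(" breaks the joined pattern unless each entity is escaped first.

```python
import re

# Hypothetical entity strings; the first contains an unbalanced "("
entities = ["Bureau of Prisons (BOP", "Louisiana"]
press = "The Bureau of Prisons (BOP) in Louisiana assisted."

# Joining the raw strings fails: "(" opens a group that is never closed
try:
    re.sub("|".join(entities), "", press)
    raw_error = None
except re.error as err:
    raw_error = str(err)

# Escaping each entity first makes every character match literally
pattern = "|".join(re.escape(ent) for ent in entities)
cleaned = re.sub(pattern, "", press)

print(raw_error)
print(cleaned)
```

Escaping has to happen per entity, before the join, so that the "|" separators keep their meaning as the or operator.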