Open sonali-sr opened 2 years ago
Hi Professor @rebeccajohnson88, so I ran the following code:
```python
# storing the list of topics to be matched
topics_match = ['Civil Rights', 'Hate Crimes', 'Project Safe Childhood']
pattern = "|".join(topics_match)

# filtering based on the pattern
doj[doj.topics_clean.str.contains(pattern, regex=True)].shape
```

and got `(856, 6)` as my output, instead of having 717 rows.

Am I misunderstanding the instructions? Are we to include rows with more than one matched topic (e.g. `Hate Crimes; Project Safe Childhood`), or only include rows with exactly one matched topic (i.e. any one of the three topics)?
Either is fine, but the 717 rows corresponds to this approach: only including rows with exactly one matched topic (i.e. any one of the three topics) - so the `isin` method rather than `str.contains`.

We filter to press releases with a single topical category for easier interpretation, but we're fine with that or with you keeping the multiple-topic ones in.
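To illustrate the difference between the two approaches, here is a minimal sketch on a tiny hypothetical stand-in for the `doj` data frame (the column names follow the thread; the rows are made up):

```python
import pandas as pd

# toy stand-in for the doj press-release data (assumed column names)
doj = pd.DataFrame({
    "topics_clean": ["Civil Rights",
                     "Hate Crimes; Project Safe Childhood",
                     "Hate Crimes",
                     "Tax"],
    "contents": ["a", "b", "c", "d"],
})

topics_match = ["Civil Rights", "Hate Crimes", "Project Safe Childhood"]

# str.contains keeps any row whose topic string *contains* one of the topics,
# so multi-topic rows like "Hate Crimes; Project Safe Childhood" count too
contains_subset = doj[doj.topics_clean.str.contains("|".join(topics_match), regex=True)]

# isin keeps only rows whose topic string is *exactly* one of the three,
# which drops the multi-topic rows (this is the 717-row approach)
isin_subset = doj[doj.topics_clean.isin(topics_match)]

print(contains_subset.shape)  # (3, 2): includes the multi-topic row
print(isin_subset.shape)      # (2, 2): only single-topic rows
```

The same logic explains the 856 vs 717 discrepancy: the extra rows are press releases tagged with a matched topic plus something else.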
Hi Professor! When you ask: "Apply that function to each of the press releases in doj_subset" -
What kind of input are we expecting? Would it be 717 lines of content and their corresponding dictionaries?
The `contents` column, and yep, all 717 strings.

From student:
I would like to know how to apply the function to every entry of `contents` in doj_subset. The idea of my function is to conduct sentiment analysis for every row/content by using the index (I don't know if I'm calling this correctly). But I don't know how to calculate the scores for every row efficiently. I am thinking about something like `for...in...if...`; however, I am not sure exactly how to use it for this question.
Response:
I would check out the code in this activity--- https://github.com/rebeccajohnson88/PPOL564_slides_activities/blob/main/activities/fall_22/solutions/08_textasdata_partI_textmining_solutions.ipynb
it's doing something other than sentiment scoring but the general setup should work
```python
## define function that takes in one airbnb string
def get_gpe(one_string):
    tagged_str = nlp(one_string)  # tag the string

    ## this line (1) iterates over entities (for one_tok in tagged_str.ents),
    ## (2) checks if the label is GPE (place), and (3) if so, returns the
    ## original entity
    all_gpe = [one_tok.text for one_tok in tagged_str.ents
               if one_tok.label_ == "GPE"]
    return all_gpe

## executing it over our examples --- see that
## upper west side is extracted from the first; two are blank (missed the
## central manhattan, possibly due to the /), then it gets manhattan and
## west village
all_gpe = [get_gpe(one_string) for one_string in ab_name_examples]
all_gpe
```
i think it's best for (1) the function to take in a single press release rather than the entire dataframe, and (2) you can then execute the function by iterating over `doj_subset.contents` using a list comprehension
- feel free to follow up w/ other questions!
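The pattern described above can be sketched as follows; `doj_subset` here is a tiny hypothetical stand-in, and `word_count` is a placeholder for the actual sentiment-scoring function:

```python
import pandas as pd

# tiny stand-in for the real 717-row doj_subset data frame
doj_subset = pd.DataFrame({"contents": ["First press release text.",
                                        "Second one here."]})

# placeholder "analysis": the real function would remove entities and
# score sentiment, but the calling pattern is identical
def word_count(one_string):
    return len(one_string.split())

# the function handles ONE string; the list comprehension maps it over
# every press release in the contents column
all_results = [word_count(one_string) for one_string in doj_subset.contents]
print(all_results)  # [4, 3]
```

With the real data this produces one result per press release (717 in total), which can then be stored back in the data frame or combined as needed.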
Hello! To run the `re.sub` function, I merged all the entities into a string with the or operator:

```python
spacy_press = nlp(press)
spacy_list = [str(entity) for entity in spacy_press.ents]
pattern = "r" + '"' + "|".join(spacy_list) + '"'
```

Then I realized that `re.sub` removes only "Louisiana", not other long proper nouns, such as "Louisiana State Penitentiary" or "Louisiana Corey Amundson".

It also raises errors when the named entities contain `(` or other characters that have special meaning in regex. Currently, I am just making my function skip the dropping step for those press releases... which is something I'm not sure I'm supposed to do.

How did you solve this issue?
I'm also stuck on the re.sub part of the function. I'm testing out the process for one test string before I create the function itself. This is what I tried, but it returns an empty string. Any tips?
tagging for @sonali-sr and @YfLiu-61F - the quick advice is:
- `spacy_test.ents` is a special type of tagged object; calling `str(spacy_test.ents)` does not produce what you want it to -- it does produce a string, but it's a weird string that mashes together the entire list of entities
- instead, and relating to the slack discussion b/t ethan and keya, for each entity in `spacy_test.ents`, you'll likely want to use a list comprehension (or something like it) to iterate over the distinct entities and then, for each entity, use the `ent.text` attribute to extract the string form of that entity
UPDATE - RESOLVED OVER SLACK

This is helpful, thank you! I was able to get my function to run, but it's not returning what I'd hoped and results in an error. Any idea where I might be going wrong?
Hi, my code for defining and executing the function looks like this. The results are very unexpected. I was wondering if there is anything wrong with my code. Thank you in advance!
@xinruizhao i think the issue is that you're feeding the function the entire `doj_subset` data frame rather than `doj_subset.contents` in your list comprehension - so you probably want to edit it so that you're iterating over `doj_subset.contents` rather than the whole df (and maybe add a print statement to double check)
Hi Professor, when I changed from `doj_subset` to `doj_subset.contents`, the error becomes this. I was wondering if there is anything wrong with the definition of the function.
Hi @xinruizhao !
I faced the same issue; as far as I could make out, it's because there's an open bracket "(" in one of the named entities, and Python thinks this is a special character instead of a regular character that you need to match exactly.
This page might be useful: https://stackoverflow.com/questions/41425945/python-pandas-error-missing-unterminated-subpattern-at-position-2
Essentially, you'll want to wrap the named entities in the `re.escape` function before joining them - this wraps special characters in escape characters so Python recognizes them as regular characters.
(Though again, my understanding might be extremely wrong)
that's 100% correct @sanhatahir! thanks for helping out and hopefully that helps answer your q @xinruizhao
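A short sketch of the `re.escape` fix, using a made-up entity list; it also addresses the other symptom reported above (only the short "Louisiana" being removed) by sorting entities longest-first so that longer entities win the regex alternation:

```python
import re

# hypothetical entity list: one entity contains the regex metacharacter "(",
# and "Louisiana" is a prefix of a longer entity
spacy_list = ["Louisiana", "Louisiana State Penitentiary", "FBI (New Orleans)"]

# joining the raw strings would break: "(" opens an unterminated group and
# re.sub raises "missing ), unterminated subpattern". re.escape makes each
# metacharacter match literally; sorting longest-first makes the alternation
# try "Louisiana State Penitentiary" before plain "Louisiana"
pattern = "|".join(re.escape(ent)
                   for ent in sorted(spacy_list, key=len, reverse=True))

press = "Louisiana State Penitentiary staff met FBI (New Orleans) in Louisiana."
cleaned = re.sub(pattern, "", press)
print(cleaned)  # " staff met  in ."
```

Without the sort, the pattern `Louisiana|Louisiana State Penitentiary|...` matches the shorter alternative first and leaves "State Penitentiary" behind, which is exactly the behavior described earlier in the thread.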
1.3 sentiment analysis (10 points)
Sentiment analysis section of this script: https://github.com/rebeccajohnson88/PPOL564_slides_activities/blob/main/activities/fall_22/solutions/08_textasdata_partI_textmining_solutions.ipynb
A. Subset the press releases to those labeled with one of three topics via `topics_clean`: Civil Rights, Hate Crimes, and Project Safe Childhood. We'll call this `doj_subset` going forward and it should have 717 rows.

B. Write a function that takes one press release string as an input and:
- removes the named entities from the press release (using `re.sub` with an or condition)
- scores the sentiment of the remaining text using `SentimentIntensityAnalyzer` and `polarity_scores`

C. Apply that function to each of the press releases in `doj_subset`.
Hints:
I used a function + list comprehension to execute it, and it takes about 30 seconds on my local machine; if it's taking a very long time, you may want to check your code for inefficiencies. If you can't fix those, for partial credit on this part / full credit on the remainder, you can take a small random sample of the 717