rebeccajohnson88 / PPOL564_slides_activities

Repo for Georgetown McCourt's School of Public Policy's Data Science I (PPOL 564)
Creative Commons Zero v1.0 Universal

1.2 D #52

Open sonali-sr opened 1 year ago

sonali-sr commented 1 year ago

1.2 named entity recognition (3 points)

D. Parse the pharma string at the sentence level. Note that this involves more than just splitting on each period (.); for full credit, add at least one additional delimiter that marks the end of the sentence.

Then, using those sentences, pull and print the original sentences from the press releases where those year lengths are mentioned. Describe in your own words (1 sentence) what length of sentence (prison) and probation (supervised release) the CEO may be facing if convicted after this indictment (if there are multiple lengths mentioned describe the maximum).

Hint: You may want to use re.search or re.findall

jswsean commented 1 year ago

Would it be fine to use libraries instead of using regex?

rebeccajohnson88 commented 1 year ago

we'd prefer you use regex for this one for more practice with patterns - but it's 100% fine to not capture all delimiters; you can just add 1-2 more beyond . based on what you see in the string

rebeccajohnson88 commented 1 year ago

Question from student: I'm working on pset5 question 1.2D and I'm stuck on the part where we need to print the sentences with the year tokens. This is what I have and what I've tried.

https://files.slack.com/files-pri/T03TPUKC189-F048SS94QKX/screen_shot_2022-10-31_at_3.54.16_pm.png

Our response:

you can try an easier pattern --- for instance \.|\." would split on both . and ." --- you shouldn't necessarily need to use lookarounds (don't worry if some sentences are misparsed)
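
A minimal sketch of the split suggested above, assuming a made-up sample string (the real pharma text is not shown in this thread). It also compares the order of the two alternatives: Python's re tries alternatives left to right, so with \. listed first the ." branch is never used and the closing quote is left at the start of the next piece. Either version is fine under the "don't worry if some sentences are misparsed" guidance.

```python
import re

# Hypothetical stand-in for the press release text (not the real pharma string).
sample = (
    'The CEO was indicted today. He said, "I am innocent." '
    "The charge carries up to 20 years in prison."
)

# Pattern as suggested in the thread: split on . or ."
as_suggested = re.split(r'\.|\."', sample)

# Same idea with the longer alternative first, so ." is consumed as one delimiter.
longer_first = re.split(r'\."|\.', sample)

# Trim whitespace and drop the empty string left after the final period.
as_suggested_sents = [s.strip() for s in as_suggested if s.strip()]
longer_first_sents = [s.strip() for s in longer_first if s.strip()]

for sent in longer_first_sents:
    print(sent)
```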

Mag-Sul commented 1 year ago

Thanks! Apologies - my issue isn't with the regex code itself (at least I don't think) since it seems to work (or at least good enough)

But for the next part - printing just the sentences containing the year tokens. I'm not sure how to pull and print the original sentences from the press releases where those year lengths are mentioned. I commented out my attempts.

rebeccajohnson88 commented 1 year ago

got it! i'd think about:

  • taking the list you created in the previous step
  • pasting together using the | syntax
  • using re.search or another regex to find matches

so the list comprehension where you iterate over sentences is the correct general approach, but rather than "=" or "in" you're checking for matches to words from the previous step

Mag-Sul commented 1 year ago

Ahh this is helpful - thanks! I was thinking I only needed regex in the first part but I see I need it for the second part too.
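
The steps discussed in this exchange can be sketched roughly as follows. The sentences and year tokens here are invented for illustration; in the problem set they would come from the earlier split and entity-extraction steps. re.escape is an extra precaution that the thread does not mention.

```python
import re

# Hypothetical sentences (in the pset these come from splitting the pharma string).
sentences = [
    "The CEO was indicted on Tuesday",
    "The charge carries a maximum of 20 years in prison",
    "He also faces three years of supervised release",
]

# Hypothetical year tokens (in the pset these come from the previous step).
year_tokens = ["20 years", "three years"]

# Paste the tokens together with | so that any one of them can match.
pattern = "|".join(re.escape(tok) for tok in year_tokens)

# Keep only the sentences where re.search finds one of the tokens.
matches = [s for s in sentences if re.search(pattern, s)]

for s in matches:
    print(s)
```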

Oliz888 commented 1 year ago

Hi, I am wondering what is meant by "this involves more than just splitting on each ."

  • Does this mean we should also include ." ?

rebeccajohnson88 commented 1 year ago

yep, we're flexible on what exactly you split on, but the goal is to feed split a pattern that deals w/ the fact that sentences sometimes end in things like ." rather than just .

Oliz888 commented 1 year ago

Okay, got it. Thank you professor!

bhollan commented 1 year ago

Can I ask a silly question? Why aren't we using the spacy-provided sentences? This just seems like reinventing the wheel.

spacy_doc = nlp(pharma)
spacy_doc.sents

I wrote a list comprehension within a list comprehension to use the extracted entities and then a set comprehension to make it unique.

LL = [[sent for sent in spacy_doc.sents if str(e) in str(sent)] for e in CEO]
{s for l in LL for s in l}

Would this get full credit?

rebeccajohnson88 commented 1 year ago

not a silly question!

you're right that the sents attribute works, as does sent_tokenize() in nltk

the pedagogical goal with asking you guys to use .split() for this question was to (1) underscore that for functions like this, we want to read documentation/not just use the default values for parameters (which in this case splits on whitespace) and (2) to reiterate the regex material

so even tho the more automatic ways work better, we do ask that you use split + some pattern for this for those pedagogical reasons
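
As a small illustration of point (1), here is how the default behaves compared with passing an explicit separator. The sample text is made up and is not the pharma string.

```python
# Hypothetical sample text, used only to show the default vs. an explicit separator.
text = "The CEO was indicted. He faces 20 years."

# With no arguments, str.split() splits on runs of whitespace, giving words.
words = text.split()

# Passing "." splits on periods instead, giving rough sentence pieces
# (note the leading space and the trailing empty string).
pieces = text.split(".")

print(words)
print(pieces)
```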