rebeccajohnson88 / qss20_slides_activities

Repo for slides and activities for qss 20

PS4 1.1 A #26

Closed rebeccajohnson88 closed 2 years ago

rebeccajohnson88 commented 2 years ago

1. NLP on one press release (10 points)

Focus on the following press release: id == "17-1204" about this pharmaceutical kickback prosecution: https://www.forbes.com/sites/michelatindera/2017/11/16/fentanyl-billionaire-john-kapoor-to-plead-not-guilty-in-opioid-kickback-case/?sh=21b8574d6c6c

The contents column is the one we're treating as a document. You may need to convert it from a pandas Series to a single string.

We'll call this press release pharma
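
A minimal sketch of that subsetting and conversion step. The DataFrame name `doj` and the text inside it are made up for illustration; only the `id` and `contents` column names and the id value come from the problem statement.

```python
import pandas as pd

# Hypothetical stand-in for the press release data; the real contents differ
doj = pd.DataFrame({
    "id": ["17-1203", "17-1204"],
    "contents": [
        "Some other release.",
        "Founder and owner of a pharmaceutical company arrested.",
    ],
})

# Boolean-mask subsetting returns a pandas Series, even when only one row matches
pharma_series = doj.loc[doj["id"] == "17-1204", "contents"]

# .iloc[0] pulls the single matching value out as a plain Python str
pharma = pharma_series.iloc[0]

print(type(pharma_series).__name__, type(pharma).__name__)
```

Once `pharma` is a plain str, string methods and tokenizers can be applied to it directly.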

1.1 part of speech tagging (3 points)

A. Preprocess the pharma press release to remove all punctuation and digits (you can use .isalpha() to subset the tokens)

Example output from first five tokens:

image

rebeccajohnson88 commented 2 years ago

There was a typo in the first version pushed to GitHub: the id read 17-2014. The correct id is 17-1204, so I'd re-pull the ipynb (that was the only correction, so if you've already pulled, feel free to just correct it yourself).

rebeccajohnson88 commented 2 years ago

Here's start of string for that one (and code for how to subset to that)

image
MarcoAllen1 commented 2 years ago

I'm a little confused by this because it seems to introduce \xa0 into the string. Any idea how to get rid of that?

rebeccajohnson88 commented 2 years ago

Yep, that's a Unicode non-breaking space. Later, when you're splitting into sentences or doing other cleaning, you can use str replace methods to clean it: https://stackoverflow.com/questions/10993612/how-to-remove-xa0-from-string-in-python

Don't worry too much about it for this question.
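
For when that cleaning step does come up, here is a minimal sketch of two ways to handle it. The `raw` string is a made-up snippet, not the real press release text. Note that `\xa0` generally shows up as an escape only when the string's repr is displayed (for example as notebook cell output); `print()` renders it as a space.

```python
import unicodedata

# Made-up snippet containing a non-breaking space (\xa0)
raw = "Department of Justice\xa0Office of Public Affairs"

# Option 1: replace the character directly with a regular space
cleaned_replace = raw.replace("\xa0", " ")

# Option 2: NFKC normalization maps \xa0 to a regular space
# (it also normalizes other compatibility characters, which may or may not be wanted)
cleaned_nfkc = unicodedata.normalize("NFKC", raw)

print(repr(raw))
print(repr(cleaned_replace))
```

The direct replace is the narrower fix; normalization is broader and can change other characters in the text as well.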

MarcoAllen1 commented 2 years ago

Is the example of the first five here before pre-processing? When I tokenize and subset with .isalpha(), mine are no longer uppercase and the ends of some words are cut off. Is that okay?

MarcoAllen1 commented 2 years ago

On a similar note, when I tokenize and then use .isalpha() to subset, I lose all words that have a period following them, such as "inc.". Is this the right step for something like that?

rebeccajohnson88 commented 2 years ago

The example above is after preprocessing (it's a list of tokens, so the commas are just separating tokens). I wouldn't worry too much about matching it exactly; it's fine either way.
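
To illustrate why words followed by a period can disappear, here is a rough sketch comparing the two orders of operations. The sentence is made up, and a plain whitespace split stands in for whatever tokenizer you're using, so your exact tokens will likely differ.

```python
import re

# Made-up sentence, not the actual press release text
text = "Acme Pharma Inc. paid 2 doctors."

# Whitespace split as a simple stand-in for a real tokenizer
tokens = text.split()

# Filtering tokens: any token with a digit or punctuation attached is dropped whole
filtered = [tok for tok in tokens if tok.isalpha()]

# Alternative: replace non-letters with spaces first, then split,
# so "Inc." survives as "Inc" and "doctors." as "doctors"
letters_only = re.sub(r"[^A-Za-z\s]", " ", text).split()

print(filtered)
print(letters_only)
```

Either result is a list of purely alphabetic tokens; they just differ in whether punctuation-adjacent words are kept.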

clairebetzer commented 2 years ago

Would you recommend performing .isalpha() on the list of tokens or on the original pharma string itself? When I use .isalpha() on the string, my output is an empty "", but when I try to use it on the tokens, I get an attribute error.

chris-picard commented 2 years ago

I am pretty sure .isalpha requires a string, not a list. So if we are tokenizing and creating a list then I think we won't be able to use .isalpha, unless you join the list together to form a string.

rebeccajohnson88 commented 2 years ago

yep exactly! .isalpha() should be used on an individual string as in the example code here: https://github.com/rebeccajohnson88/qss20_slides_activities/blob/fe2cd18ce64335214e3a3ce864ba7d8818e4c313/activities/s21_activities/06_textasdata_partII_topicmodeling_examplecode.ipynb

If you're running into an attribute error, it might be because the check is running on a non-str token, so you can wrap the token in str() to fix it.
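
A small sketch of that pattern, with a hypothetical token list (the integer stands in for any non-str token that might sneak in):

```python
# Hypothetical token list; 2017 is an int, so 2017.isalpha() does not exist
tokens = ["Founder", "and", "Owner", ",", 2017, "Arrested"]

# tokens.isalpha() would raise AttributeError: lists have no such method.
# Calling it on each token works, and str() guards against non-str tokens.
alpha_tokens = [str(tok) for tok in tokens if str(tok).isalpha()]

print(alpha_tokens)
```

One caveat with wrapping in str(): values like None or NaN become the strings "None" and "nan", which pass .isalpha(), so it may be worth checking for those separately.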