Closed iangow closed 5 years ago
@danielacarrasco
One step of this process will be getting the list of words for each tone category. The following R code does this ... but we will probably want to get the data into a form that can be used in Python (I added the last couple of lines to this end).
category <- c("positive", "negative", "uncertainty",
              "litigious", "modal_strong", "modal_weak")

base_url <- "http://www3.nd.edu/~mcdonald/Data/Finance_Word_Lists"

url <- file.path(base_url,
                 c("LoughranMcDonald_Positive.csv",
                   "LoughranMcDonald_Negative.csv",
                   "LoughranMcDonald_Uncertainty.csv",
                   "LoughranMcDonald_Litigious.csv",
                   "LoughranMcDonald_ModalStrong.csv",
                   "LoughranMcDonald_ModalWeak.csv"))

df <- data.frame(category, url, stringsAsFactors = FALSE)

# Read each one-column word list and collapse it into a comma-separated string
getWords <- function(url) {
    words <- read.csv(url, as.is = TRUE)
    paste(words[, 1], collapse = ",")
}

df$words <- unlist(lapply(df$url, getWords))

library(readr)
write_csv(df, path = "lm_words.csv")
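Since the goal is a pure-Python pipeline anyway, here is a sketch of the same step in pandas. The file names and base URL are taken from the R code above; the Notre Dame site may have moved them since, so treat the URLs as assumptions (the network calls are left commented out).

```python
import pandas as pd

# File names and base URL as in the R code above (may be stale).
BASE_URL = "http://www3.nd.edu/~mcdonald/Data/Finance_Word_Lists"
FILES = {
    "positive": "LoughranMcDonald_Positive.csv",
    "negative": "LoughranMcDonald_Negative.csv",
    "uncertainty": "LoughranMcDonald_Uncertainty.csv",
    "litigious": "LoughranMcDonald_Litigious.csv",
    "modal_strong": "LoughranMcDonald_ModalStrong.csv",
    "modal_weak": "LoughranMcDonald_ModalWeak.csv",
}

def get_words(url):
    """Read a one-column word list and join it into a comma-separated string."""
    words = pd.read_csv(url)
    return ",".join(words.iloc[:, 0].astype(str))

df = pd.DataFrame({"category": list(FILES),
                   "url": [BASE_URL + "/" + f for f in FILES.values()]})

# Uncomment to fetch the lists and write the same CSV the R code produces:
# df["words"] = df["url"].map(get_words)
# df.to_csv("lm_words.csv", index=False)
```

With the commented lines enabled, this produces the same `lm_words.csv` that the R code writes, which the next step reads back in.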
import pandas as pd
df = pd.read_csv("lm_words.csv")
df
 | category | url | words
---|---|---|---
0 | positive | http://www3.nd.edu/~mcdonald/Data/Finance_Word... | ABUNDANCE,ABUNDANT,ACCLAIMED,ACCOMPLISH,ACCOMP... |
1 | negative | http://www3.nd.edu/~mcdonald/Data/Finance_Word... | ABANDONED,ABANDONING,ABANDONMENT,ABANDONMENTS,... |
2 | uncertainty | http://www3.nd.edu/~mcdonald/Data/Finance_Word... | ABEYANCES,ALMOST,ALTERATION,ALTERATIONS,AMBIGU... |
3 | litigious | http://www3.nd.edu/~mcdonald/Data/Finance_Word... | ABROGATE,ABROGATED,ABROGATES,ABROGATING,ABROGA... |
4 | modal_strong | http://www3.nd.edu/~mcdonald/Data/Finance_Word... | BEST,CLEARLY,DEFINITELY,DEFINITIVELY,HIGHEST,L... |
5 | modal_weak | http://www3.nd.edu/~mcdonald/Data/Finance_Word... | APPEARED,APPEARING,APPEARS,CONCEIVABLE,COULD,D... |
import re

def make_regex(words):
    """Compile a whole-word regex matching any word in a comma-separated list."""
    word_list = words.lower().split(",")
    regex_text = '\\b(?:' + '|'.join(word_list) + ')\\b'
    return re.compile(regex_text)

categories = df['category'].tolist()
regex_dict = {cat: make_regex(df['words'][df['category'] == cat].iloc[0])
              for cat in categories}
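For intuition, the `\b(?:...)\b` pattern matches whole words only, so a word embedded inside a longer word is not counted (the word list below is a made-up subset for illustration):

```python
import re

# Whole-word matching: "best" matches on its own but not inside "asbestos".
pattern = re.compile(r'\b(?:best|clearly|definitely)\b')
matches = pattern.findall("clearly the best choice, not asbestos")
```

Here `matches` contains only `"clearly"` and `"best"`; the `best` inside `asbestos` is skipped because it is not flanked by word boundaries.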
import json

def tone_count(the_text):
    """Return the number of matches per tone category in a text, as JSON."""
    text = the_text.lower()
    the_dict = {category: len(regex_dict[category].findall(text))
                for category in categories}
    return json.dumps(the_dict)
some_text = """I agree that my solution is more complex.
But in part that’s because it’s a more complete solution.
One has to download and process the data from Bill MacDonald (“see his website for download”
implies undocumented steps in the process).
Then one has to organize and perhaps process the text so it can be fed to the Python function.
Finally, one needs to handle the output.
I think the first step on my site could be done in Python (rather than R … my decision
to use R is more a reflection of my comparative advantage in R than anything inherent to Python).
And the second step could be done without PostgreSQL (especially if the first step is done in Python).
I think a “pure Python” approach would be more elegant than what I have, at least as a code illustration."""
tone_count(some_text)
'{"positive": 1, "negative": 1, "uncertainty": 4, "litigious": 0, "modal_strong": 0, "modal_weak": 3}'
So I think the task is to make a .py file (say, tone_functions.py) that one can use by saying

from tone_functions import tone_count
I think there should be something like the following as a kind of test:
if __name__ == "__main__":
    some_text = [see above]
    print(tone_count(some_text))
Note that "test text" was grabbed from here. I just searched "tone python" or something like that.
To give you a sense of the roadmap, take a look at word_count_functions.py, which creates functions imported by ...word_count_add.py, which is called by ...word_count_run.py, the "main" program for getting word counts for data in StreetEvents (streetevents.speaker_data). This is a "template" of sorts that we will use here.
ProgrammingError: (psycopg2.ProgrammingError) permission denied for schema bs_linguistics
I get that error. Is it the same problem we had with linguistic_features (i.e. just a matter of access)?
@danielacarrasco
This is very similar to #9. I think we should create a folder tone with functions like what you made for #9 and then run it.
It might be easiest to start with the R code above and import the CSV file that creates. Later on, it should be easy to replace the R code with Python code and perhaps skip the creation of the CSV entirely.
I am running the code at the moment and have already generated ~100,000 tables. I modified the code following what you did with the fog tables. It would be great if you could have a look at them and let me know whether they are ready. As soon as they are, I'll start on the ML.
The output looks good.
This is the same issue as https://github.com/azakolyukina/bs_linguistics/issues/49, but let's work on and discuss the issue here. I will add more detailed instructions soon.