iangow / se_features

Linguistic features derived from StreetEvents

Create tone measure #5

Closed iangow closed 5 years ago

iangow commented 6 years ago

This is the same issue as https://github.com/azakolyukina/bs_linguistics/issues/49, but let's work on and discuss the issue here. I will add more detailed instructions soon.

iangow commented 6 years ago

@danielacarrasco

One step of this process will be getting the list of words for each tone category. The following R code does this ... but we will probably want to get the data into a form that can be used in Python (I added the last couple of lines to this end).

category <- c("positive", "negative", "uncertainty",
                "litigious", "modal_strong", "modal_weak")

base_url <- "http://www3.nd.edu/~mcdonald/Data/Finance_Word_Lists"

url <- file.path(base_url,
                 c("LoughranMcDonald_Positive.csv",
                   "LoughranMcDonald_Negative.csv",
                   "LoughranMcDonald_Uncertainty.csv",
                   "LoughranMcDonald_Litigious.csv",
                   "LoughranMcDonald_ModalStrong.csv",
                   "LoughranMcDonald_ModalWeak.csv"))

df <- data.frame(category, url, stringsAsFactors=FALSE)

getWords <- function(url) {
    words <- read.csv(url, as.is=TRUE)
    paste(words[,1], collapse=",")
}
df$words <- unlist(lapply(df$url, getWords))

library(readr)
write_csv(df, "lm_words.csv")

iangow commented 6 years ago
import pandas as pd
df = pd.read_csv("lm_words.csv")
df
category url words
0 positive http://www3.nd.edu/~mcdonald/Data/Finance_Word... ABUNDANCE,ABUNDANT,ACCLAIMED,ACCOMPLISH,ACCOMP...
1 negative http://www3.nd.edu/~mcdonald/Data/Finance_Word... ABANDONED,ABANDONING,ABANDONMENT,ABANDONMENTS,...
2 uncertainty http://www3.nd.edu/~mcdonald/Data/Finance_Word... ABEYANCES,ALMOST,ALTERATION,ALTERATIONS,AMBIGU...
3 litigious http://www3.nd.edu/~mcdonald/Data/Finance_Word... ABROGATE,ABROGATED,ABROGATES,ABROGATING,ABROGA...
4 modal_strong http://www3.nd.edu/~mcdonald/Data/Finance_Word... BEST,CLEARLY,DEFINITELY,DEFINITIVELY,HIGHEST,L...
5 modal_weak http://www3.nd.edu/~mcdonald/Data/Finance_Word... APPEARED,APPEARING,APPEARS,CONCEIVABLE,COULD,D...

import json
import re

# Category names come from the word-list table loaded above
categories = df['category'].tolist()

def make_regex(words):
    """Compile a regex matching any word in a comma-separated list as a whole word"""
    word_list = words.lower().split(",")
    regex_text = '\\b(?:' + '|'.join(word_list) + ')\\b'
    regex = re.compile(regex_text)
    return regex

# One compiled regex per tone category
regex_dict = { cat: make_regex(df['words'][df['category'] == cat].iloc[0]) for cat in categories}

def tone_count(the_text):
    """Return a JSON string with the number of matches for each tone category in a text"""
    text = the_text.lower()
    the_dict = {category: len(re.findall(regex_dict[category], text)) for category in categories}
    return json.dumps(the_dict)

some_text = """I agree that my solution is more complex.
    But in part that’s because it’s a more complete solution. 
    One has to download and process the data from Bill MacDonald (“see his website for download” 
    implies undocumented steps in the process). 
    Then one has to organize and perhaps process the text so it can be fed to the Python function. 
    Finally, one needs to handle the output.

    I think the first step on my site could be done in Python (rather than R … my decision
    to use R is more a reflection of my comparative advantage in R than anything inherent to Python).
    And the second step could be done without PostgreSQL (especially if the first step is done in Python).
    I think a “pure Python” approach would be more elegant than what I have, at least as a code illustration."""
tone_count(some_text)
'{"positive": 1, "negative": 1, "uncertainty": 4, "litigious": 0, "modal_strong": 0, "modal_weak": 3}'

iangow commented 6 years ago

So I think the task is to make a .py file (say, tone_functions.py) that one can use by saying

from tone_functions import tone_count

I think there should be something like the following as a kind of test:

if __name__=="__main__":
    some_text = [see above]
    print(tone_count(some_text))
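For illustration, one possible layout for such a file is sketched below. It is a sketch under assumptions, not the code that was eventually written: `load_regex_dict`, the optional `regex_dict` argument, and the toy word lists in the `__main__` block are hypothetical, and the CSV is assumed to have the `category` and `words` columns written by the R code above.

```python
"""tone_functions.py: a possible layout (sketch only)."""
import csv
import json
import re

_REGEX_DICT = None  # cache, filled the first time tone_count() needs it


def make_regex(words):
    """Compile a whole-word regex from a comma-separated string of words."""
    word_list = [re.escape(w) for w in words.lower().split(",") if w]
    return re.compile(r"\b(?:" + "|".join(word_list) + r")\b")


def load_regex_dict(csv_path="lm_words.csv"):
    """Build {category: regex} from a CSV with 'category' and 'words' columns."""
    with open(csv_path, newline="") as f:
        return {row["category"]: make_regex(row["words"])
                for row in csv.DictReader(f)}


def tone_count(the_text, regex_dict=None):
    """Return a JSON string of match counts per category."""
    global _REGEX_DICT
    if regex_dict is None:
        # Assumption: lm_words.csv sits in the working directory
        if _REGEX_DICT is None:
            _REGEX_DICT = load_regex_dict()
        regex_dict = _REGEX_DICT
    text = the_text.lower()
    counts = {cat: len(rx.findall(text)) for cat, rx in regex_dict.items()}
    return json.dumps(counts)


if __name__ == "__main__":
    # Toy word lists (hypothetical) so the check runs without lm_words.csv
    toy = {"positive": make_regex("GOOD,GREAT"),
           "negative": make_regex("BAD,LOSS")}
    print(tone_count("A great quarter despite one bad loss.", toy))
```

With this layout, `from tone_functions import tone_count` still works with a single argument, and the word lists are read only on the first call rather than at import time.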
iangow commented 6 years ago

Note that "test text" was grabbed from here. I just searched "tone python" or something like that.

iangow commented 6 years ago

To give you a sense of the roadmap, take a look at:

This is a "template" of sorts that we will use here.

danielacarrasco commented 6 years ago

ProgrammingError: (psycopg2.ProgrammingError) permission denied for schema bs_linguistics

I get that error. Is it the same problem we had with linguistic_features (i.e. just a matter of access)?

iangow commented 5 years ago

@danielacarrasco

This is very similar to #9. I think we should create a folder tone with functions like what you made for #9 and then run it.

It might be easiest to start with the R code above and import the CSV file that creates. Later on, it should be easy to replace the R code with Python code and perhaps skip the creation of the CSV entirely.
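As a rough idea of what the Python replacement could look like, here is a minimal sketch using only the standard library. The helper names (`parse_words`, `fetch_text`, `get_word_lists`) are hypothetical, the URLs are copied from the R code and are assumed to be reachable, and `skip_header=True` is an assumption that mirrors `read.csv()`, which treats the first row as a header by default.

```python
import csv
import io
from urllib.request import urlopen

BASE_URL = "http://www3.nd.edu/~mcdonald/Data/Finance_Word_Lists"

# Same categories and file names as in the R code
FILES = {
    "positive": "LoughranMcDonald_Positive.csv",
    "negative": "LoughranMcDonald_Negative.csv",
    "uncertainty": "LoughranMcDonald_Uncertainty.csv",
    "litigious": "LoughranMcDonald_Litigious.csv",
    "modal_strong": "LoughranMcDonald_ModalStrong.csv",
    "modal_weak": "LoughranMcDonald_ModalWeak.csv",
}


def parse_words(csv_text, skip_header=True):
    """Return the first column of a CSV as one comma-separated string."""
    rows = [row for row in csv.reader(io.StringIO(csv_text)) if row]
    if skip_header:
        rows = rows[1:]  # drop the first row, as read.csv() does by default
    return ",".join(row[0] for row in rows)


def fetch_text(url):
    """Download a URL and decode it as text."""
    with urlopen(url) as response:
        return response.read().decode("utf-8", errors="replace")


def get_word_lists(fetch=fetch_text):
    """Return {category: comma-separated words}, with no intermediate CSV file."""
    return {cat: parse_words(fetch(f"{BASE_URL}/{name}"))
            for cat, name in FILES.items()}
```

Each value has the same comma-separated form as the `words` column of lm_words.csv, so it could be passed straight to `make_regex`. Passing a different `fetch` function makes it possible to try the parsing on local text without touching the network.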

danielacarrasco commented 5 years ago

I am running the code at the moment and have already generated ~100000 tables. I modified the code following what you did with the fog tables. It would be great if you could have a look at them and let me know if they are ready. As soon as they are, I'll start on the ML.

iangow commented 5 years ago

The output looks good.