Yvonne-Han closed this issue 4 years ago.
@iangow I found the old dictionary (in .csv format) here (generated by the get_data.R here). May I ask which dictionary you used to create the functions in the new package? At first I assumed that you had re-run get_data.py to get the new dictionary, so I tried to compare what the code generates now against the old dictionary, but the two seem to be the same:
```r
library(readr)
library(dplyr)
library(stringr)

# Old dict adapted from the se_features/tone_measure folder
lm_words_old <- as_tibble(read.csv("tone_measure/lm_words.csv",
                                   header = TRUE, stringsAsFactors = FALSE))

# New dict generated by re-running the get_data.R code
lm_words_new <- as_tibble(read.csv("tone_measure/lm_words_new.csv",
                                   header = TRUE, stringsAsFactors = FALSE)) %>%
  mutate_if(is.character, str_replace_all, pattern = ",", replacement = ", ")

# Figure out whether the two dictionaries are different
lm_words_old %>%
  full_join(lm_words_new, by = "category") %>%
  filter(words.x != words.y)
#> # A tibble: 0 x 5
#> # … with 5 variables: category <chr>, url.x <chr>, words.x <chr>,
#> #   url.y <chr>, words.y <chr>
```
Created on 2020-05-16 by the reprex package (v0.3.0)
I think `lm_words.csv` is new. (I recommend using `read_csv` from readr, BTW; the base functions have minor issues.) So you're comparing new with new. But looking into it a bit more, there's no real "old" data here. So I conjecture that the cause is something other than word lists (though I'm at a loss to work out what that is).

It might be better to find a call with a relatively short piece of text that has non-zero values for the tone measures where we're seeing differences, and then investigate those. Once we understand what's causing the difference, we can make a call as to what to do. Thanks.
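The hunt for a minimal disagreeing text could be sketched like this (`tone_orig` and `tone_new` here are toy stand-ins for the two implementations, not the actual package functions — the real ones would be dropped in):

```python
import re

def tone_orig(text):
    # Toy stand-in for the old implementation: whole-word matches only.
    return {"positive": len(re.findall(r"\bgood\b", text.lower()))}

def tone_new(text):
    # Toy stand-in for the new implementation: raw substring count.
    return {"positive": text.lower().count("good")}

candidates = [
    "We recorded goodwill impairment.",
    "Good morning.",
    "All good.",
]

# Check short texts first so the smallest disagreement surfaces early;
# a short text makes the cause much easier to inspect by hand.
for text in sorted(candidates, key=len):
    if tone_orig(text) != tone_new(text):
        print("Smallest disagreement:", repr(text))
        break
```

In practice `candidates` would be short `speaker_text` snippets pulled from the calls where the table values differ.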
> I think `lm_words.csv` is new. (I recommend using `read_csv` from readr, BTW; the base functions have minor issues.) So you're comparing new with new. But looking into it a bit more, there's no real "old" data here. So I conjecture that the cause is something other than word lists (though I'm at a loss to work out what that is).
@iangow I think so too. I actually went back to the Loughran and McDonald website and confirmed that the most recent update of their word list was in 2018 (which was before you wrote the "old code"), so the discrepancy shouldn't be caused by differences in dictionaries.
> It might be better to find a call with a relatively short piece of text that has non-zero values for the tone measures where we're seeing differences, and then investigate those. Once we understand what's causing the difference, we can make a call as to what to do. Thanks.
No worries at all 😁 I will see what I can do.
@iangow I think I know what's causing the differences now! See the updated notebook for full details, but here is a quick summary:

The word list is the same; the differences in outputs arise when the word list is translated into `regex_dict` in your functions. (It's a seemingly trivial difference, so it took me a while to figure it out when writing test cases...)
This is what `regex_dict` looks like in your original tone_measure functions:

```python
{'litigious': re.compile(r'\b(?:abrogate| abrogated| abrogates| abrogating...}
```

while this is what `regex_dict` looks like in the new package:

```python
{'litigious': re.compile(r'\b(?:abovementioned|abrogate|abrogated|abrogates|abrogating...}
```
So the extra space before each word means the old regex misses any match at the first word of a paragraph (in our case, the first word of each `speaker_text`). For example:

Text: `"Good morning."` (the word `good` is a positive word).

```
tone_orig: {'positive': 0}
tone_new:  {'positive': 1}
```

However, this won't affect matches in the middle of a paragraph. Another example:

Text: `"Let's get started. Good morning."`

```
tone_orig: {'positive': 1}
tone_new:  {'positive': 1}
```
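The effect can be reproduced with a minimal sketch. The three-word list and the `make_regex` helper below are hypothetical (a guess at how the dictionary entries were compiled into patterns), just to illustrate the leading-space behaviour:

```python
import re

# Hypothetical three-word "positive" list. In the old dictionary, the
# comma-separated words were joined without stripping whitespace, so every
# word after the first carries a leading space.
words_old = ["able", " good", " great"]   # note the leading spaces
words_new = ["able", "good", "great"]

def make_regex(words):
    # A guess at how regex_dict entries are built from a word list.
    return re.compile(r"\b(?:" + "|".join(words) + r")\b", re.IGNORECASE)

regex_old = make_regex(words_old)
regex_new = make_regex(words_new)

# " good" can never match at the very start of a text: there is no
# preceding character to supply the literal space.
print(len(regex_old.findall("Good morning.")))           # 0
print(len(regex_new.findall("Good morning.")))           # 1

# Mid-text, a space does precede the word, so both patterns match.
print(len(regex_old.findall("We had a good quarter.")))  # 1
print(len(regex_new.findall("We had a good quarter.")))  # 1
```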
@iangow In terms of next steps, given that we are getting rid of the old `tone_measure_functions`, I guess we can just drop the original table and re-run `tone_measure` with your new functions? (I've already updated `tone_measure_add` and `tone_measure_run` to use functions from the new package.)
Sounds good. I think the "new" behaviour is what we want. So just delete and re-run whenever.
> Sounds good. I think the "new" behaviour is what we want. So just delete and re-run whenever.
Sure. I will keep track of these and mark them as completed as appropriate.
I've deleted the original `tone_measure` table in `se_features`.
```sql
crsp=> DROP TABLE se_features.tone_measure;
DROP TABLE
```
Running `tone_measure_run.py` now (2020-05-24 14:30:27 AEST) on 474,207 files.
@iangow Done! I think at this stage, the code and tables in `se_features` are all updated. The old functions have also been replaced by `ling_features` functions where applicable.

I'm closing this issue now (please reopen it if you have anything else to add).
```r
library(dplyr, warn.conflicts = FALSE)
library(DBI)
library(reprex)

pg <- dbConnect(RPostgres::Postgres())
rs <- dbExecute(pg, "SET search_path TO se_features")

tone_measure <- tbl(pg, "tone_measure")
tone_measure
#> # Source:   table<tone_measure> [?? x 11]
#> # Database: postgres [yanzih1@10.101.13.99:5432/crsp]
#>    file_name last_update         speaker_number context section positive
#>    <chr>     <dttm>                       <int> <chr>     <int>    <int>
#>  1 11118280… 2018-01-20 05:27:51             31 qa            1        0
#>  2 11118280… 2018-01-20 05:27:51             30 qa            1        0
#>  3 11118280… 2018-01-20 05:27:51             29 qa            1        2
#>  4 11118280… 2018-01-20 05:27:51             28 qa            1        3
#>  5 11118280… 2018-01-20 05:27:51             27 qa            1        1
#>  6 11118280… 2018-01-20 05:27:51             26 qa            1        1
#>  7 11118280… 2018-01-20 05:27:51             25 qa            1        0
#>  8 11118280… 2018-01-20 05:27:51             24 qa            1        0
#>  9 11118280… 2018-01-20 05:27:51             23 qa            1        0
#> 10 11118280… 2018-01-20 05:27:51             22 qa            1        0
#> # … with more rows, and 5 more variables: negative <int>,
#> #   uncertainty <int>, litigious <int>, modal_strong <int>,
#> #   modal_weak <int>

tone_measure %>%
  select(file_name) %>%
  distinct() %>%
  count()
#> # Source:   lazy query [?? x 1]
#> # Database: postgres [yanzih1@10.101.13.99:5432/crsp]
#>   n
#>   <int64>
#> 1 474207
```
Created on 2020-05-24 by the reprex package (v0.3.0)
Regarding the differences in the tone output, this might be due to differences in the dictionaries used. I think the old dictionary is embedded in a PostgreSQL table (see the old code), so it should be possible to compare the word lists. Let me know if you need help with that. If it is a difference in word lists, I think we'd drop the existing table and redo it all (and perhaps think of a way of adding some kind of version note for the word lists).
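Once both word lists are loaded (e.g. the CSV and the PostgreSQL table), comparing them category by category could look like this minimal sketch; the categories and words below are made up for illustration:

```python
# Made-up old/new dictionaries, each mapping category -> set of words.
old_dict = {"positive": {"able", "good", "great"},
            "litigious": {"abrogate", "abrogated"}}
new_dict = {"positive": {"able", "good", "great"},
            "litigious": {"abovementioned", "abrogate", "abrogated"}}

# Report words present in only one dictionary, per category; iterating
# over the union of categories catches categories missing from one side.
for category in sorted(set(old_dict) | set(new_dict)):
    only_old = old_dict.get(category, set()) - new_dict.get(category, set())
    only_new = new_dict.get(category, set()) - old_dict.get(category, set())
    if only_old or only_new:
        print(category,
              "only in old:", sorted(only_old),
              "only in new:", sorted(only_new))
```

Note that a join keyed on category (like the R `full_join` above) silently drops categories that exist in only one dictionary once `NA` comparisons are filtered out; taking set differences over the union of categories avoids that.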
_Originally posted by @iangow in https://github.com/iangow/se_features/issues/26#issuecomment-629275167_