Yvonne-Han closed this issue 4 years ago.
@iangow I found the old dictionary (in .csv format) here (generated by the get_data.R here). May I ask which dictionary you used to create the functions in the new package? At first I assumed that you had re-run get_data.py to get the new dictionary, so I tried to compare what the code generates now against the old dictionary, but the two seem to be the same:
```r
library(readr)
library(dplyr)
library(stringr)

# Old dict adapted from the se_features/tone_measure folder
lm_words_old <- as_tibble(read.csv("tone_measure/lm_words.csv",
                                   header = TRUE, stringsAsFactors = FALSE))

# New dict generated by re-running the get_data.R code
lm_words_new <- as_tibble(read.csv("tone_measure/lm_words_new.csv",
                                   header = TRUE, stringsAsFactors = FALSE)) %>%
  mutate_if(is.character, str_replace_all, pattern = ",", replacement = ", ")

# Figure out whether the two dictionaries are different
lm_words_old %>%
  full_join(lm_words_new, by = "category") %>%
  filter(words.x != words.y)
#> # A tibble: 0 x 5
#> # … with 5 variables: category <chr>, url.x <chr>, words.x <chr>,
#> #   url.y <chr>, words.y <chr>
```
Created on 2020-05-16 by the reprex package (v0.3.0)
I think `lm_words.csv` is new. (I recommend using `read_csv` from readr, BTW; the base functions have minor issues.) So you're comparing new with new. But looking into it a bit more, there's no real "old" data here. So I conjecture that the cause is something other than word lists (though I'm at a loss to work out what that is).

It might be better to find a call with a relatively short piece of text that has non-zero values for the tone measures where we're seeing differences, and then investigate those. Once we understand what's causing the difference, we can make a call as to what to do. Thanks.
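The hunt for a minimal disagreeing text could be sketched like this (`tone_orig` and `tone_new` here are toy stand-ins for the two implementations, not the actual package functions — the real ones would be dropped in):

```python
import re

def tone_orig(text):
    # Toy stand-in for the old implementation: whole-word matches only.
    return {"positive": len(re.findall(r"\bgood\b", text.lower()))}

def tone_new(text):
    # Toy stand-in for the new implementation: raw substring count.
    return {"positive": text.lower().count("good")}

candidates = [
    "We recorded goodwill impairment.",
    "Good morning.",
    "All good.",
]

# Check short texts first so the smallest disagreement surfaces early;
# a short text makes the cause much easier to inspect by hand.
for text in sorted(candidates, key=len):
    if tone_orig(text) != tone_new(text):
        print("Smallest disagreement:", repr(text))
        break
```

In practice `candidates` would be short `speaker_text` snippets pulled from the calls where the table values differ.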
> I think `lm_words.csv` is new. (I recommend using `read_csv` from readr, BTW; the base functions have minor issues.) So you're comparing new with new. But looking into it a bit more, there's no real "old" data here. So I conjecture that the cause is something other than word lists (though I'm at a loss to work out what that is).
@iangow I think so too. I actually went back to the Loughran and McDonald website and confirmed that the most recent update of their word list was in 2018 (which was before you wrote the "old code"), so the discrepancy shouldn't be caused by differences in dictionaries.
> It might be better to find a call with a relatively short piece of text that has non-zero values for the tone measures where we're seeing differences, and then investigate those. Once we understand what's causing the difference, we can make a call as to what to do. Thanks.
No worries at all 😁 I will see what I can do.
@iangow I think I know what's causing the differences now! See the updated notebook for full details, but here is a quick summary:

The word list is the same; the differences in outputs arise when the word list is translated into `regex_dict` in your functions. (It's a seemingly trivial difference, so it took me a while to figure it out when writing test cases...)
This is what `regex_dict` looks like in your original tone_measure functions:

```python
{'litigious': re.compile(r'\b(?:abrogate| abrogated| abrogates| abrogating...}
```

while this is what `regex_dict` looks like in the new package:

```python
{'litigious': re.compile(r'\b(?:abovementioned|abrogate|abrogated|abrogates|abrogating...}
```
So the extra space before each word means the old regex misses any match at the first word of a paragraph (in our case, the first word of each `speaker_text`). For example:

Text: `"Good morning."` (the word `good` is a positive word).

```
tone_orig: {'positive': 0}
tone_new:  {'positive': 1}
```

However, this won't affect matches in the middle of a paragraph. Another example:

Text: `"Let's get started. Good morning."`

```
tone_orig: {'positive': 1}
tone_new:  {'positive': 1}
```
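The effect can be reproduced with a minimal sketch. The three-word list and the `make_regex` helper below are hypothetical (a guess at how the dictionary entries were compiled into patterns), just to illustrate the leading-space behaviour:

```python
import re

# Hypothetical three-word "positive" list. In the old dictionary, the
# comma-separated words were joined without stripping whitespace, so every
# word after the first carries a leading space.
words_old = ["able", " good", " great"]   # note the leading spaces
words_new = ["able", "good", "great"]

def make_regex(words):
    # A guess at how regex_dict entries are built from a word list.
    return re.compile(r"\b(?:" + "|".join(words) + r")\b", re.IGNORECASE)

regex_old = make_regex(words_old)
regex_new = make_regex(words_new)

# " good" can never match at the very start of a text: there is no
# preceding character to supply the literal space.
print(len(regex_old.findall("Good morning.")))           # 0
print(len(regex_new.findall("Good morning.")))           # 1

# Mid-text, a space does precede the word, so both patterns match.
print(len(regex_old.findall("We had a good quarter.")))  # 1
print(len(regex_new.findall("We had a good quarter.")))  # 1
```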
@iangow In terms of next steps, given that we are getting rid of the old `tone_measure_functions`, I guess we can just drop the original table and re-run `tone_measure` with your new functions? (I've already updated `tone_measure_add` and `tone_measure_run` to use functions from the new package.)
Sounds good. I think the "new" behaviour is what we want. So just delete and re-run whenever.
> Sounds good. I think the "new" behaviour is what we want. So just delete and re-run whenever.
Sure. I will keep track of these and mark them as completed as appropriate.
I've deleted the original `tone_measure` table in `se_features`.
```sql
crsp=> DROP TABLE se_features.tone_measure;
DROP TABLE
```
Running `tone_measure_run.py` now (2020-05-24 14:30:27 AEST) on 474,207 files.
@iangow Done! I think at this stage, the code and tables in `se_features` are all updated. The old functions have also been replaced by `ling_features` functions where applicable.

I'm closing this issue now (please reopen it if you have anything else to add).
```r
library(dplyr, warn.conflicts = FALSE)
library(DBI)
library(reprex)

pg <- dbConnect(RPostgres::Postgres())
rs <- dbExecute(pg, "SET search_path TO se_features")

tone_measure <- tbl(pg, "tone_measure")
tone_measure
#> # Source:   table<tone_measure> [?? x 11]
#> # Database: postgres [yanzih1@10.101.13.99:5432/crsp]
#>    file_name last_update         speaker_number context section positive
#>    <chr>     <dttm>                       <int> <chr>     <int>    <int>
#>  1 11118280… 2018-01-20 05:27:51             31 qa            1        0
#>  2 11118280… 2018-01-20 05:27:51             30 qa            1        0
#>  3 11118280… 2018-01-20 05:27:51             29 qa            1        2
#>  4 11118280… 2018-01-20 05:27:51             28 qa            1        3
#>  5 11118280… 2018-01-20 05:27:51             27 qa            1        1
#>  6 11118280… 2018-01-20 05:27:51             26 qa            1        1
#>  7 11118280… 2018-01-20 05:27:51             25 qa            1        0
#>  8 11118280… 2018-01-20 05:27:51             24 qa            1        0
#>  9 11118280… 2018-01-20 05:27:51             23 qa            1        0
#> 10 11118280… 2018-01-20 05:27:51             22 qa            1        0
#> # … with more rows, and 5 more variables: negative <int>,
#> #   uncertainty <int>, litigious <int>, modal_strong <int>,
#> #   modal_weak <int>

tone_measure %>%
  select(file_name) %>%
  distinct() %>%
  count()
#> # Source:   lazy query [?? x 1]
#> # Database: postgres [yanzih1@10.101.13.99:5432/crsp]
#>   n
#>   <int64>
#> 1 474207
```
Created on 2020-05-24 by the reprex package (v0.3.0)
Regarding the differences in the tone output, this might be due to differences in the dictionaries used. I think the old dictionary is embedded in a PostgreSQL table (see the old code), so it should be possible to compare the word lists. Let me know if you need help with that. If it is a difference in word lists, I think we'd drop the existing table and redo it all (and perhaps think of a way of adding some kind of version note for the word lists).
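Once both word lists are loaded (e.g. the CSV and the PostgreSQL table), comparing them category by category could look like this minimal sketch; the categories and words below are made up for illustration:

```python
# Made-up old/new dictionaries, each mapping category -> set of words.
old_dict = {"positive": {"able", "good", "great"},
            "litigious": {"abrogate", "abrogated"}}
new_dict = {"positive": {"able", "good", "great"},
            "litigious": {"abovementioned", "abrogate", "abrogated"}}

# Report words present in only one dictionary, per category; iterating
# over the union of categories catches categories missing from one side.
for category in sorted(set(old_dict) | set(new_dict)):
    only_old = old_dict.get(category, set()) - new_dict.get(category, set())
    only_new = new_dict.get(category, set()) - old_dict.get(category, set())
    if only_old or only_new:
        print(category,
              "only in old:", sorted(only_old),
              "only in new:", sorted(only_new))
```

Note that a join keyed on category (like the R `full_join` above) silently drops categories that exist in only one dictionary once `NA` comparisons are filtered out; taking set differences over the union of categories avoids that.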
_Originally posted by @iangow in https://github.com/iangow/se_features/issues/26#issuecomment-629275167_