Closed iangow closed 4 years ago
Note that I chose this passage because it was the longest one among the 3,000+ conference calls I processed (I ran the code for a little while and then stopped it, as it's not worthwhile to run the code for a day or so and then have to run it again if there are issues).
@Yvonne-Han Hmm. A bit different. Maybe it would be better to start with a shorter passage to nail down the differences.
library(readxl)
library(tidyr)
library(dplyr, warn.conflicts = FALSE)
liwc_orig <- read_excel("~/Downloads/LIWC2015 Results (my_data.txt).xlsx")
to_compare <-
  liwc_orig %>%
  select(-Filename, -Segment) %>%
  mutate_at(.vars = vars(-WC), funs(./100 * WC)) %>%
  select(-WC:-Dic) %>%
  gather(key = "category", value = "liwc_orig") %>%
  mutate(category = tolower(category))
#> Warning: funs() is soft deprecated as of dplyr 0.8.0
#> Please use a list of either functions or lambdas:
#>
#> # Simple named list:
#> list(mean = mean, median = median)
#>
#> # Auto named with `tibble::lst()`:
#> tibble::lst(mean, median)
#>
#> # Using lambdas
#> list(~ mean(., trim = .2), ~ median(., na.rm = TRUE))
#> This warning is displayed once per session.
library(dplyr, warn.conflicts = FALSE)
library(DBI)
pg <- dbConnect(RPostgres::Postgres(), bigint = "integer")
rs <- dbExecute(pg, "SET search_path TO se_features, streetevents")
liwc_2015_output <- tbl(pg, "liwc_2015_output")
sample_liwc <-
  liwc_2015_output %>%
  filter(file_name == "1802211_T", context == "pres", speaker_number == 3) %>%
  select(-1:-6) %>%
  collect() %>%
  gather(key = "category", value = "liwc_alt") %>%
  mutate(category = tolower(category))
sample_liwc %>%
  inner_join(to_compare)
#> Joining, by = "category"
#> # A tibble: 73 x 3
#>    category liwc_alt liwc_orig
#>    <chr>       <int>     <dbl>
#>  1 function     6289   6230.
#>  2 pronoun      1698   1673.
#>  3 ppron        1034   1010.
#>  4 i             130    129.
#>  5 we            648    649.
#>  6 you           129    122.
#>  7 shehe           9      9.54
#>  8 they          118    103.
#>  9 ipron         664    663.
#> 10 article       898    898.
#> # … with 63 more rows
Created on 2019-07-29 by the reprex package (v0.3.0)
@iangow I also had a look at the data and found they are different. For some categories, I think it is working as intended (especially for those with results of 0), but for some of the others there seems to be a huge gap (for example, the i category). The LIWC software has another function: it can highlight the specific words that belong to one category. Do you think it might help in determining the issues?
@Yvonne-Han Probably best to communicate here rather than using email. I have updated the comments with code output above to reflect the new approach. The numbers are much closer now, but not identical.
I think you want to choose a passage of text with fewer words for comparison.
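Part of the reason a shorter passage helps: the LIWC output is converted back to counts as pct / 100 * WC, and if the reported percentages are rounded to two decimal places (an assumption about the LIWC export, not something checked in this thread), the implied counts carry a rounding error that grows with WC. A rough Python sketch of that effect, using hypothetical passage lengths rather than the actual ones:

```python
def implied_count(pct: float, word_count: int) -> float:
    """Convert a LIWC-style percentage back to an implied word count."""
    return pct / 100 * word_count


def max_rounding_error(word_count: int, decimals: int = 2) -> float:
    """Largest error in the implied count if pct was rounded to `decimals` places.

    Assumes the percentage was rounded to the nearest 10**-decimals,
    so the true percentage is within half a unit of the reported one.
    """
    half_unit = 0.5 * 10 ** (-decimals)
    return half_unit / 100 * word_count


# Hypothetical passage lengths (not taken from the thread).
for wc in (200, 12_000):
    print(wc, max_rounding_error(wc))
```

Under that assumption the error stays below half a word for any passage shorter than 10,000 words, so rounding the implied count to the nearest integer should recover the exact count; for longer passages it may not.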
@iangow Yes, it seems much closer now, but given the differences are not proportional, I would guess that there are still some issues with the dictionary. Let me get back to you if I find anything else.
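One way to see that the differences are not proportional is to compute the relative gap per category. A minimal Python sketch (not part of the original analysis) using only the ten rows printed in the comparison above; the liwc_orig values are as rounded by the tibble display, so the gaps are approximate:

```python
# First ten rows of the comparison output: category -> (liwc_alt, liwc_orig).
# liwc_orig is copied from the printed tibble, so it is already rounded.
rows = {
    "function": (6289, 6230.0),
    "pronoun": (1698, 1673.0),
    "ppron": (1034, 1010.0),
    "i": (130, 129.0),
    "we": (648, 649.0),
    "you": (129, 122.0),
    "shehe": (9, 9.54),
    "they": (118, 103.0),
    "ipron": (664, 663.0),
    "article": (898, 898.0),
}


def relative_gap(alt: float, orig: float) -> float:
    """Signed gap of liwc_alt relative to liwc_orig."""
    return (alt - orig) / orig


# Sort categories from largest to smallest absolute relative gap.
gaps = sorted(
    ((category, relative_gap(alt, orig)) for category, (alt, orig) in rows.items()),
    key=lambda item: abs(item[1]),
    reverse=True,
)

for category, gap in gaps:
    print(f"{category:10s} {gap:+.1%}")
```

Categories at the top of this list are natural starting points when picking which words to inspect by hand.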
@Yvonne-Han Focus on utterances where there are a few words in a category (so that the comparison can detect issues), but not so many that it becomes laborious. Also, I'd suggest focusing on a handful of LIWC categories at first (eliminate generic issues before focusing on ones within categories, if any).
Here are some candidates (I assume here that I don't need to filter on last_update [sometimes more than one per file_name] or section, which is 1 for every row in over 99% of calls). You could adapt code I sent earlier to dump text from the database (let me know if you need help with this part).
library(dplyr, warn.conflicts = FALSE)
library(DBI)
pg <- dbConnect(RPostgres::Postgres(), bigint = "integer")
rs <- dbExecute(pg, "SET search_path TO se_features, streetevents")
liwc_2015_output <- tbl(pg, "liwc_2015_output")
liwc_2015_output %>%
  distinct(file_name) %>%
  count()
#> # Source:   lazy query [?? x 1]
#> # Database: postgres [igow@10.101.13.99:5432/crsp]
#>       n
#>   <int>
#> 1    74
liwc_2015_output %>%
  filter(Ppron == 13L) %>%
  select(file_name, context, speaker_number)
#> # Source:   lazy query [?? x 3]
#> # Database: postgres [igow@10.101.13.99:5432/crsp]
#>    file_name  context speaker_number
#>    <chr>      <chr>            <int>
#>  1 12140978_T pres                 2
#>  2 1545010_T  qa                  57
#>  3 1545010_T  qa                  53
#>  4 1545010_T  qa                  22
#>  5 5284715_T  qa                  43
#>  6 5284715_T  pres                 5
#>  7 5081592_T  qa                  21
#>  8 5081592_T  pres                 2
#>  9 836096_T   qa                  99
#> 10 4181352_T  qa                  30
#> # … with more rows
Created on 2019-07-29 by the reprex package (v0.3.0)
@iangow Thanks, Ian! I've got one more question about generating liwc_2015_output. Could it be that we got the dictionary right, but did something different when applying it to the calls? (For example, when you code the table, do you use regex? Could it be different from what LIWC software is using?) I cross-examined the dictionary for some of the categories that are producing different results, but it seems that the extracted dictionary is the same as the pdf.
For sure. But that's why we need to compare the numbers. I think it's harder to think of the issues in the abstract ("theory") than it is to take a more direct approach ("empirical"). We would not have picked up the "missing first word" issue using "theory".
Regarding what I do, it's all here.
Maybe take the first row above (file_name=='12140978_T', context=='pres', speaker_number==2).
Sure. Let me have a look and get back to you.
@iangow I think I might have found another problem here. Our liwc_2015_output table seems to count those words that CONTAIN the dictionary words, while the software seems to only count those EXACT words.
Example input: Okay. And then, as you look at the 132 franchise agreements signed year-to-date, it looks like about at least through this third quarter, slightly less than half were net new agreements versus renewals or conversions. The new guys coming in, what's the mix of brands they're choosing? Are they -- where kind of in the scale are they kind of economy up? And what's the mix of brands they're coming over for?
LIWC software output: 2 words for the 'they' category (in bold)
LIWC_alt output: 4 words for the 'they' category (difference in italics)
OK. I would have to tweak the code for liwc_alt (see above) to retain the matched words. But this shouldn't be too difficult. Basically, it might involve creation of a modified version of this code.
Actually, this code requires that the text be inside word boundaries (\b in regex). It may be that ' in they're triggers a word boundary in the Python regex, but not in the LIWC software.
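To illustrate the word-boundary hypothesis, here is a small Python sketch run on the example input quoted above. It is not the actual liwc_alt code, and the tokenizer is only a guess at how the LIWC software might behave (keeping apostrophes inside a token); both function names are made up for this example.

```python
import re

# Example input from the comment above.
TEXT = (
    "Okay. And then, as you look at the 132 franchise agreements signed "
    "year-to-date, it looks like about at least through this third quarter, "
    "slightly less than half were net new agreements versus renewals or "
    "conversions. The new guys coming in, what's the mix of brands they're "
    "choosing? Are they -- where kind of in the scale are they kind of "
    "economy up? And what's the mix of brands they're coming over for?"
)


def count_with_word_boundary(text: str, word: str) -> int:
    """Count matches of `word` between regex word boundaries (\\b).

    An apostrophe is a non-word character, so "they're" contains a match.
    """
    pattern = re.compile(r"\b" + re.escape(word) + r"\b", re.IGNORECASE)
    return len(pattern.findall(text))


def count_whole_tokens(text: str, word: str) -> int:
    """Count tokens equal to `word`, keeping apostrophes inside tokens.

    Assumption: this mimics software that treats "they're" as one token.
    """
    tokens = re.findall(r"[a-z0-9']+", text.lower())
    return sum(token == word for token in tokens)


print(count_with_word_boundary(TEXT, "they"))
print(count_whole_tokens(TEXT, "they"))
```

On this passage the first function returns 4 and the second returns 2, which matches the LIWC_alt and LIWC software counts reported for the 'they' category. That is consistent with the apostrophe explanation, though it does not prove that this is how the LIWC software tokenizes.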
See here for a test notebook.
' that LIWC treats as word-breaks. ' also counts as a word-break for other words (for example, what's in the sample texts ... is there an LIWC category with what in it?).
@iangow I had a look at the dictionary. For most of the words that contain a ', they also include the word without ', and that's why it only triggers issues for some of the categories (e.g., both what's and whats are in the same category: verb).
@Yvonne-Han It seems I broke my ' code in fixing that other issue. Let me try again.
See #37 for more details. Closing this one.
@Yvonne-Han The code below pulls data from the LIWC table I started to create and also the text underlying those data. I think the task here is to put the text (if you run the code below, it will be saved in my_data.txt on your computer) in the LIWC software and compare the output.
Created on 2019-07-30 by the reprex package (v0.3.0)