Compare LIWC counts

iangow commented 5 years ago

@Yvonne-Han The code below pulls data from the LIWC table I started to create and also the text underlying those data. I think the task here is to put the text (if you run the code below, it will be saved in my_data.txt on your computer) in the LIWC software and compare the output.

library(dplyr, warn.conflicts = FALSE)
library(DBI)

pg <- dbConnect(RPostgres::Postgres(), bigint = "integer")

rs <- dbExecute(pg, "SET search_path TO se_features, streetevents")

speaker_data <- tbl(pg, "speaker_data")
liwc_2015_output <- tbl(pg, "liwc_2015_output")
speaker_data %>% 
    filter(file_name == "1802211_T", context == "pres", speaker_number == 3) %>% 
    select(speaker_text) %>%
    pull() %>% 
    cat(file = "my_data.txt")

sample_liwc <- 
    liwc_2015_output %>% 
    filter(file_name == "1802211_T", context == "pres", speaker_number == 3) %>%
    collect()

sample_liwc %>%
    select(-1:-6) %>%
    t()
#>              [,1]
#> Function     6289
#> Pronoun      1698
#> Ppron        1034
#> I             130
#> We            648
#> You           129
#> SheHe           9
#> They          118
#> Ipron         664
#> Article       898
#> Prep         1607
#> Auxverb      1191
#> Power         421
#> Adverb        602
#> Conj          730
#> Negate        138
#> Verb         2009
#> Adj           555
#> Compare       304
#> Interrog      157
#> Number        147
#> Quant         326
#> Affect        720
#> Posemo        528
#> Negemo        171
#> Anx            60
#> Anger          14
#> Sad            35
#> Social       1242
#> Family          0
#> Friend          4
#> Female          0
#> Male           11
#> CogProc      1360
#> Insight       243
#> Cause         266
#> Discrep       117
#> Tentat        374
#> Certain       152
#> Differ        326
#> Percept       117
#> See            72
#> Hear           21
#> Feel           10
#> Bio            51
#> Body            6
#> Health         36
#> Sexual          2
#> Ingest         11
#> Drives       1691
#> Affiliation   745
#> Achieve       342
#> Reward        246
#> Risk          119
#> FocusPast     476
#> FocusPresent 1349
#> FocusFuture   150
#> Relativ      1737
#> Motion        242
#> Space        1009
#> Time          507
#> Work          860
#> Leisure        30
#> Home           42
#> Money         612
#> Relig           3
#> Death           3
#> Informal       77
#> Swear           0
#> Netspeak       60
#> Assent          3
#> Nonflu         14
#> Filler          1

Created on 2019-07-30 by the reprex package (v0.3.0)

iangow commented 5 years ago

Note I chose the passage as it was the longest one of the 3,000+ conference calls I processed (I just ran the code for a little while, then stopped it, as it's not worthwhile to run the code for a day or so and then have to run it again if there are issues).

iangow commented 5 years ago

@Yvonne-Han Hmm. A bit different. Maybe it would be better to start with a shorter passage to nail down the differences.

library(readxl)
library(tidyr)
library(dplyr, warn.conflicts = FALSE)

liwc_orig <- read_excel("~/Downloads/LIWC2015 Results (my_data.txt).xlsx")
to_compare <-
    liwc_orig %>%
    select(-Filename, -Segment) %>%
    mutate_at(.vars = vars(-WC), funs(./100 * WC)) %>%
    select(-WC:-Dic) %>%
    gather(key = "category", value = "liwc_orig") %>%
    mutate(category = tolower(category))
#> Warning: funs() is soft deprecated as of dplyr 0.8.0
#> Please use a list of either functions or lambdas: 
#> 
#>   # Simple named list: 
#>   list(mean = mean, median = median)
#> 
#>   # Auto named with `tibble::lst()`: 
#>   tibble::lst(mean, median)
#> 
#>   # Using lambdas
#>   list(~ mean(., trim = .2), ~ median(., na.rm = TRUE))
#> This warning is displayed once per session.

library(dplyr, warn.conflicts = FALSE)
library(DBI)

pg <- dbConnect(RPostgres::Postgres(), bigint = "integer")

rs <- dbExecute(pg, "SET search_path TO se_features, streetevents")

liwc_2015_output <- tbl(pg, "liwc_2015_output")

sample_liwc <- 
    liwc_2015_output %>% 
    filter(file_name == "1802211_T", context == "pres", speaker_number == 3) %>%
    select(-1:-6) %>%
    collect() %>%
    gather(key = "category", value = "liwc_alt") %>%
    mutate(category = tolower(category))

sample_liwc %>%
    inner_join(to_compare)
#> Joining, by = "category"
#> # A tibble: 73 x 3
#>    category liwc_alt liwc_orig
#>    <chr>       <int>     <dbl>
#>  1 function     6289   6230.  
#>  2 pronoun      1698   1673.  
#>  3 ppron        1034   1010.  
#>  4 i             130    129.  
#>  5 we            648    649.  
#>  6 you           129    122.  
#>  7 shehe           9      9.54
#>  8 they          118    103.  
#>  9 ipron         664    663.  
#> 10 article       898    898.  
#> # … with 63 more rows

Created on 2019-07-29 by the reprex package (v0.3.0)

Yvonne-Han commented 5 years ago

@iangow I also had a look at the data and found that they differ. For some categories, I think it is working as intended (especially those with a result of 0), but for some of the others there seems to be a huge gap (for example, the i category). The LIWC software has another function: it can highlight the specific words that belong to a category. Do you think that might help in pinpointing the issues?

iangow commented 5 years ago

@Yvonne-Han Probably best to communicate here rather than using email. I have updated the comments with code output above to reflect the new approach. The numbers are much closer now, but not identical.

iangow commented 5 years ago

I think you want to choose a passage of text with fewer words for comparison.

Yvonne-Han commented 5 years ago

@iangow Yes, it seems much closer now, but given the differences are not proportional, I would guess that there are still some issues with the dictionary. Let me get back to you if I find anything else.
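The point about the differences not being proportional can be checked with a quick ratio per category. The following is a minimal sketch in Python rather than the R used elsewhere in this thread. The counts are copied from the ten rows of the comparison printed above, and the liwc_orig values are as displayed (rounded by the tibble printer), so the ratios are approximate. The helper name `relative_gaps` is hypothetical.

```python
# Counts copied from the first ten rows of the comparison shown earlier:
# (category, liwc_alt, liwc_orig). liwc_orig is as displayed, i.e. rounded.
rows = [
    ("function", 6289, 6230.0),
    ("pronoun", 1698, 1673.0),
    ("ppron", 1034, 1010.0),
    ("i", 130, 129.0),
    ("we", 648, 649.0),
    ("you", 129, 122.0),
    ("shehe", 9, 9.54),
    ("they", 118, 103.0),
    ("ipron", 664, 663.0),
    ("article", 898, 898.0),
]


def relative_gaps(rows):
    """Return {category: liwc_alt / liwc_orig}, skipping zero denominators."""
    return {cat: alt / orig for cat, alt, orig in rows if orig != 0}


gaps = relative_gaps(rows)

# List categories from the largest to the smallest relative discrepancy.
for cat, ratio in sorted(gaps.items(), key=lambda kv: abs(kv[1] - 1), reverse=True):
    print(f"{cat:10s} {ratio:.3f}")
```

If the two approaches differed only by a common scaling factor, every ratio would be roughly the same. Here the ratios range from about 1.00 (article) to about 1.15 (they), which points at category-specific causes.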

iangow commented 5 years ago

@Yvonne-Han Focus on utterances where there are a few words in a category (so that the comparison can detect issues), but not so many that it becomes laborious. Also, I'd suggest focusing on a handful of LIWC categories at first (eliminate generic issues before focusing on ones within categories, if any).

Here are some candidates (I assume here that I don't need to filter on last_update [sometimes more than one per file_name] or section, which is 1 for every row in over 99% of calls). You could adapt code I sent earlier to dump text from the database (let me know if you need help with this part).


library(dplyr, warn.conflicts = FALSE)
library(DBI)

pg <- dbConnect(RPostgres::Postgres(), bigint = "integer")

rs <- dbExecute(pg, "SET search_path TO se_features, streetevents")

liwc_2015_output <- tbl(pg, "liwc_2015_output")

liwc_2015_output %>%
    distinct(file_name) %>% 
    count()
#> # Source:   lazy query [?? x 1]
#> # Database: postgres [igow@10.101.13.99:5432/crsp]
#>       n
#>   <int>
#> 1    74

liwc_2015_output %>%
    filter(Ppron == 13L) %>%
    select(file_name, context, speaker_number)
#> # Source:   lazy query [?? x 3]
#> # Database: postgres [igow@10.101.13.99:5432/crsp]
#>    file_name  context speaker_number
#>    <chr>      <chr>            <int>
#>  1 12140978_T pres                 2
#>  2 1545010_T  qa                  57
#>  3 1545010_T  qa                  53
#>  4 1545010_T  qa                  22
#>  5 5284715_T  qa                  43
#>  6 5284715_T  pres                 5
#>  7 5081592_T  qa                  21
#>  8 5081592_T  pres                 2
#>  9 836096_T   qa                  99
#> 10 4181352_T  qa                  30
#> # … with more rows

^{Created on 2019-07-29 by the reprex package (v0.3.0)}

Yvonne-Han commented 5 years ago

@iangow Thanks, Ian! I've got one more question about generating liwc_2015_output. Could it be that we got the dictionary right, but did something different when applying it to the calls? (For example, when you code the table, do you use regex? Could it be different from what LIWC software is using?) I cross-examined the dictionary for some of the categories that are producing different results, but it seems that the extracted dictionary is the same as the pdf.

iangow commented 5 years ago

> @iangow Thanks, Ian! I've got one more question about generating liwc_2015_output. Could it be that we got the dictionary right, but did something different when applying it to the calls? (For example, when you code the table, do you use regex? Could it be different from what LIWC software is using?) I cross-examined the dictionary for some of the categories that are producing different results, but it seems that the extracted dictionary is the same as the pdf.

For sure. But that's why we need to compare the numbers. I think it's harder to think of the issues in the abstract ("theory") than it is to take a more direct approach ("empirical"). We would not have picked up the "missing first word" issue using "theory".

Regarding what I do, it's all here.

iangow commented 5 years ago

Maybe take the first row above (file_name=='12140978_T', context=='pres', speaker_number==2)

Yvonne-Han commented 5 years ago

Sure. Let me have a look and get back to you.

Yvonne-Han commented 5 years ago

@iangow I think I might have found another problem here. Our liwc_2015_output table seems to count those words that CONTAIN the dictionary words, while the software seems to only count those EXACT words.

Example input: Okay. And then, as you look at the 132 franchise agreements signed year-to-date, it looks like about at least through this third quarter, slightly less than half were net new agreements versus renewals or conversions. The new guys coming in, what's the mix of brands they're choosing? Are they -- where kind of in the scale are they kind of economy up? And what's the mix of brands they're coming over for?

LIWC software output: 2 words for the 'they' category (in bold)

LIWC_alt output: 4 words for the 'they' category (difference in italics)

iangow commented 5 years ago

OK. I would have to tweak the code for liwc_alt (see above) to retain the matched words. But this shouldn't be too difficult. Basically, it might involve creation of a modified version of this code.

iangow commented 5 years ago

Actually, this code requires that the text be inside word boundaries (\b in regex). It may be that ' in they're triggers a word boundary in the Python regex, but not in the LIWC software.

https://github.com/iangow/se_features/blob/f524e1114bc75061f7c218ee0e143a98fcb38330/liwc_2015/liwc_functions.py#L33
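
The word-boundary behaviour described here can be checked directly on the sample passage quoted earlier in this thread. The following is a minimal sketch and not the code in liwc_functions.py. The pattern `\bthey\b` and the apostrophe-aware alternative are assumptions made for illustration, and how the LIWC software actually tokenizes text is not confirmed here.

```python
import re

text = (
    "Okay. And then, as you look at the 132 franchise agreements signed "
    "year-to-date, it looks like about at least through this third quarter, "
    "slightly less than half were net new agreements versus renewals or "
    "conversions. The new guys coming in, what's the mix of brands they're "
    "choosing? Are they -- where kind of in the scale are they kind of "
    "economy up? And what's the mix of brands they're coming over for?"
)

# In Python's re module, \b matches between a word character and a non-word
# character. The apostrophe is a non-word character, so "they're" contains
# a match for \bthey\b.
boundary_hits = re.findall(r"\bthey\b", text, flags=re.IGNORECASE)

# Hypothetical alternative: also reject a match when an apostrophe is
# adjacent, so that "they're" is treated as a single token.
strict_hits = re.findall(r"(?<![\w'])they(?![\w'])", text, flags=re.IGNORECASE)

print(len(boundary_hits), len(strict_hits))
```

On this passage the plain `\b` pattern finds 4 matches and the apostrophe-aware pattern finds 2, the same counts reported for LIWC_alt and the LIWC software respectively.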

iangow commented 5 years ago

See here for a test notebook.

I wonder if there are other characters like ' that LIWC treats as word-breaks.
I wonder if ' also counts as a word-break for other words (for example, what's in the sample texts ... is there an LIWC category with what in it?).
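
One way to look for other candidate characters is to list the punctuation that occurs inside tokens in the call text, and then test each one in the LIWC software. The following is a rough sketch under that assumption. The function name `intraword_punctuation` is hypothetical. It only surfaces candidates and says nothing about how LIWC itself treats them.

```python
import re
from collections import Counter


def intraword_punctuation(text):
    """Count punctuation characters that sit directly between two word characters."""
    # Lookbehind and lookahead leave the neighbouring characters unconsumed,
    # so chained cases such as "year-to-date" contribute one hit per hyphen.
    return Counter(re.findall(r"(?<=\w)([^\w\s])(?=\w)", text))


sample = "what's the mix of brands they're choosing, signed year-to-date"
print(intraword_punctuation(sample))
```

For this sample the apostrophe and the hyphen each appear twice inside a token, while the comma after "choosing" is ignored because it is followed by a space.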

Yvonne-Han commented 5 years ago

@iangow I had a look at the dictionary. For most of the words that contain a ', they also include the word without ', and that's why it only triggers issues for some of the categories. (e.g., both what's and whats are in the same category: verb).

iangow commented 5 years ago

@Yvonne-Han It seems I broke my ' code in fixing that other issue. Let me try again.

Yvonne-Han commented 4 years ago

See #37 for more details. Closing this one.

iangow / se_features

Compare LIWC counts #17