Investigate missing entries within a processed call

Yvonne-Han commented 4 years ago

@iangow One more thing that I just found today. Not sure how common it is. For a given call that is processed in word_counts, there can be some missing entries. To give you an example:

library(dplyr, warn.conflicts = FALSE)
library(DBI)
library(tidyr)
library(reprex)

pg <- dbConnect(RPostgres::Postgres())
#> Warning: multiple methods tables found for 'dbQuoteLiteral'

rs <- dbExecute(pg, "SET search_path TO honours_yvonne, streetevents, se_features")

calls <- tbl(pg, "calls")
speaker_data <- tbl(pg, "speaker_data")
word_counts <- tbl(pg, "word_counts")

speaker_data %>%
  filter(file_name == "632090_T") 
#> # Source:   lazy query [?? x 9]
#> # Database: postgres [yanzih1@10.101.13.99:5432/crsp]
#>    file_name last_update         speaker_name employer role  speaker_number
#>    <chr>     <dttm>              <chr>        <chr>    <chr>          <int>
#>  1 632090_T  2002-05-21 05:15:37 Moderator, … ""       ""                12
#>  2 632090_T  2002-05-21 05:15:37 Robert Lobb… FiberCo… CFO               11
#>  3 632090_T  2002-05-21 05:15:37 Mohd Aslami  FiberCo… CEO …             10
#>  4 632090_T  2002-05-21 05:15:37 Moderator, … ""       ""                 9
#>  5 632090_T  2002-05-21 05:15:37 Analyst,     ""       ""                 8
#>  6 632090_T  2002-05-21 05:15:37 Mohd Aslami  FiberCo… CEO …              7
#>  7 632090_T  2002-05-21 05:15:37 Analyst,     ""       ""                 6
#>  8 632090_T  2002-05-21 05:15:37 Robert Lobb… FiberCo… CFO                5
#>  9 632090_T  2002-05-21 05:15:37 Mohd Aslami  FiberCo… CEO …              4
#> 10 632090_T  2002-05-21 05:15:37 Mohd Aslami  FiberCo… CEO …              3
#> # … with more rows, and 3 more variables: speaker_text <chr>,
#> #   context <chr>, section <int>

speaker_data %>%
  filter(file_name == "632090_T") %>%
  anti_join(word_counts)
#> Joining, by = c("file_name", "last_update", "speaker_name", "speaker_number", "context", "section")
#> # Source:   lazy query [?? x 9]
#> # Database: postgres [yanzih1@10.101.13.99:5432/crsp]
#>   file_name last_update         speaker_name employer role  speaker_number
#>   <chr>     <dttm>              <chr>        <chr>    <chr>          <int>
#> 1 632090_T  2002-05-21 05:15:37 <NA>         ""       ""                 2
#> # … with 3 more variables: speaker_text <chr>, context <chr>,
#> #   section <int>

^{Created on 2020-04-20 by the reprex package (v0.3.0)}

iangow commented 4 years ago

Maybe delete the data from word_counts and try again. Maybe this could be handled with #22.

Yvonne-Han commented 4 years ago

[x] Identify which calls are affected
[ ] Delete them and try again

Yvonne-Han commented 4 years ago

To figure out which calls have this issue, I did this:

library(dplyr, warn.conflicts = FALSE)
library(DBI)
library(reprex)

pg <- dbConnect(RPostgres::Postgres())
rs <- dbExecute(pg, "SET search_path TO streetevents, se_features")

word_counts <- tbl(pg, "word_counts")
speaker_data <- tbl(pg, "speaker_data")

processed_calls <- word_counts %>%
  filter(speaker_number != 0) %>%
  select(file_name, last_update) %>% 
  distinct()

speaker_data_for_processed_calls <- speaker_data %>%
  select(-speaker_text) %>%
  inner_join(processed_calls)
#> Joining, by = c("file_name", "last_update")

speaker_data_for_processed_calls %>%
  anti_join(word_counts) %>%
  print(n=20)
#> Joining, by = c("file_name", "last_update", "speaker_name", "speaker_number", "context", "section")
#> # Source:   lazy query [?? x 8]
#> # Database: postgres [yanzih1@10.101.13.99:5432/crsp]
#>    file_name last_update         speaker_name employer role  speaker_number
#>    <chr>     <dttm>              <chr>        <chr>    <chr>          <int>
#>  1 1002981_T 2005-01-26 10:26:30 <NA>         ""       ""               185
#>  2 1003431_T 2005-01-28 04:58:54 <NA>         ""       ""                 1
#>  3 1014496_T 2005-02-16 23:50:25 <NA>         ""       ""               118
#>  4 1017321_T 2005-03-10 08:11:10 <NA>         ""       ""                 1
#>  5 1024065_T 2005-03-02 05:23:05 <NA>         ""       ""                 2
#>  6 1025829_T 2005-03-11 12:03:34 <NA>         ""       ""                 1
#>  7 1036691_T 2005-04-13 03:52:48 <NA>         ""       ""               109
#>  8 1045573_T 2005-04-20 16:22:31 <NA>         ""       ""                78
#>  9 1048662_T 2005-04-22 07:23:38 <NA>         ""       ""                23
#> 10 1050494_T 2005-05-06 03:42:04 <NA>         ""       ""                 1
#> 11 1050971_T 2005-04-29 16:21:39 <NA>         ""       ""               116
#> 12 1051800_T 2005-04-27 13:33:29 <NA>         ""       ""                30
#> 13 1051800_T 2005-04-27 13:33:29 <NA>         ""       ""                28
#> 14 1051800_T 2005-04-27 13:33:29 <NA>         ""       ""                23
#> 15 1054727_T 2005-05-04 22:27:37 <NA>         ""       ""               142
#> 16 1054793_T 2005-05-10 14:22:45 <NA>         ""       ""                 6
#> 17 1061671_T 2005-05-10 04:31:12 <NA>         ""       ""                 4
#> 18 1062947_T 2005-05-12 14:01:29 <NA>         ""       ""                 6
#> 19 1063141_T 2005-05-16 23:06:45 <NA>         ""       ""                90
#> 20 1067517_T 2005-05-17 01:01:52 <NA>         ""       ""                 5
#> # … with more rows, and 2 more variables: context <chr>, section <int>

affected_calls <- speaker_data_for_processed_calls %>%
  anti_join(word_counts) %>%
  select(file_name) %>%
  distinct()
#> Joining, by = c("file_name", "last_update", "speaker_name", "speaker_number", "context", "section")

affected_calls %>% count()
#> # Source:   lazy query [?? x 1]
#> # Database: postgres [yanzih1@10.101.13.99:5432/crsp]
#>   n      
#>   <int64>
#> 1 335

^{Created on 2020-04-26 by the reprex package (v0.3.0)}

It might be worth investigating why all of these missing entries have NA or empty strings for speaker_name, role and employer.

Yvonne-Han commented 4 years ago

Okay... I can see why these entries were skipped when creating word_counts:

https://github.com/iangow/se_features/blob/e1a25e0bf8e9cfbbd093fd9183a7fcf5ddb12336/word_count/word_count_add.py#L20-L32

@iangow I can see that you only create word_counts for speaker_data where speaker_name IS NOT NULL in Line30 (Should be the same when you created other se_features tables). https://github.com/iangow/se_features/blob/e1a25e0bf8e9cfbbd093fd9183a7fcf5ddb12336/word_count/word_count_add.py#L30

So I guess my question is: Do we want to keep this as it is? If so, there's nothing to fix.

iangow / se_features

Investigate missing entries within a processed call #23