Open Yvonne-Han opened 4 years ago
Maybe delete the data from word_counts and try again. Maybe this could be handled with #22.
To figure out which calls have this issue, I did this:
library(dplyr, warn.conflicts = FALSE)
library(DBI)
library(reprex)
pg <- dbConnect(RPostgres::Postgres())
rs <- dbExecute(pg, "SET search_path TO streetevents, se_features")
word_counts <- tbl(pg, "word_counts")
speaker_data <- tbl(pg, "speaker_data")
processed_calls <- word_counts %>%
  filter(speaker_number != 0) %>%
  select(file_name, last_update) %>%
  distinct()
speaker_data_for_processed_calls <- speaker_data %>%
  select(-speaker_text) %>%
  inner_join(processed_calls)
#> Joining, by = c("file_name", "last_update")
speaker_data_for_processed_calls %>%
  anti_join(word_counts) %>%
  print(n = 20)
#> Joining, by = c("file_name", "last_update", "speaker_name", "speaker_number", "context", "section")
#> # Source: lazy query [?? x 8]
#> # Database: postgres [yanzih1@10.101.13.99:5432/crsp]
#>    file_name last_update         speaker_name employer role  speaker_number
#>    <chr>     <dttm>              <chr>        <chr>    <chr>          <int>
#>  1 1002981_T 2005-01-26 10:26:30 <NA>         ""       ""               185
#>  2 1003431_T 2005-01-28 04:58:54 <NA>         ""       ""                 1
#>  3 1014496_T 2005-02-16 23:50:25 <NA>         ""       ""               118
#>  4 1017321_T 2005-03-10 08:11:10 <NA>         ""       ""                 1
#>  5 1024065_T 2005-03-02 05:23:05 <NA>         ""       ""                 2
#>  6 1025829_T 2005-03-11 12:03:34 <NA>         ""       ""                 1
#>  7 1036691_T 2005-04-13 03:52:48 <NA>         ""       ""               109
#>  8 1045573_T 2005-04-20 16:22:31 <NA>         ""       ""                78
#>  9 1048662_T 2005-04-22 07:23:38 <NA>         ""       ""                23
#> 10 1050494_T 2005-05-06 03:42:04 <NA>         ""       ""                 1
#> 11 1050971_T 2005-04-29 16:21:39 <NA>         ""       ""               116
#> 12 1051800_T 2005-04-27 13:33:29 <NA>         ""       ""                30
#> 13 1051800_T 2005-04-27 13:33:29 <NA>         ""       ""                28
#> 14 1051800_T 2005-04-27 13:33:29 <NA>         ""       ""                23
#> 15 1054727_T 2005-05-04 22:27:37 <NA>         ""       ""               142
#> 16 1054793_T 2005-05-10 14:22:45 <NA>         ""       ""                 6
#> 17 1061671_T 2005-05-10 04:31:12 <NA>         ""       ""                 4
#> 18 1062947_T 2005-05-12 14:01:29 <NA>         ""       ""                 6
#> 19 1063141_T 2005-05-16 23:06:45 <NA>         ""       ""                90
#> 20 1067517_T 2005-05-17 01:01:52 <NA>         ""       ""                 5
#> # … with more rows, and 2 more variables: context <chr>, section <int>
affected_calls <- speaker_data_for_processed_calls %>%
  anti_join(word_counts) %>%
  select(file_name) %>%
  distinct()
#> Joining, by = c("file_name", "last_update", "speaker_name", "speaker_number", "context", "section")
affected_calls %>% count()
#> # Source: lazy query [?? x 1]
#> # Database: postgres [yanzih1@10.101.13.99:5432/crsp]
#> n
#> <int64>
#> 1 335
Created on 2020-04-26 by the reprex package (v0.3.0)
It might be worth investigating why all of these missing entries have NA or empty strings for speaker_name, role and employer.
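One way to check whether the pattern really holds for every missing entry, rather than just the first 20 printed, is to count the NA and empty values directly. The sketch below is only an illustration: `missing_rows` is a hypothetical in-memory stand-in for the anti-join result above, and its values are made up.

```r
library(dplyr, warn.conflicts = FALSE)

# Hypothetical stand-in for the anti-join result (values are made up).
missing_rows <- tibble(
  file_name      = c("A_T", "A_T", "B_T"),
  speaker_number = c(1L, 2L, 5L),
  speaker_name   = c(NA_character_, NA_character_, NA_character_),
  employer       = c("", "", ""),
  role           = c("", "", "")
)

# Count how many missing rows have an NA name or blank employer/role.
# The as.integer() casts are there so the same expression has a chance of
# translating to SQL if run against a lazy database table instead.
missing_summary <- missing_rows %>%
  summarise(
    n_rows           = n(),
    n_na_name        = sum(as.integer(is.na(speaker_name)), na.rm = TRUE),
    n_blank_employer = sum(as.integer(employer == ""), na.rm = TRUE),
    n_blank_role     = sum(as.integer(role == ""), na.rm = TRUE)
  )

print(missing_summary)
```

If `n_na_name` equals `n_rows` on the real data, every missing entry lacks a speaker name, which would point to a filter on speaker_name rather than a processing failure.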
Okay... I can see why these entries were skipped when creating word_counts:
@iangow I can see that you only create word_counts for speaker_data where speaker_name IS NOT NULL, in line 30 (presumably the same filter was applied when you created the other se_features tables).
https://github.com/iangow/se_features/blob/e1a25e0bf8e9cfbbd093fd9183a7fcf5ddb12336/word_count/word_count_add.py#L30
So I guess my question is: Do we want to keep this as it is? If so, there's nothing to fix.
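If the current behaviour is kept, the check for missing entries could apply the same filter first, so that rows skipped on purpose are not reported as missing. A minimal sketch of that idea, using hypothetical toy tables rather than the real streetevents data (the join keys and `word_count` column are assumptions made for illustration):

```r
library(dplyr, warn.conflicts = FALSE)

# Hypothetical toy tables; names mirror the issue but the data is made up.
speaker_data <- tibble(
  file_name      = c("A_T", "A_T", "A_T"),
  speaker_number = c(1L, 2L, 3L),
  speaker_name   = c("Jane Doe", NA, "John Roe")
)

# Mimic a word_counts table built only from rows with a speaker name.
word_counts <- speaker_data %>%
  filter(!is.na(speaker_name)) %>%
  mutate(word_count = c(120L, 45L))

# Assumed join keys for this toy example.
keys <- c("file_name", "speaker_number")

# Without the filter, the unnamed speaker shows up as "missing".
missing_all <- speaker_data %>%
  anti_join(word_counts, by = keys)

# With the same filter as the loader, nothing is missing.
missing_named <- speaker_data %>%
  filter(!is.na(speaker_name)) %>%
  anti_join(word_counts, by = keys)

print(nrow(missing_all))
print(nrow(missing_named))
```

Any rows left in `missing_named` on the real tables would be entries that passed the filter and still have no word counts, which is the case worth investigating.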
@iangow One more thing that I just found today. Not sure how common it is. For a given call that is processed in word_counts, there can be some missing entries. To give you an example:

Created on 2020-04-20 by the reprex package (v0.3.0)