Sort out `speaker_number ==0` entries

Yvonne-Han commented 4 years ago

@iangow When I was merging word_counts with speaker_data, I found that there were speaker_number == 0 entries in word_counts, which is really weird.

As you can see below, I've tested this on different se_features tables and it seems that this issue can be found in multiple tables, e.g., word_counts, tone_measure, fog_measure, etc.

library(dbplyr, warn.conflicts = FALSE)
library(tidyr)
library(DBI)
library(tidyverse)
library(reprex)

pg <- dbConnect(RPostgres::Postgres())
rs <- dbExecute(pg, "SET search_path TO streetevents, se_features")

speaker_data <- tbl(pg, "speaker_data")
word_counts <- tbl(pg, "word_counts")
tone_measure <- tbl(pg, "tone_measure")
fog_measure <- tbl(pg, "fog_measure")

word_counts %>%
  filter(speaker_number == 0)
#> # Source:   lazy query [?? x 11]
#> # Database: postgres [yanzih1@10.101.13.99:5432/crsp]
#>    file_name last_update         speaker_name speaker_number context
#>    <chr>     <dttm>              <chr>                 <int> <chr>  
#>  1 10341579… 2017-04-28 17:29:24 ""                        0 pres   
#>  2 3432892_T 2010-11-10 08:22:33 ""                        0 pres   
#>  3 5918037_T 2016-02-05 03:07:17 ""                        0 pres   
#>  4 1809544_T 2008-04-26 01:58:27 ""                        0 pres   
#>  5 5927038_T 2016-02-26 16:06:37 ""                        0 pres   
#>  6 1828917_T 2019-03-29 18:53:16 ""                        0 pres   
#>  7 842006_T  2004-01-31 12:10:02 ""                        0 pres   
#>  8 775879_T  2003-09-04 02:54:50 ""                        0 pres   
#>  9 7353511_T 2017-11-10 09:07:55 ""                        0 pres   
#> 10 2321799_T 2009-08-06 04:43:46 ""                        0 pres   
#> # … with more rows, and 6 more variables: section <int>, count <int>,
#> #   sum <int>, sent_count <int>, sum_6 <int>, sum_num <int>

tone_measure %>%
  filter(speaker_number == 0)
#> # Source:   lazy query [?? x 11]
#> # Database: postgres [yanzih1@10.101.13.99:5432/crsp]
#>    file_name last_update         speaker_number context section positive
#>    <chr>     <dttm>                       <int> <chr>     <int>    <int>
#>  1 1435324_T 2006-12-13 03:56:35              0 pres          1       NA
#>  2 854506_T  2004-03-09 10:05:08              0 pres          1       NA
#>  3 3281614_T 2010-08-23 19:21:43              0 pres          1       NA
#>  4 5537178_T 2014-11-18 09:29:01              0 pres          1       NA
#>  5 10778297… 2017-08-05 05:13:10              0 pres          1       NA
#>  6 5628505_T 2015-02-18 02:53:29              0 pres          1       NA
#>  7 798002_T  2003-10-29 05:38:48              0 pres          1       NA
#>  8 10450104… 2017-08-12 09:56:05              0 pres          1       NA
#>  9 4212203_T 2011-10-28 23:01:10              0 pres          1       NA
#> 10 11158524… 2018-03-02 23:44:34              0 pres          1       NA
#> # … with more rows, and 5 more variables: negative <int>,
#> #   uncertainty <int>, litigious <int>, modal_strong <int>,
#> #   modal_weak <int>

fog_measure %>%
  filter(speaker_number == 0)
#> # Source:   lazy query [?? x 9]
#> # Database: postgres [yanzih1@10.101.13.99:5432/crsp]
#>    file_name last_update         speaker_number context section   fog
#>    <chr>     <dttm>                       <int> <chr>     <int> <dbl>
#>  1 7367372_T 2017-11-24 03:16:43              0 pres          1    NA
#>  2 5856104_T 2016-01-26 09:06:24              0 pres          1    NA
#>  3 756244_T  2003-07-23 10:47:34              0 pres          1    NA
#>  4 1069220_T 2005-08-26 02:55:09              0 pres          1    NA
#>  5 11681138… 2018-08-17 22:39:41              0 pres          1    NA
#>  6 3194110_T 2010-09-08 06:13:04              0 pres          1    NA
#>  7 3001784_T 2010-04-30 14:46:53              0 pres          1    NA
#>  8 2107375_T 2009-03-10 07:04:40              0 pres          1    NA
#>  9 11592179… 2018-08-02 02:45:32              0 pres          1    NA
#> 10 1592381_T 2007-07-20 01:11:16              0 pres          1    NA
#> # … with more rows, and 3 more variables: complex_words <dbl>,
#> #   fog_words <dbl>, fog_sents <dbl>

^{Created on 2020-04-19 by the reprex package (v0.3.0)}

Yvonne-Han commented 4 years ago

Double confirmed that speaker_data doesn't have speaker_number == 0 entries so this should be an issue in our code when creating se_features.

library(dbplyr, warn.conflicts = FALSE)
library(tidyr)
library(DBI)
library(tidyverse)
library(reprex)

pg <- dbConnect(RPostgres::Postgres())
rs <- dbExecute(pg, "SET search_path TO streetevents, se_features")

speaker_data <- tbl(pg, "speaker_data")
word_counts <- tbl(pg, "word_counts")
tone_measure <- tbl(pg, "tone_measure")
fog_measure <- tbl(pg, "fog_measure")

word_counts %>% 
  filter(speaker_number == 0) %>%
  count() 
#> # Source:   lazy query [?? x 1]
#> # Database: postgres [yanzih1@10.101.13.99:5432/crsp]
#>   n      
#>   <int64>
#> 1 7459

word_counts %>%
  filter(speaker_number == 0) %>%
  anti_join(speaker_data) %>%
  count()
#> Joining, by = c("file_name", "last_update", "speaker_name", "speaker_number", "context", "section")
#> # Source:   lazy query [?? x 1]
#> # Database: postgres [yanzih1@10.101.13.99:5432/crsp]
#>   n      
#>   <int64>
#> 1 7459

^{Created on 2020-04-19 by the reprex package (v0.3.0)}

iangow commented 4 years ago

I think you may want to check what the function returns when you run it on this call. If you create a Python Notebook in the directory where the files are then you should be able to import some_function from fog.fog or something like that.

Yvonne-Han commented 4 years ago

After trying a few things, I think this chunk of code below is probably the reason why speaker_number == 0 entries are created (same applies to other se_features tables).

https://github.com/iangow/se_features/blob/e1a25e0bf8e9cfbbd093fd9183a7fcf5ddb12336/word_count/word_count_add.py#L42-L48

So the next step would be to figure out under which circumstances do we get len(speaker_data) == 0.

Yvonne-Han commented 4 years ago

When creating a list of files to be processed, we select file_name and max(last_update) from streetevents.calls:

https://github.com/iangow/se_features/blob/e1a25e0bf8e9cfbbd093fd9183a7fcf5ddb12336/word_count/word_count_run.py#L88-L99

Therefore, whenever the max(last_update) in calls table is not matched with last_update in speaker_data table (the matching step is shown as below), it will lead to no speaker_data returned (i.e., len(speaker_data) == 0).

https://github.com/iangow/se_features/blob/e1a25e0bf8e9cfbbd093fd9183a7fcf5ddb12336/word_count/word_count_add.py#L20-L34

@iangow I guess this issue can be solved by:

[x] Deleting the speaker_number == 0 entries (because if they were sitting in the word_counts table, the code will not identify them as unprocessed calls and will skip them); and
[x] Running the code to create word_counts table again (since that the latest update is now matched in calls and speaker_data).
[x] Doing the same thing to other se_features tables.

iangow commented 4 years ago

Yes, I think that's the right approach. Maybe make an issue for each affected table (unless it's easy to handle all in one go). It should not take long to re-process the affected files.

Yvonne-Han commented 4 years ago

Waiting for #23 to decide whether we want to change word_counts_run.py before re-running it for these calls.

iangow commented 4 years ago

Waiting for #23 to decide whether we want to change word_counts_run.py before re-running it for these calls.

No need to wait for #23. Just run it again. We can come back to this after #23 and run again on the incremental calls (should be easy to do that then).

Yvonne-Han commented 4 years ago

Waiting for #23 to decide whether we want to change word_counts_run.py before re-running it for these calls.

No need to wait for #23. Just run it again. We can come back to this after #23 and run again on the incremental calls (should be easy to do that then).

Sure. Then I will run it again later today (probably tonight). It should be quick.

Yvonne-Han commented 4 years ago

This comment is created to keep track of the progress:

[x] word_counts
[x] tone_measure
[x] fog_measure

Updated 2020-04-30 00:18:35 AEST:

Deleted the NA entries in word_counts
Re-running word_counts_run.py

Updated 2020-04-30 21:57:10 AEST:

Deleted the NA entries in tone_measure
Re-running tone_measure_run.py

Updated 2020-05-01 23:54:24 AEST:

Deleted the NA entries in fog_measure
Re-running fog_run.py

Yvonne-Han commented 4 years ago

After deleting the speaker_number == 0 entries and re-running the code, the number of speaker_number == 0 entries has decreased (from ~7300 to ~1900), but there are still quite a few.

@iangow It seems that (one of) the issue(s) here is the records of these affected calls can be found in streetevents.calls table, but not in streetevents.speaker_data table. See below:

library(dplyr, warn.conflicts = FALSE)
library(DBI)
library(reprex)

pg <- dbConnect(RPostgres::Postgres())
rs <- dbExecute(pg, "SET search_path TO streetevents, se_features")

word_counts <- tbl(pg, "word_counts")
speaker_data <- tbl(pg, "speaker_data")
calls <- tbl(pg, "calls")

affected_calls <- word_counts %>%
  filter(speaker_number == 0) %>%
  select(file_name, last_update) 

affected_calls %>% count()
#> # Source:   lazy query [?? x 1]
#> # Database: postgres [yanzih1@10.101.13.99:5432/crsp]
#>   n      
#>   <int64>
#> 1 1939

affected_calls %>% 
  anti_join(speaker_data, by = "file_name") %>%
  count()
#> # Source:   lazy query [?? x 1]
#> # Database: postgres [yanzih1@10.101.13.99:5432/crsp]
#>   n      
#>   <int64>
#> 1 1898

affected_calls %>% 
  anti_join(calls, by = "file_name") %>%
  count()
#> # Source:   lazy query [?? x 1]
#> # Database: postgres [yanzih1@10.101.13.99:5432/crsp]
#>   n      
#>   <int64>
#> 1 0

^{Created on 2020-05-04 by the reprex package (v0.3.0)}

iangow / se_features

Sort out `speaker_number ==0` entries #22