Identify filings with missing speaker_names

iangow commented 5 years ago

There are many cases where speaker_name is missing (NULL) in SQL. When I ran the import_speaker_data.R code on some of these yesterday, these were not missing. This suggests that our code and data got out of sync at some point. I think we need to re-run the code for these cases. But first we need to identify them and delete the processed data from the data base.

# Sys.setenv(PGHOST = "10.101.13.99", PGDATABASE = "crsp")
library(DBI)
library(dplyr, warn.conflicts = FALSE)

pg <- dbConnect(RPostgreSQL::PostgreSQL())

# Define PG data ----

# Project schema
rs <- dbGetQuery(pg, "SET search_path TO streetevents, public")

speaker_data <- tbl(pg, "speaker_data")

speaker_data %>% 
    filter(is.na(speaker_name)) 
#> # Source:   lazy query [?? x 9]
#> # Database: postgres 9.6.11 [igow@10.101.13.99:5432/crsp]
#>    file_name last_update         speaker_name employer role  speaker_number
#>    <chr>     <dttm>              <chr>        <chr>    <chr>          <int>
#>  1 4930935_T 2012-11-09 17:08:37 <NA>         ""       ""                 1
#>  2 4930935_T 2012-11-09 17:08:37 <NA>         ""       ""                 1
#>  3 4930935_T 2012-11-09 17:08:37 <NA>         ""       ""                11
#>  4 4930935_T 2012-11-09 17:08:37 <NA>         ""       ""                21
#>  5 3054286_T 2010-05-26 12:24:09 <NA>         ""       ""                69
#>  6 4930935_T 2012-11-09 17:08:37 <NA>         ""       ""                47
#>  7 4930935_T 2012-11-09 17:08:37 <NA>         ""       ""                67
#>  8 4930935_T 2012-11-09 17:08:37 <NA>         ""       ""                78
#>  9 4930935_T 2012-11-09 17:08:37 <NA>         ""       ""                80
#> 10 854220_T  2004-02-24 20:37:37 <NA>         ""       ""                 1
#> # ... with more rows, and 3 more variables: speaker_text <chr>,
#> #   context <chr>, section <int>

^{Created on 2018-11-20 by the reprex package (v0.2.1)}

danielacarrasco commented 5 years ago

I ran the code and got these ones with missing speakers.

>  Database: postgres 9.6.11 [dcarrasco@10.101.13.99:5432/crsp]
>    file_name last_update         speaker_name employer role  speaker_number
    <chr>    <dttm>              <chr>        <chr>    <chr>          <int>
>  1 646169_T  2002-07-26 15:25:11 NA          ""      ""                55
>  2 632098_T  2002-05-22 13:41:02 NA          ""      ""                2
>  3 627068_T  2002-05-23 18:13:40 NA          ""      ""              101
>  4 577965_T  2002-08-15 12:10:49 NA          ""      ""                51
>  5 635285_T  2002-06-20 20:31:53 NA          ""      ""                21
>  6 638344_T  2002-06-18 19:20:39 NA          ""      ""                32
>  7 638344_T  2002-06-18 19:20:39 NA          ""      ""                92
>  8 638344_T  2002-06-18 19:20:39 NA          ""      ""                94
>  9 638344_T  2002-06-18 19:20:39 NA          ""      ""                97
> 10 3686198_T 2011-02-02 19:59:30 NA          ""      ""                3
>  … with more rows, and 3 more variables: speaker_text <chr>, context <chr>,
>    section <int>
>

I re ran the code for one of the files but it still came up as speaker_name missing. I checked the .xml files and they indeed have some missing speakers.

What should I do?

iangow commented 5 years ago

If there are no speaker names in the original data files, there's nothing to be done. If there are speaker names in the original data files, then we need to check whether the current code correctly parses out the speaker names ("code parses OK"). If code parses OK, then I think the best approach would be to run SQL to delete the calls meeting this criterion: DELETE FROM streetevents.speaker_data WHERE file_name = '646169_T'; (if doing this for many files, individual SQL statements are probably not ideal; let's discuss if this is the case). If not code parses OK, then we may need to tweak the code to correctly extract the data.

iangow commented 5 years ago

So I identified the file_name values associated with empty speaker_name values:

library(DBI)
library(dplyr, warn.conflicts = FALSE)

pg <- dbConnect(RPostgres::Postgres())

# Define PG data ----

# Project schema
rs <- dbExecute(pg, "SET search_path TO streetevents, public")

speaker_data <- tbl(pg, "speaker_data")

rs <- dbExecute(pg, "DROP TABLE IF EXISTS missing_names")

missing_names <-
    speaker_data %>% 
    filter(is.na(speaker_name)) %>%
    distinct(file_name) %>%
    compute(name = "missing_names", temporary = FALSE)

missing_names %>% count()

> missing_names %>% count()
# Source:   lazy query [?? x 1]
# Database: postgres [igow@10.101.13.99:5432/crsp]
  n              
  <S3: integer64>
1 346

Note that code above creates a table streetevents.missing_names with these 346 file_name values.

I then ran the following SQL:

igowmbp15:~ igow$ psql -h 10.101.13.99 -d crsp
psql (11.1, server 9.6.11)
SSL connection (protocol: TLSv1.2, cipher: ECDHE-RSA-AES256-GCM-SHA384, bits: 256, compression: off)
Type "help" for help.

crsp=# DELETE FROM streetevents.speaker_data
WHERE file_name IN (SELECT file_name FROM streetevents.missing_names);
DELETE 36913
crsp=#

This deletes the data from streetevents.speaker_data related to these 346 calls.

I then ran the code to import data into speaker_data. This code automatically detects the calls that don't have speaker_data associated with them and processes these calls:

> source('~/git/se_core/import_speaker_data.R')
Importing speaker data.
[1] 346
[1] "Speaker data has 33390 rows"
[1] "Writing data to Postgres"
[1] "Writing dupe data to Postgres"
New rows:  TRUE

I then ran the first piece of code again:

library(DBI)
library(dplyr, warn.conflicts = FALSE)

pg <- dbConnect(RPostgres::Postgres())

# Define PG data ----

# Project schema
rs <- dbExecute(pg, "SET search_path TO streetevents, public")

speaker_data <- tbl(pg, "speaker_data")

rs <- dbExecute(pg, "DROP TABLE IF EXISTS missing_names")

missing_names <-
    speaker_data %>% 
    filter(is.na(speaker_name)) %>%
    distinct(file_name) %>%
    compute(name = "missing_names", temporary = FALSE)

missing_names %>% count()
#> # Source:   lazy query [?? x 1]
#> # Database: postgres [igow@10.101.13.99:5432/crsp]
#>   n              
#>   <S3: integer64>
#> 1 338

^{Created on 2019-02-11 by the reprex package (v0.2.1)}

So it seems that most of these calls either do not have speaker_name data or the code does not correctly parse this data for these files. I think it makes sense to take a sample of ~20 file_name values and inspect the underlying XML files to see if there is no valid data to parse (if you look at 4-5 and there is valid data to parse, then it probably make sense to focus on addressing the parsing issues first).

danielacarrasco commented 5 years ago

I checked some of the files. I believe the problem is that there are portions of the files with missing speakers. For example, file 1002981_T.xml below.

Operator   [182]
--------------------------------------------------------------------------------
<text>
<text>
<text>
--------------------------------------------------------------------------------
 Jack London,  CACI International - Chairman,   President  and  CEO   [183]
--------------------------------------------------------------------------------
<text>
<text>
<text>
--------------------------------------------------------------------------------
Operator   [184]
--------------------------------------------------------------------------------
 Once again thank you for your participation.
--------------------------------------------------------------------------------
[185]
--------------------------------------------------------------------------------
 That does conclude today's call. You may disconnect at this time.

That last line shows up as missing speaker. Some files have just one missing speaker with a line like the one above, while others have a more substantial portion of the conference call with the speakers missing. I was thinking I can run a script to check what portion of the call is missing speakers? It doesn't seem to be a problem when parsing the data though, it is an issue of the original file.

danielacarrasco commented 5 years ago

@iangow what should I do regarding the previous message?

iangow commented 5 years ago

It doesn't seem to be a problem when parsing the data though, it is an issue of the original file.

To the extent that this is the case, there is nothing we can do.

iangow / se_core

Identify filings with missing speaker_names #3