Closed by iangow 5 years ago
I ran the code and got these rows with missing speakers.
```
# Database: postgres 9.6.11 [dcarrasco@10.101.13.99:5432/crsp]
   file_name last_update         speaker_name employer role  speaker_number
   <chr>     <dttm>              <chr>        <chr>    <chr>          <int>
 1 646169_T  2002-07-26 15:25:11 NA           ""       ""                55
 2 632098_T  2002-05-22 13:41:02 NA           ""       ""                 2
 3 627068_T  2002-05-23 18:13:40 NA           ""       ""               101
 4 577965_T  2002-08-15 12:10:49 NA           ""       ""                51
 5 635285_T  2002-06-20 20:31:53 NA           ""       ""                21
 6 638344_T  2002-06-18 19:20:39 NA           ""       ""                32
 7 638344_T  2002-06-18 19:20:39 NA           ""       ""                92
 8 638344_T  2002-06-18 19:20:39 NA           ""       ""                94
 9 638344_T  2002-06-18 19:20:39 NA           ""       ""                97
10 3686198_T 2011-02-02 19:59:30 NA           ""       ""                 3
 … with more rows, and 3 more variables: speaker_text <chr>, context <chr>,
   section <int>
```
I re-ran the code for one of the files, but `speaker_name` still came up missing. I checked the .xml files and they do indeed have some missing speakers.
What should I do?
If there are no speaker names in the original data files, there's nothing to be done.
If there are speaker names in the original data files, then we need to check whether the current code correctly parses out the speaker names ("code parses OK").
If code parses OK, then I think the best approach would be to run SQL to delete the calls meeting this criterion: DELETE FROM streetevents.speaker_data WHERE file_name = '646169_T';
(if doing this for many files, individual SQL statements are probably not ideal; let's discuss if this is the case).
If the code does not parse OK, then we may need to tweak the code to correctly extract the data.
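For many files, rather than hand-writing one `DELETE` per call, the file names can be bound as a parameter so a single prepared statement handles the whole batch. A minimal sketch (the `bad_files` vector is illustrative, not the real list):

```r
library(DBI)

pg <- dbConnect(RPostgres::Postgres())

# file_name values whose speaker data should be deleted (illustrative)
bad_files <- c("646169_T", "632098_T")

# One prepared statement; DBI executes it once per bound file_name value
deleted <- dbExecute(pg,
                     "DELETE FROM streetevents.speaker_data
                      WHERE file_name = $1",
                     params = list(bad_files))
deleted  # total number of rows deleted

dbDisconnect(pg)
```

This keeps the round trips down to one prepared statement and avoids pasting file names into SQL strings by hand.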
So I identified the `file_name` values associated with empty `speaker_name` values:
```r
library(DBI)
library(dplyr, warn.conflicts = FALSE)

pg <- dbConnect(RPostgres::Postgres())

# Define PG data ----
# Project schema
rs <- dbExecute(pg, "SET search_path TO streetevents, public")

speaker_data <- tbl(pg, "speaker_data")

rs <- dbExecute(pg, "DROP TABLE IF EXISTS missing_names")

missing_names <-
    speaker_data %>%
    filter(is.na(speaker_name)) %>%
    distinct(file_name) %>%
    compute(name = "missing_names", temporary = FALSE)

missing_names %>% count()
#> # Source:   lazy query [?? x 1]
#> # Database: postgres [igow@10.101.13.99:5432/crsp]
#>   n
#>   <S3: integer64>
#> 1 346
```
Note that the code above creates a table `streetevents.missing_names` with these 346 `file_name` values.
I then ran the following SQL:
```
igowmbp15:~ igow$ psql -h 10.101.13.99 -d crsp
psql (11.1, server 9.6.11)
SSL connection (protocol: TLSv1.2, cipher: ECDHE-RSA-AES256-GCM-SHA384, bits: 256, compression: off)
Type "help" for help.

crsp=# DELETE FROM streetevents.speaker_data
       WHERE file_name IN (SELECT file_name FROM streetevents.missing_names);
DELETE 36913
crsp=#
```
This deletes the data from `streetevents.speaker_data` related to these 346 calls.
I then ran the code to import data into `speaker_data`. This code automatically detects the calls that don't have `speaker_data` associated with them and processes these calls:
```
> source('~/git/se_core/import_speaker_data.R')
Importing speaker data.
[1] 346
[1] "Speaker data has 33390 rows"
[1] "Writing data to Postgres"
[1] "Writing dupe data to Postgres"
New rows: TRUE
```
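The detection step presumably amounts to an anti-join between the table of calls and `speaker_data`. A minimal sketch of that logic, assuming a `streetevents.calls` table keyed by `file_name` (the table name is an assumption, not confirmed from `se_core`):

```r
library(DBI)
library(dplyr, warn.conflicts = FALSE)

pg <- dbConnect(RPostgres::Postgres())
rs <- dbExecute(pg, "SET search_path TO streetevents, public")

calls <- tbl(pg, "calls")              # assumed table listing all calls
speaker_data <- tbl(pg, "speaker_data")

# Calls with no speaker_data rows at all: the ones the import
# script would pick up and (re)process
to_process <-
    calls %>%
    anti_join(speaker_data, by = "file_name") %>%
    distinct(file_name) %>%
    collect()
```

Because the 346 calls were just deleted from `speaker_data`, they fall out of this anti-join and get re-processed, which matches the `[1] 346` in the output above.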
I then ran the first piece of code again:
```r
library(DBI)
library(dplyr, warn.conflicts = FALSE)

pg <- dbConnect(RPostgres::Postgres())

# Define PG data ----
# Project schema
rs <- dbExecute(pg, "SET search_path TO streetevents, public")

speaker_data <- tbl(pg, "speaker_data")

rs <- dbExecute(pg, "DROP TABLE IF EXISTS missing_names")

missing_names <-
    speaker_data %>%
    filter(is.na(speaker_name)) %>%
    distinct(file_name) %>%
    compute(name = "missing_names", temporary = FALSE)

missing_names %>% count()
#> # Source:   lazy query [?? x 1]
#> # Database: postgres [igow@10.101.13.99:5432/crsp]
#>   n
#>   <S3: integer64>
#> 1 338
```
Created on 2019-02-11 by the reprex package (v0.2.1)
So it seems that most of these calls either do not have `speaker_name` data or the code does not correctly parse this data for these files. I think it makes sense to take a sample of ~20 `file_name` values and inspect the underlying XML files to see if there is no valid data to parse (if you look at 4–5 and there is valid data to parse, then it probably makes sense to focus on addressing the parsing issues first).
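One way to draw such a sample, using the `streetevents.missing_names` table created above (the seed and sample size are arbitrary):

```r
library(DBI)
library(dplyr, warn.conflicts = FALSE)

pg <- dbConnect(RPostgres::Postgres())
rs <- dbExecute(pg, "SET search_path TO streetevents, public")

missing_names <- tbl(pg, "missing_names")

# Pull the file_name values locally and sample ~20 for manual inspection
set.seed(2019)
sample_files <-
    missing_names %>%
    collect() %>%
    sample_n(20)

sample_files
```

Setting the seed makes the sample reproducible if we need to revisit the same files later.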
I checked some of the files. I believe the problem is that there are portions of the files with missing speakers. For example, see file 1002981_T.xml below.
```
Operator [182]
--------------------------------------------------------------------------------
<text>
<text>
<text>
--------------------------------------------------------------------------------
Jack London, CACI International - Chairman, President and CEO [183]
--------------------------------------------------------------------------------
<text>
<text>
<text>
--------------------------------------------------------------------------------
Operator [184]
--------------------------------------------------------------------------------
Once again thank you for your participation.
--------------------------------------------------------------------------------
[185]
--------------------------------------------------------------------------------
That does conclude today's call. You may disconnect at this time.
```
That last line shows up as a missing speaker. Some files have just one missing speaker with a line like the one above, while others have a more substantial portion of the conference call with speakers missing. I was thinking I could run a script to check what portion of each call is missing speakers. It doesn't seem to be a problem with parsing the data, though; it is an issue with the original file.
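A sketch of such a check, assuming speaker headers look like the excerpt above (a name followed by `[n]`, or a bare `[n]` when the speaker is missing); `missing_speaker_share` is a hypothetical helper, not part of `se_core`:

```r
# Share of speaker headers ("Name [n]" or bare "[n]") that lack a name
missing_speaker_share <- function(lines) {
    # header lines end in "[n]"
    headers <- grep("\\[[0-9]+\\][[:space:]]*$", lines, value = TRUE)
    if (length(headers) == 0) return(NA_real_)
    # a missing speaker is a header consisting only of "[n]"
    mean(grepl("^[[:space:]]*\\[[0-9]+\\][[:space:]]*$", headers))
}

excerpt <- c("Operator [184]",
             "Once again thank you for your participation.",
             "[185]",
             "That does conclude today's call.")
missing_speaker_share(excerpt)  # one of two headers has no name: 0.5
```

Running this over the plain-text rendering of each flagged call would separate files with a single nameless closing line from those where a substantial portion of the call has no speakers.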
@iangow what should I do regarding the previous message?
> It doesn't seem to be a problem with parsing the data, though; it is an issue with the original file.
To the extent that this is the case, there is nothing we can do.
There are many cases where `speaker_name` is missing (`NULL`) in SQL. When I ran the `import_speaker_data.R` code on some of these yesterday, these were not missing. This suggests that our code and data got out of sync at some point. I think we need to re-run the code for these cases. But first we need to identify them and delete the processed data from the database.
Created on 2018-11-20 by the reprex package (v0.2.1)