Study mislabelling of pres and qa

iangow / se_core

Core code for StreetEvents data

7 stars 5 forks source link

Study mislabelling of pres and qa #15

Closed Yvonne-Han closed 4 years ago

Yvonne-Han commented 4 years ago

Some of the calls only have one context (mostly being pres only):

> speaker_data %>%
+     select(file_name, context) %>%
+     group_by(file_name) %>%
+     distinct() %>%
+     summarise(context_no = count(context)) %>%
+     filter(context_no == 1) %>%
+     count()
# Source:   lazy query [?? x 1]
# Database: postgres [yanzih1@10.101.13.99:5432/crsp]
  n      
  <int64>
1 49785

Indeed, some of the calls only have 1 context. However, some of them can be due to the mislabelling of contexts (i.e., after some point, the entries in speaker_data should be labelled as qa but they were still marked as pres). Not sure how many of these can be identified and fixed.

iangow commented 4 years ago

We may need to re-parse these. It may make sense to look at the original XML files. I can help you find those.

Yvonne-Han commented 4 years ago

We may need to re-parse these. It may make sense to look at the original XML files. I can help you find those.

Last time Tony asked me about this, and I told him that I randomly found them in the shared folder in CyberDuck, but it is difficult to find the XML file for a given file_name because there are too many of them. I hope there is a better way to locate them?

Yvonne-Han commented 4 years ago

We may need to re-parse these. It may make sense to look at the original XML files. I can help you find those.

Last time Tony asked me about this, and I told him that I randomly found them in the shared folder in CyberDuck, but it is difficult to find the XML file for a given file_name because there are too many of them. I hope there is a better way to locate them?

I will take my last sentence back.

Yvonne-Han commented 4 years ago

[x] Find calls with context mislabelled
[x] Read the XML files and find irregularities/issues (if any)
[ ] Decide whether we want to fix any/all of them
[ ] Figure out a way to "re-parse" them

Yvonne-Han commented 4 years ago

Step 1: Find calls with context mislabeled.

library(dplyr, warn.conflicts = FALSE)
library(tidyr)
library(DBI)
library(reprex)

pg <- dbConnect(RPostgres::Postgres())
rs <- dbExecute(pg, "SET search_path TO streetevents, se_features")

calls <- tbl(pg, "calls")
speaker_data <- tbl(pg, "speaker_data")

# Get a list of `context` number == 1 calls
single_context_calls <- speaker_data %>%
    select(file_name, last_update, context) %>%
    distinct() %>%
    group_by(file_name, last_update) %>%
    summarise(context_no = count(context)) %>%
    ungroup() %>%
    filter(context_no == 1) %>%
    compute()

# I guess earnings calls have a higher probability of having Q&As so focus on them
calls %>%
    filter(event_type == 1) %>%
    select(file_name, last_update) %>%
    inner_join(single_context_calls) %>%
    print(n=20)
#> Joining, by = c("file_name", "last_update")
#> # Source:   lazy query [?? x 3]
#> # Database: postgres [yanzih1@10.101.13.99:5432/crsp]
#>    file_name  last_update         context_no
#>    <chr>      <dttm>              <int64>   
#>  1 1000327_T  2005-02-25 08:19:17 1         
#>  2 1005723_T  2005-02-04 08:31:45 1         
#>  3 1005816_T  2005-04-08 08:34:05 1         
#>  4 1006557_T  2005-03-06 01:50:16 1         
#>  5 1008353_T  2005-02-04 15:22:32 1         
#>  6 10105557_T 2017-04-26 10:06:07 1         
#>  7 1014609_T  2005-02-10 11:16:16 1         
#>  8 1015714_T  2005-02-23 07:27:39 1         
#>  9 1019984_T  2005-04-22 12:17:29 1         
#> 10 10213025_T 2017-05-21 07:00:42 1         
#> 11 10216128_T 2017-05-01 10:45:57 1         
#> 12 10230244_T 2017-05-02 09:26:52 1         
#> 13 1025827_T  2005-05-14 02:17:52 1         
#> 14 10275117_T 2017-05-26 16:41:05 1         
#> 15 10296997_T 2017-05-19 01:01:01 1         
#> 16 10299673_T 2017-05-01 09:56:21 1         
#> 17 10313906_T 2017-05-02 01:04:18 1         
#> 18 1033494_T  2005-08-13 01:59:11 1         
#> 19 1034516_T  2005-03-22 08:21:24 1         
#> 20 10381332_T 2017-05-03 09:26:37 1         
#> # … with more rows

# After trying a few calls, here's one example (analysts participated but all labelled as `pres`):
speaker_data %>% 
    filter(file_name == "928740_T") %>%
    select(file_name, last_update, employer, role, speaker_number, context) %>%
    arrange(speaker_number) %>%
    print(n=20)
#> # Source:     lazy query [?? x 6]
#> # Database:   postgres [yanzih1@10.101.13.99:5432/crsp]
#> # Ordered by: speaker_number
#>    file_name last_update         employer  role      speaker_number context
#>    <chr>     <dttm>              <chr>     <chr>              <int> <chr>  
#>  1 928740_T  2004-08-20 08:40:12 ""        ""                     1 pres   
#>  2 928740_T  2004-08-20 08:40:12 Sycamore… Vice Pre…              2 pres   
#>  3 928740_T  2004-08-20 08:40:12 Sycamore… Presiden…              3 pres   
#>  4 928740_T  2004-08-20 08:40:12 Sycamore… Chief Fi…              4 pres   
#>  5 928740_T  2004-08-20 08:40:12 ""        ""                     5 pres   
#>  6 928740_T  2004-08-20 08:40:12 Sycamore… Analyst                6 pres   
#>  7 928740_T  2004-08-20 08:40:12 Sycamore… Chief Fi…              7 pres   
#>  8 928740_T  2004-08-20 08:40:12 Sycamore… Analyst                8 pres   
#>  9 928740_T  2004-08-20 08:40:12 Sycamore… Chief Fi…              9 pres   
#> 10 928740_T  2004-08-20 08:40:12 Sycamore… Analyst               10 pres   
#> 11 928740_T  2004-08-20 08:40:12 Sycamore… Presiden…             11 pres   
#> 12 928740_T  2004-08-20 08:40:12 Sycamore… Analyst               12 pres   
#> 13 928740_T  2004-08-20 08:40:12 ""        ""                    13 pres   
#> 14 928740_T  2004-08-20 08:40:12 Thomas W… Analyst               14 pres   
#> 15 928740_T  2004-08-20 08:40:12 Sycamore… Chief Fi…             15 pres   
#> 16 928740_T  2004-08-20 08:40:12 Thomas W… Analyst               16 pres   
#> 17 928740_T  2004-08-20 08:40:12 Sycamore… Presiden…             17 pres   
#> 18 928740_T  2004-08-20 08:40:12 Thomas W… Analyst               18 pres   
#> 19 928740_T  2004-08-20 08:40:12 Sycamore… Presiden…             19 pres   
#> 20 928740_T  2004-08-20 08:40:12 Thomas W… Analyst               20 pres   
#> # … with more rows

^{Created on 2020-04-20 by the reprex package (v0.3.0)}

iangow commented 4 years ago

Here I think archive_01/928740_T.xml is the file to look at.

In this case there is no delineation between Presentation and Q&A. One approach would be to handle calls without this delineation separately. Here is the line that normally handles this.

Sys.setenv(PGHOST = "10.101.13.99", PGDATABASE="crsp")

library(DBI)
library(dplyr, warn.conflicts = FALSE)

pg <- dbConnect(RPostgres::Postgres())

rs <- dbExecute(pg, "SET search_path TO streetevents")
call_files <- tbl(pg, "call_files")
call_files %>%  filter(file_name=="928740_T") %>% select(file_path)
#> # Source:   lazy query [?? x 1]
#> # Database: postgres [igow@10.101.13.99:5432/crsp]
#>   file_path                                                                
#>   <chr>                                                                    
#> 1 archive_01/928740_T.xml                                                  
#> 2 streetevents_project/StreetEvents_historical_backfill_through_May2013/di…

^{Created on 2020-04-20 by the reprex package (v0.3.0)}

iangow commented 4 years ago

I think "fixing" these files ourselves will be hard. For one thing, it looks like StreetEvents itself parsed these badly (I conjecture that they have a database and make the XML files from that database). You can see here that they've labeled the analysts as working for the covered company:

Sys.setenv(PGHOST = "10.101.13.99", PGDATABASE="crsp")

library(DBI)
library(dplyr, warn.conflicts = FALSE)

pg <- dbConnect(RPostgres::Postgres())

rs <- dbExecute(pg, "SET search_path TO streetevents")
speaker_data <- tbl(pg, "speaker_data")
speaker_data %>% 
    filter(file_name=="928740_T") %>% 
    select(section, context, speaker_name, 
           employer, role, speaker_number) %>% 
    arrange(section, speaker_number)
#> # Source:     lazy query [?? x 6]
#> # Database:   postgres [igow@10.101.13.99:5432/crsp]
#> # Ordered by: section, speaker_number
#>    section context speaker_name  employer   role             speaker_number
#>      <int> <chr>   <chr>         <chr>      <chr>                     <int>
#>  1       1 pres    Operator      ""         ""                            1
#>  2       1 pres    Terry Adams   Sycamore … Vice President …              2
#>  3       1 pres    Dan Smith     Sycamore … President, Chie…              3
#>  4       1 pres    Frances Jewe… Sycamore … Chief Financial…              4
#>  5       1 pres    Operator      ""         ""                            5
#>  6       1 pres    Michael Geno… Sycamore … Analyst                       6
#>  7       1 pres    Frances Jewe… Sycamore … Chief Financial…              7
#>  8       1 pres    Michael Geno… Sycamore … Analyst                       8
#>  9       1 pres    Frances Jewe… Sycamore … Chief Financial…              9
#> 10       1 pres    Michael Geno… Sycamore … Analyst                      10
#> # … with more rows

^{Created on 2020-04-20 by the reprex package (v0.3.0)}

Yvonne-Han commented 4 years ago

I think "fixing" these files ourselves will be hard. For one thing, it looks like StreetEvents itself parsed these badly (I conjecture that they have a database and make the XML files from that database). You can see here that they've labeled the analysts as working for the covered company.

If this is the case then indeed we probably can’t fix it ourselves. I will go look into more calls and see whether all of them look like this in the original XML files or if this one is just an outlier.

There’s probably no better ways to identify the mislabeled ones except for printing out the speaker_data and manually read through them... so it is also hard to figure out how many of the calls are mislabeled too.

iangow commented 4 years ago

I think "fixing" these files ourselves will be hard. For one thing, it looks like StreetEvents itself parsed these badly (I conjecture that they have a database and make the XML files from that database). You can see here that they've labeled the analysts as working for the covered company.

If this is the case then indeed we probably can’t fix it ourselves. I will go look into more calls and see whether all of them look like this in the original XML files or if this one is just an outlier.

There’s probably no better ways to identify the mislabeled ones except for printing out the speaker_data and manually read through them... so it is also hard to figure out how many of the calls are mislabeled too.

Pick a sample of ~10 from different periods (if applicable) and look at the original XML files (the issue I found would not be evident from speaker_data). Then we can make a call.

Yvonne-Han commented 4 years ago

I created the following list that may contain mislabeled calls.

library(dplyr, warn.conflicts = FALSE)
library(tidyr)
library(DBI)
library(reprex)

pg <- dbConnect(RPostgres::Postgres())
rs <- dbExecute(pg, "SET search_path TO streetevents, se_features")

calls <- tbl(pg, "calls")
speaker_data <- tbl(pg, "speaker_data")

# Get a list of `context` number == 1 calls
single_context_calls <- speaker_data %>%
    select(file_name, last_update, context) %>%
    distinct() %>%
    group_by(file_name, last_update) %>%
    summarise(context_no = count(context)) %>%
    ungroup() %>%
    filter(context_no == 1) %>%
    compute()

# Not sure every call in this list is mislabeled, but can be a good starting point
speaker_data %>% 
    right_join(single_context_calls) %>%
    filter(speaker_text %ILIKE% "%question%" | role %ILIKE% "%analyst%" | speaker_name %ILIKE% "%analyst%") %>%
    select(file_name, last_update) %>%
    distinct() %>%
    left_join(calls) %>%
    select(file_name, last_update, start_date) %>%
    mutate(year = year(last_update)) %>%
    group_by(year) %>%
    mutate(min_date_by_year = min(last_update)) %>%
    ungroup() %>%
    filter(last_update == min_date_by_year) %>%
    arrange(year) %>%
    select(year, file_name, start_date, last_update) %>%
    print(n=Inf)
#> Joining, by = c("file_name", "last_update")
#> Joining, by = c("file_name", "last_update")
#> Warning: Missing values are always removed in SQL.
#> Use `MIN(x, na.rm = TRUE)` to silence this warning
#> This warning is displayed only once per session.
#> # Source:     lazy query [?? x 4]
#> # Database:   postgres [yanzih1@10.101.13.99:5432/crsp]
#> # Ordered by: year
#>     year file_name  start_date          last_update        
#>    <dbl> <chr>      <dttm>              <dttm>             
#>  1  2002 608013_T   2002-03-13 01:30:00 2002-04-13 09:27:56
#>  2  2003 688122_T   2002-12-20 00:00:00 2003-01-04 03:27:32
#>  3  2004 830015_T   2004-01-06 00:30:00 2004-01-06 06:53:48
#>  4  2005 995284_T   2005-01-04 04:00:00 2005-01-04 04:34:39
#>  5  2006 1174887_T  2005-12-14 00:00:00 2006-01-04 04:13:59
#>  6  2007 1423865_T  2006-12-16 03:00:00 2007-01-03 06:01:14
#>  7  2008 1662203_T  2007-11-06 21:00:00 2008-01-03 02:35:52
#>  8  2009 2041123_T  2008-12-10 04:30:00 2009-01-06 01:27:13
#>  9  2010 2577925_T  2009-12-04 00:00:00 2010-01-04 15:12:27
#> 10  2011 3392021_T  2010-11-09 00:00:00 2011-01-04 00:33:27
#> 11  2012 3584085_T  2011-11-12 00:00:00 2012-01-04 01:53:06
#> 12  2013 4954470_T  2012-11-29 09:30:00 2013-01-03 02:20:03
#> 13  2014 5212687_T  2013-12-04 02:00:00 2014-01-03 03:56:35
#> 14  2015 5518887_T  2014-11-12 03:30:00 2015-01-03 04:04:49
#> 15  2016 5863718_T  2015-11-19 19:00:00 2016-01-05 01:10:09
#> 16  2017 7371001_T  2017-01-05 06:45:00 2017-01-05 09:31:43
#> 17  2018 11080110_T 2017-11-01 21:59:00 2018-01-02 06:59:54
#> 18  2019 12167592_T NA                  2019-01-02 01:11:20
#> 19  2020 12949854_T 2019-12-11 08:00:00 2020-01-02 03:24:54

^{Created on 2020-04-20 by the reprex package (v0.3.0)}

Yvonne-Han commented 4 years ago

After manually going through the speaker_data of the list, here is a list of 12 files from different time periods that are mislabeled. I've added another empty column description to document the findings from reading the XML files / speaker_data table.

@iangow You probably want to focus on the table below and see whether some of the cases can be fixed. My guess is, if possible, we might need to figure out how to re-parse those calls without pres and qa delineation but with analysts correctly identified?

year	file_name	start_date	last_update	description
2002	109803_T	2001-04-19 07:00:00	2002-05-28 23:54:44	No `pres` and `qa` delineation in the XML file; No other indications of a new `context` (i.e., speaker number does not restart from 1 for `qa`); The first few people who asked questions are "unknown speaker"s. The rest is correctly identified as `Analysts` in `speaker_data` with employers different from the host company.
2004	830015_T	2004-01-06 00:30:00	2004-01-06 06:53:48	No `pres` and `qa` delineation in the XML file No other indications of a new `context` (i.e., speaker number does not restart from 1 for `qa`) Analysts are correctly identified
2005	1000242_T	2005-01-28 09:30:00	2005-01-28 18:17:51	The delineation is called `Transcript` instead of `Q&A` in this file. The speaker number restarts from 1 when `qa` starts. Analysts are correctly identified. Note: In our `speaker_text`, we labelled them as two sections instead of two contexts.
2006	1174887_T	2005-12-14 00:00:00	2006-01-04 04:13:59	No `pres` and `qa` delineation in the XML file; No other indications of a new `context` (i.e., speaker number does not restart from 1 for `qa`); Analysts are correctly identified.
2008	1612127_T	2007-08-03 01:30:00	2008-02-23 08:09:33	No `pres` and `qa` delineation in the XML file; No other indications of a new `context` (i.e., speaker number does not restart from 1 for `qa`); Analysts are correctly identified.
2009	2041123_T	2008-12-10 04:30:00	2009-01-06 01:27:13	This one is more tricky as it is a `event_type ==7` call. An analyst start these calls but they still take questions from the audience. No `pres` and `qa` delineation in the XML file; No other indications of a new `context` (i.e., speaker number does not restart from 1 for `qa`); People who asked questions are "Unidentified Audience Member".
2010	2436211_T	2010-07-28 22:00:00	2010-07-29 05:22:52	No `pres` and `qa` delineation in the XML file; No other indications of a new `context` (i.e., speaker number does not restart from 1 for `qa`); Analysts are correctly identified.
2013	1013061_T	2005-03-04 03:00:00	2013-01-31 04:29:57	No `pres` and `qa` delineation in the XML file; No other indications of a new `context` (i.e., speaker number does not restart from 1 for `qa`); Analysts are correctly identified.
2014	5212687_T	2013-12-04 02:00:00	2014-01-03 03:56:35	Another `event_type ==7` call; No `pres` and `qa` delineation in the XML file; No other indications of a new `context` (i.e., speaker number does not restart from 1 for `qa`); People who asked questions are "Unidentified Audience Member".
2016	5830639_T	2016-07-20 18:00:00	2016-10-07 10:14:36	No `pres` and `qa` delineation in the XML file; No other indications of a new `context` (i.e., speaker number does not restart from 1 for `qa`); Analysts are correctly identified.
2018	10008059_T	NA	2018-08-31 10:07:07	Another `event_type ==7` call; StreetEvents updated their XML files again on 2019-07-02 and it now has `pres` and `qa` delineation. However, except for the first sentence "We are all set.", everything is labelled as `qa`; Not sure this is a common thing to do in `event_type==7` calls (probably irrelevant here).
2019	1066001_T	2005-05-10 23:45:00	2019-04-04 17:57:09	No `pres` and `qa` delineation in the XML file; No other indications of a new `context` (i.e., speaker number does not restart from 1 for `qa`); People who asked questions are "Unidentified Audience Member". Company managers, instead of the operator, took the questions.

Yvonne-Han commented 4 years ago

File paths for the above selected calls are shown below (most of them can be found in archive_01, if not all of them).

Sys.setenv(PGHOST = "10.101.13.99", PGDATABASE="crsp")

library(DBI)
library(dplyr, warn.conflicts = FALSE)

pg <- dbConnect(RPostgres::Postgres())

rs <- dbExecute(pg, "SET search_path TO streetevents")
calls <- tbl(pg, "calls")
call_files <- tbl(pg, "call_files")

selected_call_names = c("109803_T", "830015_T", "1000242_T", "1174887_T", "1612127_T",
                        "2041123_T", "2436211_T", "1013061_T", "5212687_T", "5830639_T",
                        "10008059_T", "1066001_T")

selected_calls <- calls %>% 
    filter(file_name %in% selected_call_names) %>%
    group_by(file_name) %>%
    mutate(latest_update = max(last_update)) %>%
    ungroup() %>%
    filter(last_update == latest_update) %>%
    select(-latest_update) 

selected_calls_path <- call_files %>%
    inner_join(selected_calls) %>%
    select(file_name, file_path) %>%
    arrange(file_name) %>%
    print(n=Inf)
#> Joining, by = "file_name"
#> Warning: Missing values are always removed in SQL.
#> Use `MAX(x, na.rm = TRUE)` to silence this warning
#> This warning is displayed only once per session.
#> # Source:     lazy query [?? x 2]
#> # Database:   postgres [yanzih1@10.101.13.99:5432/crsp]
#> # Ordered by: file_name
#>    file_name  file_path                                                    
#>    <chr>      <chr>                                                        
#>  1 1000242_T  streetevents_project/StreetEvents_historical_backfill_throug…
#>  2 1000242_T  archive_01/1000242_T.xml                                     
#>  3 10008059_T archive_01/10008059_T.xml                                    
#>  4 1013061_T  archive_01/1013061_T.xml                                     
#>  5 1013061_T  streetevents_project/StreetEvents_historical_backfill_throug…
#>  6 1066001_T  streetevents_project/StreetEvents_historical_backfill_throug…
#>  7 1066001_T  archive_01/1066001_T.xml                                     
#>  8 109803_T   archive_01/109803_T.xml                                      
#>  9 109803_T   streetevents_project/StreetEvents_historical_backfill_throug…
#> 10 1174887_T  streetevents_project/StreetEvents_historical_backfill_throug…
#> 11 1174887_T  archive_01/1174887_T.xml                                     
#> 12 1612127_T  streetevents_project/StreetEvents_historical_backfill_throug…
#> 13 1612127_T  archive_01/1612127_T.xml                                     
#> 14 2041123_T  streetevents_project/StreetEvents_historical_backfill_throug…
#> 15 2041123_T  archive_01/2041123_T.xml                                     
#> 16 2436211_T  streetevents_project/StreetEvents_historical_backfill_throug…
#> 17 2436211_T  archive_01/2436211_T.xml                                     
#> 18 5212687_T  streetevents_project/StreetEvents_Sept2013_through_Feb2014/5…
#> 19 5212687_T  archive_01/5212687_T.xml                                     
#> 20 5830639_T  archive_01/5830639_T.xml                                     
#> 21 5830639_T  streetevents_project/StreetEvents_Oct2016/5830639_T.xml      
#> 22 830015_T   streetevents_project/StreetEvents_historical_backfill_throug…
#> 23 830015_T   archive_01/830015_T.xml

^{Created on 2020-04-21 by the reprex package (v0.3.0)}

iangow commented 4 years ago

I think 1000242_T.xml might be the only one we would want to address by changing the parsing code. We could in principle do this by changing the following line

https://github.com/iangow/se_core/blob/0f4c5b73eeaa941e1405b2787c9854eb047108d5/import_speaker_data.R#L103

but the problem is that this line

https://github.com/iangow/se_core/blob/0f4c5b73eeaa941e1405b2787c9854eb047108d5/import_speaker_data.R#L91

implies that I found cases where Transcript provided a demarcation between the XML preamble and the presentation part of the call. So we'd probably have some ugly code that says "if there are Presentation and Transcript, but no Q&A, then assume that Transcript begins the Q&A". It might make sense to build up a sample of files with ====\nTranscript in them to be able to test this before implementing. Perhaps make a new issue for that, but don't do it just yet.

For the other cases, I think the best approach might be to tell ThomsonReuters that there are problems in the files and then see if they can fix them (it seems they did that with 10008059_T.xml). Perhaps make an issue for that too and then we can pursue.

If that's everything, perhaps close this issue.

Yvonne-Han commented 4 years ago

I think 1000242_T.xml might be the only one we would want to address by changing the parsing code. We could in principle do this by changing the following line

https://github.com/iangow/se_core/blob/0f4c5b73eeaa941e1405b2787c9854eb047108d5/import_speaker_data.R#L103

but the problem is that this line

https://github.com/iangow/se_core/blob/0f4c5b73eeaa941e1405b2787c9854eb047108d5/import_speaker_data.R#L91

implies that I found cases where Transcript provided a demarcation between the XML preamble and the presentation part of the call. So we'd probably have some ugly code that says "if there are Presentation and Transcript, but no Q&A, then assume that Transcript begins the Q&A". It might make sense to build up a sample of files with ====\nTranscript in them to be able to test this before implementing. Perhaps make a new issue for that, but don't do it just yet.

I see. Probably we need to figure out how common this is (my guess is not too many files are structured in this way but we will see).

For the other cases, I think the best approach might be to tell ThomsonReuters that there are problems in the files and then see if they can fix them (it seems they did that with 10008059_T.xml). Perhaps make an issue for that too and then we can pursue.

Oh, I didn't know we have that option. Yes, I think ThomsonReuters sometimes does go back to some previous calls and tries to fix them. For calls with multiple last_updates, I can see that there is usually some improvement when compared to the last updated one.

If that's everything, perhaps close this issue.

Two new issues are created accordingly, and I will close this one.