Closed Yvonne-Han closed 4 years ago
We may need to re-parse these. It may make sense to look at the original XML files. I can help you find those.
Last time Tony asked me about this, and I told him that I randomly found them in the shared folder in CyberDuck, but it is difficult to find the XML file for a given file_name
because there are too many of them. I hope there is a better way to locate them?
I will take my last sentence back.
Step 1: Find calls with `context` mislabeled.
library(dplyr, warn.conflicts = FALSE)
library(tidyr)
library(DBI)
library(reprex)
pg <- dbConnect(RPostgres::Postgres())
rs <- dbExecute(pg, "SET search_path TO streetevents, se_features")
calls <- tbl(pg, "calls")
speaker_data <- tbl(pg, "speaker_data")
# Get a list of `context` number == 1 calls
single_context_calls <- speaker_data %>%
  select(file_name, last_update, context) %>%
  distinct() %>%
  group_by(file_name, last_update) %>%
  summarise(context_no = count(context)) %>%
  ungroup() %>%
  filter(context_no == 1) %>%
  compute()
# I guess earnings calls have a higher probability of having Q&As so focus on them
calls %>%
  filter(event_type == 1) %>%
  select(file_name, last_update) %>%
  inner_join(single_context_calls) %>%
  print(n = 20)
#> Joining, by = c("file_name", "last_update")
#> # Source: lazy query [?? x 3]
#> # Database: postgres [yanzih1@10.101.13.99:5432/crsp]
#> file_name last_update context_no
#> <chr> <dttm> <int64>
#> 1 1000327_T 2005-02-25 08:19:17 1
#> 2 1005723_T 2005-02-04 08:31:45 1
#> 3 1005816_T 2005-04-08 08:34:05 1
#> 4 1006557_T 2005-03-06 01:50:16 1
#> 5 1008353_T 2005-02-04 15:22:32 1
#> 6 10105557_T 2017-04-26 10:06:07 1
#> 7 1014609_T 2005-02-10 11:16:16 1
#> 8 1015714_T 2005-02-23 07:27:39 1
#> 9 1019984_T 2005-04-22 12:17:29 1
#> 10 10213025_T 2017-05-21 07:00:42 1
#> 11 10216128_T 2017-05-01 10:45:57 1
#> 12 10230244_T 2017-05-02 09:26:52 1
#> 13 1025827_T 2005-05-14 02:17:52 1
#> 14 10275117_T 2017-05-26 16:41:05 1
#> 15 10296997_T 2017-05-19 01:01:01 1
#> 16 10299673_T 2017-05-01 09:56:21 1
#> 17 10313906_T 2017-05-02 01:04:18 1
#> 18 1033494_T 2005-08-13 01:59:11 1
#> 19 1034516_T 2005-03-22 08:21:24 1
#> 20 10381332_T 2017-05-03 09:26:37 1
#> # … with more rows
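A side note on the `summarise()` call above: `count()` works there only because dbplyr translates it to SQL `COUNT()`; in local R, `dplyr::count()` is not an aggregate function. A backend-agnostic version would use `n_distinct()`. A minimal sketch on made-up data (the toy table below is invented for illustration, not taken from the real `speaker_data`):

```r
library(dplyr, warn.conflicts = FALSE)

# Toy stand-in for speaker_data: call A_T has both contexts, call B_T
# has only "pres". These rows are invented for illustration.
speaker_toy <- tibble::tribble(
  ~file_name, ~last_update, ~context,
  "A_T",      "2005-01-01", "pres",
  "A_T",      "2005-01-01", "qa",
  "B_T",      "2005-01-01", "pres"
)

single_context_toy <- speaker_toy %>%
  group_by(file_name, last_update) %>%
  summarise(context_no = n_distinct(context), .groups = "drop") %>%
  filter(context_no == 1)

single_context_toy
```

Run against the remote table, `n_distinct(context)` translates to `COUNT(DISTINCT context)`, which also removes the need for the separate `distinct()` step.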
# After trying a few calls, here's one example (analysts participated but all labelled as `pres`):
speaker_data %>%
  filter(file_name == "928740_T") %>%
  select(file_name, last_update, employer, role, speaker_number, context) %>%
  arrange(speaker_number) %>%
  print(n = 20)
#> # Source: lazy query [?? x 6]
#> # Database: postgres [yanzih1@10.101.13.99:5432/crsp]
#> # Ordered by: speaker_number
#> file_name last_update employer role speaker_number context
#> <chr> <dttm> <chr> <chr> <int> <chr>
#> 1 928740_T 2004-08-20 08:40:12 "" "" 1 pres
#> 2 928740_T 2004-08-20 08:40:12 Sycamore… Vice Pre… 2 pres
#> 3 928740_T 2004-08-20 08:40:12 Sycamore… Presiden… 3 pres
#> 4 928740_T 2004-08-20 08:40:12 Sycamore… Chief Fi… 4 pres
#> 5 928740_T 2004-08-20 08:40:12 "" "" 5 pres
#> 6 928740_T 2004-08-20 08:40:12 Sycamore… Analyst 6 pres
#> 7 928740_T 2004-08-20 08:40:12 Sycamore… Chief Fi… 7 pres
#> 8 928740_T 2004-08-20 08:40:12 Sycamore… Analyst 8 pres
#> 9 928740_T 2004-08-20 08:40:12 Sycamore… Chief Fi… 9 pres
#> 10 928740_T 2004-08-20 08:40:12 Sycamore… Analyst 10 pres
#> 11 928740_T 2004-08-20 08:40:12 Sycamore… Presiden… 11 pres
#> 12 928740_T 2004-08-20 08:40:12 Sycamore… Analyst 12 pres
#> 13 928740_T 2004-08-20 08:40:12 "" "" 13 pres
#> 14 928740_T 2004-08-20 08:40:12 Thomas W… Analyst 14 pres
#> 15 928740_T 2004-08-20 08:40:12 Sycamore… Chief Fi… 15 pres
#> 16 928740_T 2004-08-20 08:40:12 Thomas W… Analyst 16 pres
#> 17 928740_T 2004-08-20 08:40:12 Sycamore… Presiden… 17 pres
#> 18 928740_T 2004-08-20 08:40:12 Thomas W… Analyst 18 pres
#> 19 928740_T 2004-08-20 08:40:12 Sycamore… Presiden… 19 pres
#> 20 928740_T 2004-08-20 08:40:12 Thomas W… Analyst 20 pres
#> # … with more rows
Created on 2020-04-20 by the reprex package (v0.3.0)
Here I think `archive_01/928740_T.xml` is the file to look at. In this case there is no delineation between Presentation and Q&A. One approach would be to handle calls without this delineation separately. Here is the line that normally handles this.
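For what it's worth, a heuristic of the kind suggested here could be sketched as follows. Everything about the raw-file format below (the `====` delimiter lines and the section titles) is an assumption for illustration, not taken from the actual parsing code:

```r
# Hypothetical sketch: pull section titles that follow a line of "="
# signs in the raw transcript text. The delimiter format here is an
# assumption for illustration only.
find_sections <- function(raw_text) {
  m <- regmatches(
    raw_text,
    gregexpr("(?m)^=+\n([A-Za-z &]+)$", raw_text, perl = TRUE)
  )[[1]]
  sub("^=+\n", "", m)
}

toy_call <- paste("preamble", "====", "Presentation", "remarks ...",
                  "====", "Questions and Answers", "Q&A ...", sep = "\n")
find_sections(toy_call)
# should return c("Presentation", "Questions and Answers")
```

A call like `928740_T` would presumably come back with fewer (or no) recognizable titles, which is exactly the case that would need separate handling.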
Sys.setenv(PGHOST = "10.101.13.99", PGDATABASE="crsp")
library(DBI)
library(dplyr, warn.conflicts = FALSE)
pg <- dbConnect(RPostgres::Postgres())
rs <- dbExecute(pg, "SET search_path TO streetevents")
call_files <- tbl(pg, "call_files")
call_files %>% filter(file_name=="928740_T") %>% select(file_path)
#> # Source: lazy query [?? x 1]
#> # Database: postgres [igow@10.101.13.99:5432/crsp]
#> file_path
#> <chr>
#> 1 archive_01/928740_T.xml
#> 2 streetevents_project/StreetEvents_historical_backfill_through_May2013/di…
Created on 2020-04-20 by the reprex package (v0.3.0)
I think "fixing" these files ourselves will be hard. For one thing, it looks like StreetEvents itself parsed these badly (I conjecture that they have a database and make the XML files from that database). You can see here that they've labeled the analysts as working for the covered company:
Sys.setenv(PGHOST = "10.101.13.99", PGDATABASE="crsp")
library(DBI)
library(dplyr, warn.conflicts = FALSE)
pg <- dbConnect(RPostgres::Postgres())
rs <- dbExecute(pg, "SET search_path TO streetevents")
speaker_data <- tbl(pg, "speaker_data")
speaker_data %>%
  filter(file_name == "928740_T") %>%
  select(section, context, speaker_name,
         employer, role, speaker_number) %>%
  arrange(section, speaker_number)
#> # Source: lazy query [?? x 6]
#> # Database: postgres [igow@10.101.13.99:5432/crsp]
#> # Ordered by: section, speaker_number
#> section context speaker_name employer role speaker_number
#> <int> <chr> <chr> <chr> <chr> <int>
#> 1 1 pres Operator "" "" 1
#> 2 1 pres Terry Adams Sycamore … Vice President … 2
#> 3 1 pres Dan Smith Sycamore … President, Chie… 3
#> 4 1 pres Frances Jewe… Sycamore … Chief Financial… 4
#> 5 1 pres Operator "" "" 5
#> 6 1 pres Michael Geno… Sycamore … Analyst 6
#> 7 1 pres Frances Jewe… Sycamore … Chief Financial… 7
#> 8 1 pres Michael Geno… Sycamore … Analyst 8
#> 9 1 pres Frances Jewe… Sycamore … Chief Financial… 9
#> 10 1 pres Michael Geno… Sycamore … Analyst 10
#> # … with more rows
Created on 2020-04-20 by the reprex package (v0.3.0)
> I think "fixing" these files ourselves will be hard. For one thing, it looks like StreetEvents itself parsed these badly (I conjecture that they have a database and make the XML files from that database). You can see here that they've labeled the analysts as working for the covered company.
If this is the case then indeed we probably can't fix it ourselves. I will look into more calls and see whether all of them look like this in the original XML files or whether this one is just an outlier.

There's probably no better way to identify the mislabeled ones than printing out `speaker_data` and manually reading through it, so it is also hard to figure out how many of the calls are mislabeled.
Pick a sample of ~10 from different periods (if applicable) and look at the original XML files (the issue I found would not be evident from `speaker_data`). Then we can make a call.
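One way to draw such a sample, sketched on a made-up candidate table (with a remote dbplyr table you would `collect()` it first, since random sampling is not translated to SQL):

```r
library(dplyr, warn.conflicts = FALSE)

# Made-up candidate list standing in for single_context_calls; the
# file names and years are invented for illustration.
candidates <- tibble::tibble(
  file_name = paste0(1000:1099, "_T"),
  year = rep(2002:2021, each = 5)
)

set.seed(2020)  # make the pick reproducible
sample_calls <- candidates %>%
  group_by(year) %>%
  sample_n(1) %>%   # one candidate per period
  ungroup() %>%
  sample_n(10)      # then trim to ~10 overall

sample_calls
```

Sampling one call per year first, then trimming, keeps the periods spread out rather than leaving the pick dominated by the years with the most candidates.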
I created the following list that may contain mislabeled calls.
library(dplyr, warn.conflicts = FALSE)
library(tidyr)
library(DBI)
library(reprex)
pg <- dbConnect(RPostgres::Postgres())
rs <- dbExecute(pg, "SET search_path TO streetevents, se_features")
calls <- tbl(pg, "calls")
speaker_data <- tbl(pg, "speaker_data")
# Get a list of `context` number == 1 calls
single_context_calls <- speaker_data %>%
  select(file_name, last_update, context) %>%
  distinct() %>%
  group_by(file_name, last_update) %>%
  summarise(context_no = count(context)) %>%
  ungroup() %>%
  filter(context_no == 1) %>%
  compute()
# Not sure every call in this list is mislabeled, but can be a good starting point
speaker_data %>%
  right_join(single_context_calls) %>%
  filter(speaker_text %ILIKE% "%question%" |
           role %ILIKE% "%analyst%" |
           speaker_name %ILIKE% "%analyst%") %>%
  select(file_name, last_update) %>%
  distinct() %>%
  left_join(calls) %>%
  select(file_name, last_update, start_date) %>%
  mutate(year = year(last_update)) %>%
  group_by(year) %>%
  mutate(min_date_by_year = min(last_update)) %>%
  ungroup() %>%
  filter(last_update == min_date_by_year) %>%
  arrange(year) %>%
  select(year, file_name, start_date, last_update) %>%
  print(n = Inf)
#> Joining, by = c("file_name", "last_update")
#> Joining, by = c("file_name", "last_update")
#> Warning: Missing values are always removed in SQL.
#> Use `MIN(x, na.rm = TRUE)` to silence this warning
#> This warning is displayed only once per session.
#> # Source: lazy query [?? x 4]
#> # Database: postgres [yanzih1@10.101.13.99:5432/crsp]
#> # Ordered by: year
#> year file_name start_date last_update
#> <dbl> <chr> <dttm> <dttm>
#> 1 2002 608013_T 2002-03-13 01:30:00 2002-04-13 09:27:56
#> 2 2003 688122_T 2002-12-20 00:00:00 2003-01-04 03:27:32
#> 3 2004 830015_T 2004-01-06 00:30:00 2004-01-06 06:53:48
#> 4 2005 995284_T 2005-01-04 04:00:00 2005-01-04 04:34:39
#> 5 2006 1174887_T 2005-12-14 00:00:00 2006-01-04 04:13:59
#> 6 2007 1423865_T 2006-12-16 03:00:00 2007-01-03 06:01:14
#> 7 2008 1662203_T 2007-11-06 21:00:00 2008-01-03 02:35:52
#> 8 2009 2041123_T 2008-12-10 04:30:00 2009-01-06 01:27:13
#> 9 2010 2577925_T 2009-12-04 00:00:00 2010-01-04 15:12:27
#> 10 2011 3392021_T 2010-11-09 00:00:00 2011-01-04 00:33:27
#> 11 2012 3584085_T 2011-11-12 00:00:00 2012-01-04 01:53:06
#> 12 2013 4954470_T 2012-11-29 09:30:00 2013-01-03 02:20:03
#> 13 2014 5212687_T 2013-12-04 02:00:00 2014-01-03 03:56:35
#> 14 2015 5518887_T 2014-11-12 03:30:00 2015-01-03 04:04:49
#> 15 2016 5863718_T 2015-11-19 19:00:00 2016-01-05 01:10:09
#> 16 2017 7371001_T 2017-01-05 06:45:00 2017-01-05 09:31:43
#> 17 2018 11080110_T 2017-11-01 21:59:00 2018-01-02 06:59:54
#> 18 2019 12167592_T NA 2019-01-02 01:11:20
#> 19 2020 12949854_T 2019-12-11 08:00:00 2020-01-02 03:24:54
Created on 2020-04-20 by the reprex package (v0.3.0)
After manually going through the `speaker_data` of the list, here is a list of 12 files from different time periods that are mislabeled. I've added another empty column `description` to document the findings from reading the XML files / `speaker_data` table.

@iangow You probably want to focus on the table below and see whether some of the cases can be fixed. My guess is, if possible, we might need to figure out how to re-parse those calls without `pres` and `qa` delineation but with analysts correctly identified?
year | file_name | start_date | last_update | description |
---|---|---|---|---|
2002 | 109803_T | 2001-04-19 07:00:00 | 2002-05-28 23:54:44 | |
2004 | 830015_T | 2004-01-06 00:30:00 | 2004-01-06 06:53:48 | |
2005 | 1000242_T | 2005-01-28 09:30:00 | 2005-01-28 18:17:51 | |
2006 | 1174887_T | 2005-12-14 00:00:00 | 2006-01-04 04:13:59 | |
2008 | 1612127_T | 2007-08-03 01:30:00 | 2008-02-23 08:09:33 | |
2009 | 2041123_T | 2008-12-10 04:30:00 | 2009-01-06 01:27:13 | |
2010 | 2436211_T | 2010-07-28 22:00:00 | 2010-07-29 05:22:52 | |
2013 | 1013061_T | 2005-03-04 03:00:00 | 2013-01-31 04:29:57 | |
2014 | 5212687_T | 2013-12-04 02:00:00 | 2014-01-03 03:56:35 | |
2016 | 5830639_T | 2016-07-20 18:00:00 | 2016-10-07 10:14:36 | |
2018 | 10008059_T | NA | 2018-08-31 10:07:07 | |
2019 | 1066001_T | 2005-05-10 23:45:00 | 2019-04-04 17:57:09 | |
File paths for the above selected calls are shown below (most of them, if not all, can be found in `archive_01`).
Sys.setenv(PGHOST = "10.101.13.99", PGDATABASE="crsp")
library(DBI)
library(dplyr, warn.conflicts = FALSE)
pg <- dbConnect(RPostgres::Postgres())
rs <- dbExecute(pg, "SET search_path TO streetevents")
calls <- tbl(pg, "calls")
call_files <- tbl(pg, "call_files")
selected_call_names <- c("109803_T", "830015_T", "1000242_T", "1174887_T", "1612127_T",
                         "2041123_T", "2436211_T", "1013061_T", "5212687_T", "5830639_T",
                         "10008059_T", "1066001_T")
selected_calls <- calls %>%
  filter(file_name %in% selected_call_names) %>%
  group_by(file_name) %>%
  mutate(latest_update = max(last_update)) %>%
  ungroup() %>%
  filter(last_update == latest_update) %>%
  select(-latest_update)
selected_calls_path <- call_files %>%
  inner_join(selected_calls) %>%
  select(file_name, file_path) %>%
  arrange(file_name) %>%
  print(n = Inf)
#> Joining, by = "file_name"
#> Warning: Missing values are always removed in SQL.
#> Use `MAX(x, na.rm = TRUE)` to silence this warning
#> This warning is displayed only once per session.
#> # Source: lazy query [?? x 2]
#> # Database: postgres [yanzih1@10.101.13.99:5432/crsp]
#> # Ordered by: file_name
#> file_name file_path
#> <chr> <chr>
#> 1 1000242_T streetevents_project/StreetEvents_historical_backfill_throug…
#> 2 1000242_T archive_01/1000242_T.xml
#> 3 10008059_T archive_01/10008059_T.xml
#> 4 1013061_T archive_01/1013061_T.xml
#> 5 1013061_T streetevents_project/StreetEvents_historical_backfill_throug…
#> 6 1066001_T streetevents_project/StreetEvents_historical_backfill_throug…
#> 7 1066001_T archive_01/1066001_T.xml
#> 8 109803_T archive_01/109803_T.xml
#> 9 109803_T streetevents_project/StreetEvents_historical_backfill_throug…
#> 10 1174887_T streetevents_project/StreetEvents_historical_backfill_throug…
#> 11 1174887_T archive_01/1174887_T.xml
#> 12 1612127_T streetevents_project/StreetEvents_historical_backfill_throug…
#> 13 1612127_T archive_01/1612127_T.xml
#> 14 2041123_T streetevents_project/StreetEvents_historical_backfill_throug…
#> 15 2041123_T archive_01/2041123_T.xml
#> 16 2436211_T streetevents_project/StreetEvents_historical_backfill_throug…
#> 17 2436211_T archive_01/2436211_T.xml
#> 18 5212687_T streetevents_project/StreetEvents_Sept2013_through_Feb2014/5…
#> 19 5212687_T archive_01/5212687_T.xml
#> 20 5830639_T archive_01/5830639_T.xml
#> 21 5830639_T streetevents_project/StreetEvents_Oct2016/5830639_T.xml
#> 22 830015_T streetevents_project/StreetEvents_historical_backfill_throug…
#> 23 830015_T archive_01/830015_T.xml
Created on 2020-04-21 by the reprex package (v0.3.0)
I think `1000242_T.xml` might be the only one we would want to address by changing the parsing code. We could in principle do this by changing the following line, but the problem is that this line implies that I found cases where `Transcript` provided a demarcation between the XML preamble and the presentation part of the call. So we'd probably have some ugly code that says "if there are `Presentation` and `Transcript`, but no Q&A, then assume that `Transcript` begins the Q&A". It might make sense to build up a sample of files with `====\nTranscript` in them to be able to test this before implementing. Perhaps make a new issue for that, but don't do it just yet.
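The rule described above could be sketched as a small helper. The title strings below are assumptions about the raw-file format, and this illustrates the proposed logic rather than the actual parser:

```r
# Hypothetical helper implementing the "ugly" rule: given the section
# titles found in a raw file, decide which section begins the Q&A.
qa_section_title <- function(titles) {
  if ("Questions and Answers" %in% titles) {
    "Questions and Answers"
  } else if (all(c("Presentation", "Transcript") %in% titles)) {
    # No explicit Q&A section: assume "Transcript" begins the Q&A
    "Transcript"
  } else {
    NA_character_  # no basis for locating a Q&A section
  }
}

qa_section_title(c("Presentation", "Questions and Answers"))
# -> "Questions and Answers"
qa_section_title(c("Presentation", "Transcript"))
# -> "Transcript"
```

Building the proposed test sample of `====\nTranscript` files first would show how often the fallback branch fires, and whether it ever fires wrongly.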
For the other cases, I think the best approach might be to tell ThomsonReuters that there are problems in the files and then see if they can fix them (it seems they did that with `10008059_T.xml`). Perhaps make an issue for that too and then we can pursue.
If that's everything, perhaps close this issue.
> I think `1000242_T.xml` might be the only one we would want to address by changing the parsing code. We could in principle do this by changing the following line, but the problem is that this line implies that I found cases where `Transcript` provided a demarcation between the XML preamble and the presentation part of the call. So we'd probably have some ugly code that says "if there are `Presentation` and `Transcript`, but no Q&A, then assume that `Transcript` begins the Q&A". It might make sense to build up a sample of files with `====\nTranscript` in them to be able to test this before implementing. Perhaps make a new issue for that, but don't do it just yet.
I see. Probably we need to figure out how common this is (my guess is not too many files are structured in this way but we will see).
> For the other cases, I think the best approach might be to tell ThomsonReuters that there are problems in the files and then see if they can fix them (it seems they did that with `10008059_T.xml`). Perhaps make an issue for that too and then we can pursue.
Oh, I didn't know we had that option. Yes, I think ThomsonReuters sometimes does go back to previous calls and try to fix them. For calls with multiple `last_update`s, I can see that there is usually some improvement in the most recently updated version compared with earlier ones.
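That pattern can be checked directly by counting contexts per `last_update` version of a call. A sketch on invented data (in practice this would run against `speaker_data`):

```r
library(dplyr, warn.conflicts = FALSE)

# Toy speaker_data with two versions (last_update values) of one call.
# The later version gains a "qa" context -- the kind of improvement
# described above. Rows are invented for illustration.
versions <- tibble::tribble(
  ~file_name, ~last_update, ~context,
  "X_T", "2005-01-01", "pres",
  "X_T", "2009-06-01", "pres",
  "X_T", "2009-06-01", "qa"
)

context_by_version <- versions %>%
  group_by(file_name, last_update) %>%
  summarise(contexts = n_distinct(context), .groups = "drop") %>%
  arrange(file_name, last_update)

context_by_version
```

Calls whose latest version still shows a single context would then be the ones worth reporting to ThomsonReuters.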
> If that's everything, perhaps close this issue.
Two new issues have been created accordingly, and I will close this one.
> Some of the calls only have one context (mostly being `pres` only):

Indeed, some of the calls only have one context. However, some of them can be due to mislabeling of contexts (i.e., after some point, the entries in `speaker_data` should be labeled as `qa` but were still marked as `pres`). Not sure how many of these can be identified and fixed.
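If relabeling were ever attempted, one heuristic would be to look for the operator's question-and-answer announcement and relabel everything from that point on. A sketch on invented data; the cutoff rule and the phrasing matched are assumptions, and false positives (e.g., an operator mentioning questions in the preamble) would need handling:

```r
library(dplyr, warn.conflicts = FALSE)

# Toy version of a mislabeled call: every entry would be tagged "pres",
# but the operator announces the Q&A partway through. Text is invented.
call <- tibble::tribble(
  ~speaker_number, ~speaker_name, ~speaker_text,
  1L, "Operator", "Welcome to the call.",
  2L, "CEO",      "Prepared remarks ...",
  3L, "Operator", "We will now begin the question-and-answer session.",
  4L, "Analyst",  "My first question is ..."
)

# Assumed heuristic: Q&A starts at the first operator turn that
# mentions "question".
qa_start <- call %>%
  filter(speaker_name == "Operator",
         grepl("question", speaker_text, ignore.case = TRUE)) %>%
  summarise(first_qa = min(speaker_number)) %>%
  pull(first_qa)

relabeled <- call %>%
  mutate(context = if_else(speaker_number >= qa_start, "qa", "pres"))
```

Even as a rough screen, this would at least flag where a `qa` section probably should have started, which could feed the `description` column in the table above.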