iangow / se_core

Core code for StreetEvents data

Investigate empty `role` in speaker_data #14

Open Yvonne-Han opened 4 years ago

Yvonne-Han commented 4 years ago

In table speaker_data, quite a few entries have missing role values. Some of them are missing because the speaker is an operator (which is fine).

After excluding the "Operator" case, we still have quite a few entries (~8%) that have missing role values.

> speaker_data %>%
+     filter(role == "" & speaker_name %NOT ILIKE% "%opera%") %>%
+     count()
# Source:   lazy query [?? x 1]
# Database: postgres [yanzih1@10.101.13.99:5432/crsp]
  n      
  <int64>
1 2418461

The top 10 categories for speaker_name associated with missing role are listed below:

> speaker_data %>%
+     filter(role == "" & speaker_name %NOT ILIKE% "%opera%") %>%
+     count(speaker_name) %>%
+     arrange(desc(n))
# Source:     lazy query [?? x 2]
# Database:   postgres [yanzih1@10.101.13.99:5432/crsp]
# Ordered by: desc(n)
   speaker_name                         n      
   <chr>                                <int64>
 1 Unidentified Audience Member         364384 
 2 Unidentified Participant             188161 
 3 Unidentified Company Representative  177457 
 4 Unidentified Analyst,                111578 
 5 Unidentified Speaker                  74929 
 6 Unidentified                          65018 
 7 Unidentified Participant,             60734 
 8 Unidentified Company Representative,  44342 
 9 Analyst,                              33588 
10 Thomson Reuters Media,                23484 

I think we can at least fix some of the missing role values. For instance, we can easily label rows 4 and 9 ("Unidentified Analyst," and "Analyst,") as analysts and row 3 ("Unidentified Company Representative") as a company employee.
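A rough sketch of that relabelling, in base R so it runs without a database connection. The helper name `infer_role` and the labels "analyst" and "company" are placeholders for illustration only; the values actually used in the role column are not shown above, so they would need to be swapped for whatever speaker_data already uses. The toy data frame just reuses a few speaker_name values from the table.

```r
# Hypothetical helper: guess a role from speaker_name when role is empty.
# The labels "analyst" and "company" are assumed placeholders, not values
# confirmed to exist in speaker_data.
infer_role <- function(speaker_name, role) {
  # Drop a trailing comma and surrounding whitespace: "Analyst," -> "Analyst"
  name <- trimws(sub(",\\s*$", "", speaker_name))

  # Only touch rows whose role is currently empty or NA
  missing <- is.na(role) | role == ""

  is_analyst <- grepl("analyst", name, ignore.case = TRUE)
  is_company <- grepl("company representative", name, ignore.case = TRUE)

  role[missing & is_analyst] <- "analyst"
  role[missing & !is_analyst & is_company] <- "company"
  role
}

# Toy example using speaker_name values from the table above
speakers <- data.frame(
  speaker_name = c("Unidentified Analyst,", "Analyst,",
                   "Unidentified Company Representative",
                   "Unidentified Participant", "Operator"),
  role = c("", "", "", "", ""),
  stringsAsFactors = FALSE
)
speakers$role <- infer_role(speakers$speaker_name, speakers$role)
```

Rows that match neither pattern (for example "Unidentified Participant") keep their empty role, and rows that already have a role are left untouched.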

iangow commented 4 years ago

I think "unidentified" cases will be impossible to handle with the current code. But I think in most of these cases, we could figure out whether the speaker is an analyst or with the company. I wouldn't change the call-parsing code to achieve this unless we can extract information that we're currently missing.