iangow / se_features

Linguistic features derived from StreetEvents

Add code for creating topic table #30

Closed: Yvonne-Han closed this issue 4 years ago

Yvonne-Han commented 4 years ago

This issue tracks adding code to create the corresponding topic_measure tables. For testing the new functions in the new package, see the original post in #26.

I think the table for "topic" should be kls_domains. You could adapt the code from the other folders, but you probably want CREATE TABLE functionality like that in the NER code if that isn't there already.

_Originally posted by @iangow in https://github.com/iangow/se_features/issues/26#issuecomment-629275167_
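As a rough illustration of the "CREATE TABLE functionality" mentioned above, here is a minimal Python sketch that builds an idempotent CREATE TABLE statement. It is not the code in topic_run.py or the NER code: the helper name create_table_sql is hypothetical, the column names are read off the preview of se_features.kls_domain later in this thread, and the SQL types (in particular plain timestamp for last_update) are assumptions.

```python
# Sketch only: column names come from the kls_domain preview in this thread;
# the SQL types and the helper name are assumptions, not code from the repo.
KEY_COLUMNS = [
    ("file_name", "text"),
    ("last_update", "timestamp"),  # assumed; could be timestamp with time zone
    ("speaker_number", "integer"),
    ("context", "text"),
    ("section", "integer"),
]

# One boolean indicator column per topic, as shown in the preview.
TOPIC_COLUMNS = [
    "market", "competition", "industry_structure", "strategic_intent",
    "innovation_and_r_d", "mode_of_entry", "business_model", "partnerships",
    "leadership", "management_quality", "governance", "disclosure",
    "measures", "customer", "brand", "media", "advertising",
    "corporate_image", "financial_performance", "forecasting",
    "insider_stock_transactions", "regulation", "special_interest_groups",
]


def create_table_sql(schema="se_features", table="kls_domain"):
    """Return a CREATE TABLE IF NOT EXISTS statement for the topic table."""
    columns = KEY_COLUMNS + [(name, "boolean") for name in TOPIC_COLUMNS]
    body = ",\n".join(f"    {name} {sql_type}" for name, sql_type in columns)
    return f"CREATE TABLE IF NOT EXISTS {schema}.{table} (\n{body}\n)"


if __name__ == "__main__":
    print(create_table_sql())
```

Using IF NOT EXISTS means the statement can be issued at the start of every run without failing once the table is already there; the resulting string would be passed to whatever database connection the script already uses.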

Yvonne-Han commented 4 years ago

I've added code (topic_run.py) that creates the kls_domain table in se_features in commit a6072d2, followed by another commit that replaces topic_functions with the functions from the new package.

@iangow I've checked on a small sample (n = 3 files) and the code should work fine. However, I want to keep this issue open until we have run topic_run.py on all calls in StreetEvents (probably over the next weekend?).

Yvonne-Han commented 4 years ago

Running topic_run.py now (2020-05-23 22:38:37 AEST).

Yvonne-Han commented 4 years ago

@iangow I'm closing this issue now. See below for a preview of se_features.kls_domain.

library(dplyr, warn.conflicts = FALSE)
library(DBI)
library(reprex)

pg <- dbConnect(RPostgres::Postgres())
rs <- dbExecute(pg, "SET search_path TO se_features")
rs <- dbExecute(pg, "SET work_mem TO '5GB'")

kls_domain <- tbl(pg, "kls_domain")

kls_domain
#> # Source:   table<kls_domain> [?? x 28]
#> # Database: postgres [yanzih1@10.101.13.99:5432/crsp]
#>    file_name last_update         speaker_number context section market
#>    <chr>     <dttm>                       <int> <chr>     <int> <lgl> 
#>  1 3117755_T 2010-05-26 01:09:59             27 qa            1 FALSE 
#>  2 3117755_T 2010-05-26 01:09:59             26 qa            1 FALSE 
#>  3 11944816… 2018-11-08 13:30:45             72 qa            1 FALSE 
#>  4 3117755_T 2010-05-26 01:09:59             25 qa            1 FALSE 
#>  5 11944816… 2018-11-08 13:30:45             71 qa            1 FALSE 
#>  6 11944816… 2018-11-08 13:30:45             70 qa            1 FALSE 
#>  7 3117755_T 2010-05-26 01:09:59             24 qa            1 FALSE 
#>  8 11944816… 2018-11-08 13:30:45             69 qa            1 FALSE 
#>  9 3117755_T 2010-05-26 01:09:59             23 qa            1 TRUE  
#> 10 11944816… 2018-11-08 13:30:45             68 qa            1 TRUE  
#> # … with more rows, and 22 more variables: competition <lgl>,
#> #   industry_structure <lgl>, strategic_intent <lgl>,
#> #   innovation_and_r_d <lgl>, mode_of_entry <lgl>, business_model <lgl>,
#> #   partnerships <lgl>, leadership <lgl>, management_quality <lgl>,
#> #   governance <lgl>, disclosure <lgl>, measures <lgl>, customer <lgl>,
#> #   brand <lgl>, media <lgl>, advertising <lgl>, corporate_image <lgl>,
#> #   financial_performance <lgl>, forecasting <lgl>,
#> #   insider_stock_transactions <lgl>, regulation <lgl>,
#> #   special_interest_groups <lgl>

kls_domain %>%
  select(file_name) %>%
  distinct() %>% 
  count()
#> # Source:   lazy query [?? x 1]
#> # Database: postgres [yanzih1@10.101.13.99:5432/crsp]
#>   n      
#>   <int64>
#> 1 474207

Created on 2020-05-24 by the reprex package (v0.3.0)
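For readers working from Python rather than R, the distinct-call count in the reprex above corresponds roughly to a COUNT over a SELECT DISTINCT subquery. The sketch below is only an illustration: it uses an in-memory SQLite database as a stand-in for the PostgreSQL server, a two-column toy version of the table, and made-up file names; the function name count_distinct_files is hypothetical.

```python
import sqlite3


def count_distinct_files(conn, table="kls_domain"):
    """Count distinct file_name values, mirroring select/distinct/count in dplyr."""
    sql = f"SELECT COUNT(*) FROM (SELECT DISTINCT file_name FROM {table}) AS q"
    return conn.execute(sql).fetchone()[0]


# Toy stand-in for se_features.kls_domain; the rows are invented for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE kls_domain (file_name TEXT, speaker_number INTEGER)")
conn.executemany(
    "INSERT INTO kls_domain VALUES (?, ?)",
    [("call_a", 1), ("call_a", 2), ("call_b", 1)],
)

print(count_distinct_files(conn))  # prints 2: two distinct calls in the toy data
```

Against the real table, the same query would be run over a PostgreSQL connection with the search path set to se_features, as in the R code.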