iangow / se_core

Core code for StreetEvents data
7 stars 5 forks source link

Document new data structure #5

Closed iangow closed 4 years ago

iangow commented 5 years ago

See comments in #4 for more details. The end user should only need to know about calls, company_ids and speaker_data. We should document the other tables (call_files and calls_raw), but in a separate section for "advanced" users. The documentation should (in the first instance) go in readme.md in the root directory of the repository. If it gets too complicated, we can break this apart.

iangow commented 5 years ago

From an email I sent (could be recycled for the README):

The tables you want to use are:

iangow commented 5 years ago

@azakolyukina

I have implemented the changes above on your server too. Please advise those who you think need to know (e.g., @jingyujzhang). There was some code you were using recently where I gave you a code snippet to accomplish this as a hack ... if you could direct me to that code, we could test that transitioning to the new table structure doesn't break anything.

azakolyukina commented 5 years ago

@iangow

I will let Vadim and Jingyu know under big5

The code that uses hack is create_executive_features. R and create_executive_roles.R.

iangow commented 5 years ago

I re-ran code and got same results.

library(DBI)
library(dplyr, warn.conflicts = FALSE)
library(tidyr)

pg <- dbConnect(RPostgres::Postgres())
rs <- dbExecute(pg, "SET work_mem TO '10GB'") # Speeds-up creation of temporary tables

# Project schema
rs <- dbExecute(pg, "SET search_path TO big5, public")
executive_features <- tbl(pg, "executive_features")
executive_features_old <- tbl(pg, "executive_features_old")

executive_features %>% 
    anti_join(executive_features_old, 
              by=c("company_id", "executive_id", "file_name")) %>%
    count() %>%
    pull() == 0L
#> [1] TRUE

executive_features_old %>% 
    anti_join(executive_features, 
              by=c("company_id", "executive_id", "file_name")) %>%
    count() %>%
    pull() == 0L
#> [1] TRUE

executive_features_old %>% 
    inner_join(executive_features, 
              by=c("company_id", "executive_id", "file_name")) %>%
    summarize_at(vars(matches("_qa")), funs(mean(., na.rm = TRUE))) %>%
    collect() %>%
    gather() %>% 
    arrange(key) %>% 
    print(n = Inf)
#> # A tibble: 70 x 2
#>    key                           value
#>    <chr>                         <dbl>
#>  1 adverb_qa.x                 94.6   
#>  2 adverb_qa.y                 94.6   
#>  3 anger_qa.x                   1.07  
#>  4 anger_qa.y                   1.07  
#>  5 anx_qa.x                     1.44  
#>  6 anx_qa.y                     1.44  
#>  7 article_qa.x               124.    
#>  8 article_qa.y               124.    
#>  9 assent_qa.x                  7.10  
#> 10 assent_qa.y                  7.10  
#> 11 cause_qa.x                  27.5   
#> 12 cause_qa.y                  27.5   
#> 13 certain_qa.x                22.1   
#> 14 certain_qa.y                22.1   
#> 15 conj_qa.x                  113.    
#> 16 conj_qa.y                  113.    
#> 17 discrep_qa.x                23.9   
#> 18 discrep_qa.y                23.9   
#> 19 excl_qa.x                   41.9   
#> 20 excl_qa.y                   41.9   
#> 21 frac_times_spoke_qa.x        2.70  
#> 22 frac_times_spoke_qa.y        2.70  
#> 23 frac_words_spoke_qa.x     1520.    
#> 24 frac_words_spoke_qa.y     1520.    
#> 25 generalisations_gklz_qa.x   27.5   
#> 26 generalisations_gklz_qa.y   27.5   
#> 27 genknlref_lz_qa.x            2.03  
#> 28 genknlref_lz_qa.y            2.03  
#> 29 hesit_lz_qa.x                0.175 
#> 30 hesit_lz_qa.y                0.175 
#> 31 i_qa.x                      26.5   
#> 32 i_qa.y                      26.5   
#> 33 incl_qa.x                  139.    
#> 34 incl_qa.y                  139.    
#> 35 inhib_qa.x                   5.38  
#> 36 inhib_qa.y                   5.38  
#> 37 insight_qa.x                38.0   
#> 38 insight_qa.y                38.0   
#> 39 negate_qa.x                 19.4   
#> 40 negate_qa.y                 19.4   
#> 41 negemoextr_lz_qa.x           2.93  
#> 42 negemoextr_lz_qa.y           2.93  
#> 43 numbers_qa.x                14.6   
#> 44 numbers_qa.y                14.6   
#> 45 posemoextr_lz_qa.x           7.89  
#> 46 posemoextr_lz_qa.y           7.89  
#> 47 posemone_lz_qa.x            50.8   
#> 48 posemone_lz_qa.y            50.8   
#> 49 qualifiers_gklz_qa.x        46.8   
#> 50 qualifiers_gklz_qa.y        46.8   
#> 51 quant_qa.x                  57.4   
#> 52 quant_qa.y                  57.4   
#> 53 sad_qa.x                     2.78  
#> 54 sad_qa.y                     2.78  
#> 55 swear_qa.x                   0.0705
#> 56 swear_qa.y                   0.0705
#> 57 tentat_qa.x                 47.2   
#> 58 tentat_qa.y                 47.2   
#> 59 thanks_qa.x                  4.67  
#> 60 thanks_qa.y                  4.67  
#> 61 vague_quantifiers_qa.x      35.7   
#> 62 vague_quantifiers_qa.y      35.7   
#> 63 we_qa.x                     91.2   
#> 64 we_qa.y                     91.2   
#> 65 words_spoke_qa.x          1299.    
#> 66 words_spoke_qa.y          1299.    
#> 67 words6_qa.x                423.    
#> 68 words6_qa.y                423.    
#> 69 wps_qa.x                    21.2   
#> 70 wps_qa.y                    21.2

Created on 2019-01-03 by the reprex package (v0.2.1)

iangow commented 5 years ago
postgres=# CREATE TABLE big5.executive_features_old  AS 
SELECT * FROM big5.executive_features;
SELECT 316559

Followed by

postgres=# DROP TABLE big5.executive_features_old;
iangow commented 5 years ago

There is one discrepancy in executive_roles. Not sure why ... but I figure it being a nonpers executive, we are probably good here.

library(DBI)
library(dplyr, warn.conflicts = FALSE)
library(tidyr)

pg <- dbConnect(RPostgres::Postgres())
rs <- dbExecute(pg, "SET work_mem TO '10GB'") # Speeds-up creation of temporary tables

# Project schema
rs <- dbExecute(pg, "SET search_path TO big5, public")
executive_roles <- tbl(pg, "executive_roles")
executive_roles_old <- tbl(pg, "executive_roles_old")

executive_roles %>% 
    anti_join(executive_roles_old, 
              by=c("company_id", "executive_id", "file_name")) %>%
    count() %>%
    pull() == 0L
#> [1] TRUE

executive_roles_old %>% 
    anti_join(executive_roles, 
              by=c("company_id", "executive_id", "file_name")) 
#> # Source:   lazy query [?? x 7]
#> # Database: postgres [igow@aa.chicagobooth.edu:5432/postgres]
#>   company_id executive_id file_name start_date          is_ceo is_cfo
#>        <int>        <int> <chr>     <dttm>              <lgl>  <lgl> 
#> 1       8633      1169002 5624297_T 2015-03-12 21:00:00 FALSE  TRUE  
#> # ... with 1 more variable: source <chr>

executive_roles_old %>% 
    select(-start_date) %>%
    anti_join(executive_roles)
#> Joining, by = c("company_id", "executive_id", "file_name", "is_ceo", "is_cfo", "source")
#> # Source:   lazy query [?? x 6]
#> # Database: postgres [igow@aa.chicagobooth.edu:5432/postgres]
#>   company_id executive_id file_name is_ceo is_cfo source 
#>        <int>        <int> <chr>     <lgl>  <lgl>  <chr>  
#> 1       8633      1169002 5624297_T FALSE  TRUE   nonpers

executive_roles %>% 
    select(-start_date) %>%
    anti_join(executive_roles_old)
#> Joining, by = c("company_id", "executive_id", "file_name", "is_ceo", "is_cfo", "source")
#> # Source:   lazy query [?? x 6]
#> # Database: postgres [igow@aa.chicagobooth.edu:5432/postgres]
#> # ... with 6 variables: company_id <int>, executive_id <int>,
#> #   file_name <chr>, is_ceo <lgl>, is_cfo <lgl>, source <chr>

Created on 2019-01-03 by the reprex package (v0.2.1)

iangow commented 5 years ago

@danielacarrasco Sorry about the chatter on this issue, which relates to "downstream" code using these data. But it would be good to update the documentation (the original reason for this issue).

danielacarrasco commented 5 years ago

So, I have a really basic question. When I run this on Rstudio it all seems to work fine. However, when I follow the ReadMe from Github and do it through the terminal I still have access issues:

Error in postgresqlpqExec(new.con, sql4) : RS-DBI driver: (could not Retrieve the result : ERROR: permission denied for relation call_files ) Calls: lapply ... dbWriteTable -> postgresqlWriteTable -> postgresqlpqExec -> .Call Execution halted

Do you know why this is happening? I've done it by running update_se.sh and just the create_call_files.R and have the same problem.

iangow commented 5 years ago

This is hard to address in the abstract. Are you using the same user ID in both places? Are you able to pose more output (and associated inputs) here?

danielacarrasco commented 5 years ago

I am using different IDs. I define:

PGHOST="10.101.13.99"
PGDATABASE="crsp"
PGUSER="dcarrasco"
PGPASSWORD=xxxxxxx
USER="carrascod"

I define user because it asks for the password associated with the server 45.113.235.201

This is what I get after running update_se.sh

receiving incremental file list
rsync: change_dir "/home/carrascod/uploads" failed: No such file or directory (2)

sent 8 bytes  received 95 bytes  7.63 bytes/sec
total size is 0  speedup is 0.00
rsync error: some files/attrs were not transferred (see previous errors) (code 23) at main.c(1668) [Receiver=3.1.2]
rsync: [Receiver] write error: Broken pipe (32)
importing call_files
Loading required package: methods
Loading required package: DBI
Joining, by = c("file_path", "mtime")
Joining, by = "file_path"
Error in postgresqlpqExec(new.con, sql4) : 
  RS-DBI driver: (could not Retrieve the result : ERROR:  permission denied for relation call_files
)
Calls: lapply ... dbWriteTable -> postgresqlWriteTable -> postgresqlpqExec -> .Call
Execution halted
importing calls
Loading required package: methods
Loading required package: DBI
[1] 821601
importing speaker_data
Importing speaker data.Loading required package: methods
Error in result_create(conn@ptr, statement) : 
  Failed to fetch row: ERROR:  permission denied for relation calls
Calls: system.time ... new -> initialize -> initialize -> result_create -> .Call
Timing stopped at: 0.152 0.004 0.169
Execution halted

I see there is a problem with the directory, but I think the main issue is the permission. Hope this helps.

danielacarrasco commented 5 years ago

I should mention that I've downloaded the files manually (instead of doing in through the .sh) and then ran the create_call_files.R separately and have got the same message minus the first 7 lines:

Loading required package: methods
Loading required package: DBI
Joining, by = c("file_path", "mtime")
Joining, by = "file_path"
Error in postgresqlpqExec(new.con, sql4) : 
  RS-DBI driver: (could not Retrieve the result : ERROR:  permission denied for relation call_files
)
Calls: lapply ... dbWriteTable -> postgresqlWriteTable -> postgresqlpqExec -> .Call
Execution halted
importing calls
Loading required package: methods
Loading required package: DBI
[1] 821601
importing speaker_data
Importing speaker data.Loading required package: methods
Error in result_create(conn@ptr, statement) : 
  Failed to fetch row: ERROR:  permission denied for relation calls
Calls: system.time ... new -> initialize -> initialize -> result_create -> .Call
Timing stopped at: 0.152 0.004 0.169
Execution halted
danielacarrasco commented 5 years ago

Sorry for all the messages.

Not sure what worked, but I have now managed to run:

I tried running create_calls.R and got the following:

Loading required package: methods
Error in result_create(conn@ptr, statement) : 
  Failed to fetch row: ERROR:  must be owner of relation calls
Calls: dbExecute ... new -> initialize -> initialize -> result_create -> .Call
Execution halted

I'll check the code of create_calls.R and see where the error is coming from.

iangow commented 5 years ago

There was an inconsistency in the database that was causing the code to stop running (on my end) before the permissions were set up properly. I have almost fixed this now.

iangow commented 5 years ago

I think it should be good to go now.

danielacarrasco commented 5 years ago

I've tried it from home last night and now and it doesn't get past the second line Loading required package: DBI when running create_call_files.R. Could it be my internet? If so, I'll go to uni later and try it there.

UPDATE: I'm at uni, connected with the cable, and the same is happening. Could it be some issue with the server?

iangow commented 5 years ago

@danielacarrasco It's very hard to diagnose the issue without additional details. You should be able to run R code line by line ... does that help identify the source of the issue?

danielacarrasco commented 5 years ago

Sorry, I ran it again (create_call_files.R) now and still get the permissions issue

Loading required package: methods
Updating data on 10.101.13.99 
Joining, by = c("file_path", "mtime")
Joining, by = "file_path"
Error in postgresqlpqExec(new.con, sql4) : 
  RS-DBI driver: (could not Retrieve the result : ERROR:  permission denied for relation call_files
)
Calls: lapply ... dbWriteTable -> postgresqlWriteTable -> postgresqlpqExec -> .Call
Execution halted

When doing it line by line, this is where it breaks (last part of the code):

if (dim(new_files_plus)[1]>0) {

    new_file_dfs <- split_df(new_files_plus, n = 1000)

    rs <- lapply(new_file_dfs, process_rows)
}

EDIT: This is when I run create_call_files.R now. Before I was having permission issues but when running create_calls.R , which was after create_call_files.R and import_calls_raw.R. I'll run the old version of create_call_files.R tomorrow and see if it still works.

iangow commented 5 years ago

@danielacarrasco I logged into RStudio using your account and I think I have sorted this out. I had to make some tweaks to the code (I committed these under your ID; see here) to get this to work.

iangow commented 5 years ago

Here is what I ran in the "Terminal" in RStudio Server (under your account). We have see ssh_exchange_identification issues before; they miraculously solved themselves last time. So hopefully they do so again (this means that I can't sync my server with the server with the files due to some IP issue).

dcarrasco@igow-z640:~/se_core$ export PGDATABASE=crsp
dcarrasco@igow-z640:~/se_core$ export USER=carrascod
dcarrasco@igow-z640:~/se_core$ export PGHOST=localhost
dcarrasco@igow-z640:~/se_core$
dcarrasco@igow-z640:~/se_core$ ./update_se.sh
ssh_exchange_identification: read: Connection reset by peer
rsync: connection unexpectedly closed (0 bytes received so far) [Receiver]
rsync error: unexplained error (code 255) at io.c(235) [Receiver=3.1.2]
Importing call_files
localhost
Updating data on localhost
Importing calls
Loading required package: DBI
Updating data on localhost
[1] 822492
Importing speaker_data
Importing speaker data.
   user  system elapsed
  0.170   0.016  61.133
dcarrasco@igow-z640:~/se_core$
danielacarrasco commented 5 years ago

Sorry for the delay. It now works for me. I'll expand the README this morning. I have one question regarding accessing the tables. How can they be downloaded? I assume through SQL, but how exactly?

iangow commented 5 years ago

I can't imagine that you'd ever want to "download" these tables. For R, maybe read this.

iangow commented 5 years ago

You might "download" portions of the data using collect(). The material at the link above covers this.

danielacarrasco commented 5 years ago

Thanks. I guess I used the wrong term, I should have said access it. Anyway, thanks for the link, it's very useful.

danielacarrasco commented 5 years ago

@iangow I can't run the code using --vanilla, i.e. Rscript --vanilla create_call_files.R does not work, but Rscript create_call_files.R does. If this is the case for other users I should change the .sh file.

Also, I see you created a CRON job for downloading the files. Should I remove that step from the Readme?

iangow commented 5 years ago

Adding --vanilla ensures that the environment variables come from the environment running R, not from ~/.Rprofile. If it works without --vanilla, then you probably have environment variables set in ~/.Rprofile that you need to set before running the script.

In the cron job for this code, I have:

#!/usr/bin/env bash
export PGHOST=localhost
export PGDATABASE=crsp
export PGPASSWORD=**********
export PGUSER=igow
export CODE_DIR=/home/igow/git/se_core
cd $CODE_DIR
$CODE_DIR/update_se.sh

Regarding the README.md file, the primary user should be the user of the data. This user doesn't need to know about the download steps. Put those in a later section of the readme directed at "advanced" users (those who will download and process data themselves).

iangow commented 4 years ago

@Yvonne-Han Once @yuchenliu4 accepts the invitation to join this repository, I think it makes sense to check that there aren't insights in Yuchen's HRM project that could included here. The idea is for the "readme" to be a short note useful for people new to the data.

Yvonne-Han commented 4 years ago

@iangow I see. Will do.

Sent with GitHawk