Closed iangow closed 4 years ago
From an email I sent (could be recycled for the README):
The tables you want to use are:

- calls: Primary key is (file_name, last_update). This table contains metadata on the calls (e.g., event description, company name, city, start time).
- company_ids: Primary key is (file_name, last_update). This contains data on company identifiers. Note that we only have these variables for data supplied through the UniMelb subscription. Because StreetEvents has a (terrible) habit of deleting data for companies that cease to exist, I have retained older data that I acquired at Harvard, but these files do not have company IDs (such as CUSIP, ISIN).
- speaker_data: Primary key is (file_name, last_update, speaker_number, context, section), where context is either "pres" (presentation) or "qa" (questions and answers) and section is an integer used to distinguish portions of the call where there are multiple presentation and Q&A portions (with speaker_number starting from 1 in each new section).

@azakolyukina
I have implemented the changes above on your server too. Please advise those who you think need to know (e.g., @jingyujzhang). There was some code you were using recently where I gave you a code snippet to accomplish this as a hack ... if you could direct me to that code, we could test that transitioning to the new table structure doesn't break anything.
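As a sketch of how these keys fit together, here is a self-contained illustration using toy in-memory tibbles rather than the actual database tables (all file names and values below are made up):

```r
library(dplyr, warn.conflicts = FALSE)

# Toy stand-ins for the calls and speaker_data tables described above.
calls <- tibble(
  file_name   = c("1234_T", "5678_T"),
  last_update = as.Date(c("2018-01-01", "2018-02-01")),
  event_desc  = c("Q4 2017 Earnings Call", "Q1 2018 Earnings Call")
)

speaker_data <- tibble(
  file_name      = c("1234_T", "1234_T", "1234_T"),
  last_update    = as.Date("2018-01-01"),
  context        = c("pres", "qa", "qa"),
  section        = c(1L, 1L, 1L),
  speaker_number = c(1L, 1L, 2L)  # restarts from 1 in each new section
)

# Join speaker-level rows to call-level metadata on the shared key.
combined <-
  speaker_data %>%
  inner_join(calls, by = c("file_name", "last_update"))
```

Against the real database, the same join would be run on `tbl(pg, ...)` objects, as in the reprexes below.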
@iangow
I will let Vadim and Jingyu know under big5.
The code that uses the hack is create_executive_features.R and create_executive_roles.R.
I re-ran the code and got the same results.
library(DBI)
library(dplyr, warn.conflicts = FALSE)
library(tidyr)
pg <- dbConnect(RPostgres::Postgres())
rs <- dbExecute(pg, "SET work_mem TO '10GB'") # Speeds up creation of temporary tables
# Project schema
rs <- dbExecute(pg, "SET search_path TO big5, public")
executive_features <- tbl(pg, "executive_features")
executive_features_old <- tbl(pg, "executive_features_old")
executive_features %>%
anti_join(executive_features_old,
by=c("company_id", "executive_id", "file_name")) %>%
count() %>%
pull() == 0L
#> [1] TRUE
executive_features_old %>%
anti_join(executive_features,
by=c("company_id", "executive_id", "file_name")) %>%
count() %>%
pull() == 0L
#> [1] TRUE
executive_features_old %>%
inner_join(executive_features,
by=c("company_id", "executive_id", "file_name")) %>%
summarize_at(vars(matches("_qa")), funs(mean(., na.rm = TRUE))) %>%
collect() %>%
gather() %>%
arrange(key) %>%
print(n = Inf)
#> # A tibble: 70 x 2
#> key value
#> <chr> <dbl>
#> 1 adverb_qa.x 94.6
#> 2 adverb_qa.y 94.6
#> 3 anger_qa.x 1.07
#> 4 anger_qa.y 1.07
#> 5 anx_qa.x 1.44
#> 6 anx_qa.y 1.44
#> 7 article_qa.x 124.
#> 8 article_qa.y 124.
#> 9 assent_qa.x 7.10
#> 10 assent_qa.y 7.10
#> 11 cause_qa.x 27.5
#> 12 cause_qa.y 27.5
#> 13 certain_qa.x 22.1
#> 14 certain_qa.y 22.1
#> 15 conj_qa.x 113.
#> 16 conj_qa.y 113.
#> 17 discrep_qa.x 23.9
#> 18 discrep_qa.y 23.9
#> 19 excl_qa.x 41.9
#> 20 excl_qa.y 41.9
#> 21 frac_times_spoke_qa.x 2.70
#> 22 frac_times_spoke_qa.y 2.70
#> 23 frac_words_spoke_qa.x 1520.
#> 24 frac_words_spoke_qa.y 1520.
#> 25 generalisations_gklz_qa.x 27.5
#> 26 generalisations_gklz_qa.y 27.5
#> 27 genknlref_lz_qa.x 2.03
#> 28 genknlref_lz_qa.y 2.03
#> 29 hesit_lz_qa.x 0.175
#> 30 hesit_lz_qa.y 0.175
#> 31 i_qa.x 26.5
#> 32 i_qa.y 26.5
#> 33 incl_qa.x 139.
#> 34 incl_qa.y 139.
#> 35 inhib_qa.x 5.38
#> 36 inhib_qa.y 5.38
#> 37 insight_qa.x 38.0
#> 38 insight_qa.y 38.0
#> 39 negate_qa.x 19.4
#> 40 negate_qa.y 19.4
#> 41 negemoextr_lz_qa.x 2.93
#> 42 negemoextr_lz_qa.y 2.93
#> 43 numbers_qa.x 14.6
#> 44 numbers_qa.y 14.6
#> 45 posemoextr_lz_qa.x 7.89
#> 46 posemoextr_lz_qa.y 7.89
#> 47 posemone_lz_qa.x 50.8
#> 48 posemone_lz_qa.y 50.8
#> 49 qualifiers_gklz_qa.x 46.8
#> 50 qualifiers_gklz_qa.y 46.8
#> 51 quant_qa.x 57.4
#> 52 quant_qa.y 57.4
#> 53 sad_qa.x 2.78
#> 54 sad_qa.y 2.78
#> 55 swear_qa.x 0.0705
#> 56 swear_qa.y 0.0705
#> 57 tentat_qa.x 47.2
#> 58 tentat_qa.y 47.2
#> 59 thanks_qa.x 4.67
#> 60 thanks_qa.y 4.67
#> 61 vague_quantifiers_qa.x 35.7
#> 62 vague_quantifiers_qa.y 35.7
#> 63 we_qa.x 91.2
#> 64 we_qa.y 91.2
#> 65 words_spoke_qa.x 1299.
#> 66 words_spoke_qa.y 1299.
#> 67 words6_qa.x 423.
#> 68 words6_qa.y 423.
#> 69 wps_qa.x 21.2
#> 70 wps_qa.y 21.2
Created on 2019-01-03 by the reprex package (v0.2.1)
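The logic of the check above — two tables cover the same keys if and only if each anti_join against the other is empty — can be seen in a self-contained example (toy data, not the actual executive_features tables):

```r
library(dplyr, warn.conflicts = FALSE)

old <- tibble(company_id = c(1L, 2L), file_name = c("a_T", "b_T"))
new <- tibble(company_id = c(1L, 2L), file_name = c("a_T", "b_T"))

# Rows of each table with no key match in the other.
only_in_new <- anti_join(new, old, by = c("company_id", "file_name"))
only_in_old <- anti_join(old, new, by = c("company_id", "file_name"))

# Both empty => the two tables have identical key sets.
identical_keys <- nrow(only_in_new) == 0L && nrow(only_in_old) == 0L
```

The summarize/gather step in the reprex then goes further, checking that the non-key columns also agree (the .x and .y means line up pairwise).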
postgres=# CREATE TABLE big5.executive_features_old AS
SELECT * FROM big5.executive_features;
SELECT 316559
Followed by
postgres=# DROP TABLE big5.executive_features_old;
There is one discrepancy in executive_roles. Not sure why, but given that it is a nonpers executive, we are probably fine here.
library(DBI)
library(dplyr, warn.conflicts = FALSE)
library(tidyr)
pg <- dbConnect(RPostgres::Postgres())
rs <- dbExecute(pg, "SET work_mem TO '10GB'") # Speeds up creation of temporary tables
# Project schema
rs <- dbExecute(pg, "SET search_path TO big5, public")
executive_roles <- tbl(pg, "executive_roles")
executive_roles_old <- tbl(pg, "executive_roles_old")
executive_roles %>%
anti_join(executive_roles_old,
by=c("company_id", "executive_id", "file_name")) %>%
count() %>%
pull() == 0L
#> [1] TRUE
executive_roles_old %>%
anti_join(executive_roles,
by=c("company_id", "executive_id", "file_name"))
#> # Source: lazy query [?? x 7]
#> # Database: postgres [igow@aa.chicagobooth.edu:5432/postgres]
#> company_id executive_id file_name start_date is_ceo is_cfo
#> <int> <int> <chr> <dttm> <lgl> <lgl>
#> 1 8633 1169002 5624297_T 2015-03-12 21:00:00 FALSE TRUE
#> # ... with 1 more variable: source <chr>
executive_roles_old %>%
select(-start_date) %>%
anti_join(executive_roles)
#> Joining, by = c("company_id", "executive_id", "file_name", "is_ceo", "is_cfo", "source")
#> # Source: lazy query [?? x 6]
#> # Database: postgres [igow@aa.chicagobooth.edu:5432/postgres]
#> company_id executive_id file_name is_ceo is_cfo source
#> <int> <int> <chr> <lgl> <lgl> <chr>
#> 1 8633 1169002 5624297_T FALSE TRUE nonpers
executive_roles %>%
select(-start_date) %>%
anti_join(executive_roles_old)
#> Joining, by = c("company_id", "executive_id", "file_name", "is_ceo", "is_cfo", "source")
#> # Source: lazy query [?? x 6]
#> # Database: postgres [igow@aa.chicagobooth.edu:5432/postgres]
#> # ... with 6 variables: company_id <int>, executive_id <int>,
#> # file_name <chr>, is_ceo <lgl>, is_cfo <lgl>, source <chr>
Created on 2019-01-03 by the reprex package (v0.2.1)
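The trick used above — dropping a column before the anti_join — isolates whether a discrepancy lives only in that column. A toy illustration (hypothetical values, not the real executive_roles rows):

```r
library(dplyr, warn.conflicts = FALSE)

old <- tibble(id = 1:2,
              start_date = as.Date(c("2015-01-01", "2015-02-01")),
              is_ceo = c(TRUE, FALSE))
new <- tibble(id = 1:2,
              start_date = as.Date(c("2015-01-01", "2015-03-01")),
              is_ceo = c(TRUE, FALSE))

# With all columns in the join, row 2 shows up as a discrepancy
# (its start_date differs between the two tables)...
diff_full <- anti_join(old, new, by = c("id", "start_date", "is_ceo"))

# ...but after dropping start_date, the tables agree exactly,
# so the discrepancy is confined to that one column.
diff_rest <- old %>%
  select(-start_date) %>%
  anti_join(select(new, -start_date), by = c("id", "is_ceo"))
```

In the reprex above the row survives even after dropping start_date, which is how we know the nonpers row is genuinely absent from the new table, not merely re-dated.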
@danielacarrasco Sorry about the chatter on this issue, which relates to "downstream" code using these data. But it would be good to update the documentation (the original reason for this issue).
So, I have a really basic question. When I run this in RStudio it all seems to work fine. However, when I follow the README from GitHub and do it through the terminal, I still have access issues:
Error in postgresqlpqExec(new.con, sql4) : RS-DBI driver: (could not Retrieve the result : ERROR: permission denied for relation call_files ) Calls: lapply ... dbWriteTable -> postgresqlWriteTable -> postgresqlpqExec -> .Call Execution halted
Do you know why this is happening? I've tried both running update_se.sh and running just create_call_files.R, and I have the same problem either way.
This is hard to address in the abstract. Are you using the same user ID in both places? Are you able to post more output (and the associated inputs) here?
I am using different IDs. I define:
PGHOST="10.101.13.99"
PGDATABASE="crsp"
PGUSER="dcarrasco"
PGPASSWORD=xxxxxxx
USER="carrascod"
I define USER because it asks for the password associated with the server 45.113.235.201.
This is what I get after running update_se.sh
receiving incremental file list
rsync: change_dir "/home/carrascod/uploads" failed: No such file or directory (2)
sent 8 bytes received 95 bytes 7.63 bytes/sec
total size is 0 speedup is 0.00
rsync error: some files/attrs were not transferred (see previous errors) (code 23) at main.c(1668) [Receiver=3.1.2]
rsync: [Receiver] write error: Broken pipe (32)
importing call_files
Loading required package: methods
Loading required package: DBI
Joining, by = c("file_path", "mtime")
Joining, by = "file_path"
Error in postgresqlpqExec(new.con, sql4) :
RS-DBI driver: (could not Retrieve the result : ERROR: permission denied for relation call_files
)
Calls: lapply ... dbWriteTable -> postgresqlWriteTable -> postgresqlpqExec -> .Call
Execution halted
importing calls
Loading required package: methods
Loading required package: DBI
[1] 821601
importing speaker_data
Importing speaker data.Loading required package: methods
Error in result_create(conn@ptr, statement) :
Failed to fetch row: ERROR: permission denied for relation calls
Calls: system.time ... new -> initialize -> initialize -> result_create -> .Call
Timing stopped at: 0.152 0.004 0.169
Execution halted
I see there is a problem with the directory, but I think the main issue is the permission. Hope this helps.
I should mention that I've downloaded the files manually (instead of doing it through the .sh file) and then ran create_call_files.R separately, and got the same message minus the first 7 lines:
Loading required package: methods
Loading required package: DBI
Joining, by = c("file_path", "mtime")
Joining, by = "file_path"
Error in postgresqlpqExec(new.con, sql4) :
RS-DBI driver: (could not Retrieve the result : ERROR: permission denied for relation call_files
)
Calls: lapply ... dbWriteTable -> postgresqlWriteTable -> postgresqlpqExec -> .Call
Execution halted
importing calls
Loading required package: methods
Loading required package: DBI
[1] 821601
importing speaker_data
Importing speaker data.Loading required package: methods
Error in result_create(conn@ptr, statement) :
Failed to fetch row: ERROR: permission denied for relation calls
Calls: system.time ... new -> initialize -> initialize -> result_create -> .Call
Timing stopped at: 0.152 0.004 0.169
Execution halted
Sorry for all the messages.
Not sure what worked, but I have now managed to run:
I tried running create_calls.R and got the following:
Loading required package: methods
Error in result_create(conn@ptr, statement) :
Failed to fetch row: ERROR: must be owner of relation calls
Calls: dbExecute ... new -> initialize -> initialize -> result_create -> .Call
Execution halted
I'll check the code of create_calls.R and see where the error is coming from.
There was an inconsistency in the database that was causing the code to stop running (on my end) before the permissions were set up properly. I have almost fixed this now.
I think it should be good to go now.
I've tried it from home last night and again just now, and it doesn't get past the second line (Loading required package: DBI) when running create_call_files.R. Could it be my internet? If so, I'll go to uni later and try it there.
UPDATE: I'm at uni, connected with the cable, and the same is happening. Could it be some issue with the server?
@danielacarrasco It's very hard to diagnose the issue without additional details. You should be able to run R code line by line ... does that help identify the source of the issue?
Sorry, I ran it again (create_call_files.R) now and still get the permissions issue
Loading required package: methods
Updating data on 10.101.13.99
Joining, by = c("file_path", "mtime")
Joining, by = "file_path"
Error in postgresqlpqExec(new.con, sql4) :
RS-DBI driver: (could not Retrieve the result : ERROR: permission denied for relation call_files
)
Calls: lapply ... dbWriteTable -> postgresqlWriteTable -> postgresqlpqExec -> .Call
Execution halted
When doing it line by line, this is where it breaks (last part of the code):
if (dim(new_files_plus)[1] > 0) {
  new_file_dfs <- split_df(new_files_plus, n = 1000)
  rs <- lapply(new_file_dfs, process_rows)
}
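split_df() here is a project helper, so the sketch below is only a hypothetical base-R stand-in for the same chunking idea, assuming n is the maximum chunk size: split a data frame into groups of at most n rows, then hand each chunk to a row-processing function via lapply.

```r
# Hypothetical stand-in for the project's split_df() helper:
# break a data frame into chunks of at most n rows each.
split_df_sketch <- function(df, n = 1000) {
  if (nrow(df) == 0L) return(list())
  chunk_id <- ceiling(seq_len(nrow(df)) / n)
  split(df, chunk_id)
}

files <- data.frame(file_path = sprintf("file_%03d.xml", 1:2500))
chunks <- split_df_sketch(files, n = 1000)

# Each chunk would then be processed in turn, as in the snippet above:
# rs <- lapply(chunks, process_rows)
```

Chunking like this keeps each dbWriteTable (or similar) call to a bounded number of rows, which is the usual reason for batching uploads.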
EDIT: This is what happens when I run create_call_files.R now. Before, I was having permission issues, but those arose when running create_calls.R, which comes after create_call_files.R and import_calls_raw.R. I'll run the old version of create_call_files.R tomorrow and see if it still works.
@danielacarrasco I logged into RStudio using your account and I think I have sorted this out. I had to make some tweaks to the code (I committed these under your ID; see here) to get this to work.
Here is what I ran in the "Terminal" in RStudio Server (under your account). We have seen ssh_exchange_identification issues before; they miraculously solved themselves last time, so hopefully they do so again (this means that I can't sync my server with the server holding the files, due to some IP issue).
dcarrasco@igow-z640:~/se_core$ export PGDATABASE=crsp
dcarrasco@igow-z640:~/se_core$ export USER=carrascod
dcarrasco@igow-z640:~/se_core$ export PGHOST=localhost
dcarrasco@igow-z640:~/se_core$
dcarrasco@igow-z640:~/se_core$ ./update_se.sh
ssh_exchange_identification: read: Connection reset by peer
rsync: connection unexpectedly closed (0 bytes received so far) [Receiver]
rsync error: unexplained error (code 255) at io.c(235) [Receiver=3.1.2]
Importing call_files
localhost
Updating data on localhost
Importing calls
Loading required package: DBI
Updating data on localhost
[1] 822492
Importing speaker_data
Importing speaker data.
user system elapsed
0.170 0.016 61.133
dcarrasco@igow-z640:~/se_core$
Sorry for the delay. It now works for me. I'll expand the README this morning. I have one question regarding accessing the tables. How can they be downloaded? I assume through SQL, but how exactly?
I can't imagine that you'd ever want to "download" these tables. For R, maybe read this.
You might "download" portions of the data using collect(). The material at the link above covers this.
Thanks. I guess I used the wrong term; I should have said "access". Anyway, thanks for the link, it's very useful.
@iangow
I can't run the code using --vanilla, i.e.
Rscript --vanilla create_call_files.R
does not work, but
Rscript create_call_files.R
does.
If this is the case for other users, I should change the .sh file.
Also, I see you created a cron job for downloading the files. Should I remove that step from the README?
Adding --vanilla ensures that the environment variables come from the environment running R, not from ~/.Rprofile. If it works without --vanilla, then you probably have environment variables set in ~/.Rprofile that you need to set before running the script.
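One way to check which of these variables an R session actually sees is to query the environment directly; Sys.getenv returns the supplied default when a variable is unset, so unset connection settings are easy to spot. (The variable names below are the standard libpq ones used elsewhere in this thread; the "localhost" value is just for illustration.)

```r
# Inspect the connection-related environment variables that
# RPostgres/libpq consult when no explicit arguments are given.
pg_settings <- Sys.getenv(c("PGHOST", "PGDATABASE", "PGUSER"),
                          unset = "<unset>")

# For this illustration, set one explicitly and read it back.
Sys.setenv(PGHOST = "localhost")
```

Running this under Rscript --vanilla shows only what the shell exported, which is exactly the point of the flag.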
In the cron job for this code, I have:
#!/usr/bin/env bash
export PGHOST=localhost
export PGDATABASE=crsp
export PGPASSWORD=**********
export PGUSER=igow
export CODE_DIR=/home/igow/git/se_core
cd $CODE_DIR
$CODE_DIR/update_se.sh
Regarding the README.md file, the primary user should be the user of the data. This user doesn't need to know about the download steps. Put those in a later section of the README directed at "advanced" users (those who will download and process the data themselves).
@Yvonne-Han Once @yuchenliu4 accepts the invitation to join this repository, I think it makes sense to check that there aren't insights in Yuchen's HRM project that could be included here. The idea is for the "readme" to be a short note useful for people new to the data.
@iangow I see. Will do.
See comments in #4 for more details. The end user should only need to know about calls, company_ids, and speaker_data. We should document the other tables (call_files and calls_raw), but in a separate section for "advanced" users. The documentation should (in the first instance) go in readme.md in the root directory of the repository. If it gets too complicated, we can break this apart.