iangow closed this 5 years ago
@danielacarrasco So we're just waiting on "tone" data. You might want to use the changes I made to the "fog" code to guide you a little. Let me know if you have any questions.
@danielacarrasco BTW, you should be able to run create_training_data.R to create a mock-up of the training data. The variable is_nonans is the classification we're trying to predict, and the "features" to the right of it are what we will use in the model.
I think we will start with AdaBoost. I need to do a little digging to work out what package we should use, etc.
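To make the AdaBoost idea concrete, here is a minimal pure-Python sketch of the algorithm using one-split "decision stumps" as weak learners. It is only an illustration under stated assumptions: the thread has not yet chosen a package, the function names (fit_stump, adaboost_fit, adaboost_predict) and the toy data are hypothetical, and labels are coded +1 (non-answer) and -1 (answer) rather than the logical is_nonans column. For real work a library implementation (for example scikit-learn's AdaBoostClassifier) would be the more sensible choice.

```python
import math

def fit_stump(X, y, w):
    """Return (weighted_error, feature, threshold, sign) for the best one-split rule."""
    best = None
    for j in range(len(X[0])):
        for thr in sorted({row[j] for row in X}):
            for sign in (1, -1):
                # The rule predicts `sign` when the feature exceeds `thr`, else `-sign`.
                err = sum(wi for wi, row, yi in zip(w, X, y)
                          if (sign if row[j] > thr else -sign) != yi)
                if best is None or err < best[0]:
                    best = (err, j, thr, sign)
    return best

def stump_predict(stump, row):
    _, j, thr, sign = stump
    return sign if row[j] > thr else -sign

def adaboost_fit(X, y, n_rounds=3):
    """Fit n_rounds stumps, reweighting the observations after each round."""
    n = len(X)
    w = [1.0 / n] * n                      # start with uniform weights
    model = []
    for _ in range(n_rounds):
        stump = fit_stump(X, y, w)
        # Clamp the error so the log below is always defined.
        err = min(max(stump[0], 1e-10), 1 - 1e-10)
        alpha = 0.5 * math.log((1 - err) / err)
        model.append((alpha, stump))
        # Increase the weight of misclassified rows, decrease the rest.
        w = [wi * math.exp(-alpha * yi * stump_predict(stump, row))
             for wi, yi, row in zip(w, y, X)]
        total = sum(w)
        w = [wi / total for wi in w]
    return model

def adaboost_predict(model, row):
    """Classify by the sign of the alpha-weighted vote of all stumps."""
    score = sum(alpha * stump_predict(stump, row) for alpha, stump in model)
    return 1 if score >= 0 else -1

# Hypothetical toy data: one made-up feature, labels +1 / -1.
X = [[1], [2], [3], [4], [5], [6]]
y = [1, 1, -1, -1, 1, 1]
model = adaboost_fit(X, y, n_rounds=3)
predictions = [adaboost_predict(model, row) for row in X]
```

On this toy data no single stump can classify all six rows correctly (the negative labels sit in the middle), but the weighted vote of three stumps can, which is the point of boosting.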
Fog is done:
igow@igow-z640:~/git/se_features$ fog_measure/fog_run.py
n_files: 231968
igow@igow-z640:~/git/se_features$ fog_measure/fog_run.py
n_files: 0
igow@igow-z640:~/git/se_features$
@danielacarrasco I tweaked the code to include tone in the training data. There is one other set of features that we need to include (I will add these, and the code for them, soon). In the meantime, we seem to have NA (None or NULL) values for some calls. From the following, it seems that the data aren't yet there for tone_measure.
Sys.setenv(PGHOST = "10.101.13.99", PGDATABASE="crsp")
library(dplyr, warn.conflicts = FALSE)
library(DBI)
pg <- dbConnect(RPostgres::Postgres(), bigint = "integer")
rs <- dbExecute(pg, "SET search_path TO non_answer, se_features, public")
training_data <- tbl(pg, "training_data")
liwc <- tbl(pg, "liwc")
gold_standard <- tbl(pg, "gold_standard")
tone_measure <- tbl(pg, "tone_measure")
training_data %>%
  filter(is.na(achieve)) %>%
  distinct(file_name) %>%
  inner_join(liwc)
#> Joining, by = "file_name"
#> # Source: lazy query [?? x 52]
#> # Database: postgres [igow@10.101.13.99:5432/crsp]
#> file_name last_update speaker_name context section
#> <chr> <dttm> <chr> <chr> <int>
#> 1 2585926_T 2010-06-23 05:13:05 Operator qa 1
#> 2 2585926_T 2010-06-23 05:13:05 David Amy qa 1
#> 3 2585926_T 2010-06-23 05:13:05 Operator qa 1
#> 4 2585926_T 2010-06-23 05:13:05 Harry DeMott qa 1
#> 5 2585926_T 2010-06-23 05:13:05 David Amy qa 1
#> 6 2585926_T 2010-06-23 05:13:05 Harry DeMott qa 1
#> 7 2585926_T 2010-06-23 05:13:05 Operator qa 1
#> 8 2585926_T 2010-06-23 05:13:05 Marci Ryvic… qa 1
#> 9 2585926_T 2010-06-23 05:13:05 David Amy qa 1
#> 10 2585926_T 2010-06-23 05:13:05 Steve Marks qa 1
#> # … with more rows, and 47 more variables: speaker_number <int>,
#> # achieve <int>, adverb <int>, affect <int>, anger <int>, anx <int>,
#> # article <int>, assent <int>, cause <int>, certain <int>,
#> # cogmech <int>, conj <int>, discrep <int>, excl <int>, future <int>,
#> # generalisations_gklz <int>, genknlref_lz <int>, hesit_lz <int>,
#> # i <int>, incl <int>, inhib <int>, insight <int>, ipron <int>,
#> # money <int>, negate <int>, negemo <int>, negemoextr_lz <int>,
#> # negemone_lz <int>, past <int>, percept <int>, posemo <int>,
#> # posemoextr_lz <int>, posemone_lz <int>, ppron <int>, present <int>,
#> # pronoun <int>, qualifiers_gklz <int>, quant <int>, sad <int>,
#> # swear <int>, tentat <int>, thanks <int>, thirdpron_gklz <int>,
#> # vague_quantifiers <int>, value_gklz <int>, we <int>, you <int>
training_data %>%
  filter(file_name == "2585926_T")
#> # Source: lazy query [?? x 75]
#> # Database: postgres [igow@10.101.13.99:5432/crsp]
#> file_name answer_nums is_nonans achieve adverb affect anger anx article
#> <chr> <chr> <lgl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 2585926_T {54} FALSE NA NA NA NA NA NA
#> # … with 66 more variables: assent <dbl>, cause <dbl>, certain <dbl>,
#> # cogmech <dbl>, conj <dbl>, discrep <dbl>, excl <dbl>, future <dbl>,
#> # generalisations_gklz <dbl>, genknlref_lz <dbl>, hesit_lz <dbl>,
#> # i <dbl>, incl <dbl>, inhib <dbl>, insight <dbl>, ipron <dbl>,
#> # money <dbl>, negate <dbl>, negemo <dbl>, negemoextr_lz <dbl>,
#> # negemone_lz <dbl>, past <dbl>, percept <dbl>, posemo <dbl>,
#> # posemoextr_lz <dbl>, posemone_lz <dbl>, ppron <dbl>, present <dbl>,
#> # pronoun <dbl>, qualifiers_gklz <dbl>, quant <dbl>, sad <dbl>,
#> # swear <dbl>, tentat <dbl>, thanks <dbl>, thirdpron_gklz <dbl>,
#> # vague_quantifiers <dbl>, value_gklz <dbl>, we <dbl>, you <dbl>,
#> # count <int>, sum <int>, sent_count <int>, sum_6 <int>, sum_num <int>,
#> # fog <dbl>, complex_words <dbl>, fog_words <dbl>, fog_sents <dbl>,
#> # positive <int>, negative <int>, uncertainty <int>, litigious <int>,
#> # modal_strong <int>, modal_weak <int>, regex_00 <lgl>, regex_01 <lgl>,
#> # regex_02 <lgl>, regex_03 <lgl>, regex_04 <lgl>, regex_05 <lgl>,
#> # regex_06 <lgl>, regex_07 <lgl>, regex_08 <lgl>, regex_09 <lgl>,
#> # regex_10 <lgl>
liwc %>%
  filter(file_name == "2585926_T") %>%
  filter(speaker_number == 56L)
#> # Source: lazy query [?? x 52]
#> # Database: postgres [igow@10.101.13.99:5432/crsp]
#> file_name last_update speaker_name context section speaker_number
#> <chr> <dttm> <chr> <chr> <int> <int>
#> 1 2585926_T 2010-06-23 05:13:05 Lucy Rutish… qa 1 56
#> # … with 46 more variables: achieve <int>, adverb <int>, affect <int>,
#> # anger <int>, anx <int>, article <int>, assent <int>, cause <int>,
#> # certain <int>, cogmech <int>, conj <int>, discrep <int>, excl <int>,
#> # future <int>, generalisations_gklz <int>, genknlref_lz <int>,
#> # hesit_lz <int>, i <int>, incl <int>, inhib <int>, insight <int>,
#> # ipron <int>, money <int>, negate <int>, negemo <int>,
#> # negemoextr_lz <int>, negemone_lz <int>, past <int>, percept <int>,
#> # posemo <int>, posemoextr_lz <int>, posemone_lz <int>, ppron <int>,
#> # present <int>, pronoun <int>, qualifiers_gklz <int>, quant <int>,
#> # sad <int>, swear <int>, tentat <int>, thanks <int>,
#> # thirdpron_gklz <int>, vague_quantifiers <int>, value_gklz <int>,
#> # we <int>, you <int>
gold_standard %>%
  filter(file_name == "2585926_T")
#> # Source: lazy query [?? x 7]
#> # Database: postgres [igow@10.101.13.99:5432/crsp]
#> file_name answer_nums obs_type is_unable is_refuse is_after_call
#> <chr> <pq__int4> <chr> <lgl> <lgl> <lgl>
#> 1 2585926_T {54} train FALSE FALSE FALSE
#> # … with 1 more variable: is_nonans <lgl>
tone_measure %>%
  filter(file_name == "2585926_T")
#> # Source: lazy query [?? x 11]
#> # Database: postgres [igow@10.101.13.99:5432/crsp]
#> # … with 11 variables: file_name <chr>, last_update <dttm>,
#> # speaker_number <int>, context <chr>, section <int>, positive <int>,
#> # negative <int>, uncertainty <int>, litigious <int>,
#> # modal_strong <int>, modal_weak <int>
Created on 2019-06-03 by the reprex package (v0.3.0)
Actually, once we have filled in tone_measure, we should be good to go.
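One way to confirm that a feature table has been filled in is an anti-join: list the calls that appear in the gold standard but have no rows in the feature table. The sketch below shows the idea on plain Python lists. The helper name missing_keys is hypothetical, and apart from 2585926_T (the call examined above) the file names are made up. In practice the lists would come from querying file_name in gold_standard and tone_measure.

```python
def missing_keys(needed, available):
    """Return the keys in `needed` that never occur in `available`.

    Order follows first appearance in `needed`; duplicates are dropped.
    """
    have = set(available)
    seen = set()
    missing = []
    for key in needed:
        if key not in have and key not in seen:
            seen.add(key)
            missing.append(key)
    return missing

# Hypothetical inputs standing in for the file_name columns of two tables.
gold_files = ["2585926_T", "1111111_T", "2585926_T"]
tone_files = ["1111111_T"]

# Calls that still need tone data.
missing = missing_keys(gold_files, tone_files)
```

An empty result would mean every call in the gold standard has at least one row in the feature table.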
@iangow I am not sure all the tables have been generated. I ran it yesterday from home, but sometimes the VPN loses its connection and the run crashes. I can run it again this afternoon to make sure they're all there.
When on the VPN, I often run code from RStudio Server, as that runs even if I get disconnected. I am running it now, so no need for you to do so:
fog: Needed to re-run. Running now.
liwc
word_counts
tone: Still to be done.