Pull together training data set

iangow commented 5 years ago

[x] fog: Needed to re-run. Running now.
[x] liwc
[x] word_counts
[x] tone: Still to be done.

iangow commented 5 years ago

@danielacarrasco So we're just waiting on the "tone" data. You might want to use the changes I made to the "fog" code as a guide. Let me know if you have any questions.

iangow commented 5 years ago

@danielacarrasco

BTW, you should be able to run create_training_data.R to create a mock-up of the training data. The variable is_nonans is the classification we're trying to predict, and the "features" to the right of it are what we will use in the model.

I think we will start with AdaBoost. I need to do a little digging to work out what package we should use, etc.
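
To make the label/feature split and the boosting idea concrete, here is a minimal sketch in plain Python. It is not the project's code: no package has been chosen yet, so this hand-rolls AdaBoost over one-feature threshold rules ("stumps") purely for illustration. The function names, the mock rows, and the assumption that file_name and answer_nums are identifier columns rather than features are all hypothetical.

```python
import math

def split_label_features(rows, label="is_nonans", id_cols=("file_name", "answer_nums")):
    """Separate the label from the feature columns (assumed layout, see above)."""
    feature_names = [k for k in rows[0] if k != label and k not in id_cols]
    # float() fails on None, so rows with missing (NA) features must be
    # filled in or dropped before this step.
    X = [[float(r[k]) for k in feature_names] for r in rows]
    y = [1 if r[label] else -1 for r in rows]
    return feature_names, X, y

def stump_predict(x, j, thr, sign):
    """Predict +sign when feature j exceeds thr, -sign otherwise."""
    return sign if x[j] > thr else -sign

def best_stump(X, y, w):
    """Exhaustively pick the stump with the lowest weighted error."""
    best = None
    for j in range(len(X[0])):
        for thr in sorted({x[j] for x in X}):
            for sign in (1, -1):
                err = sum(wi for wi, x, yi in zip(w, X, y)
                          if stump_predict(x, j, thr, sign) != yi)
                if best is None or err < best[0]:
                    best = (err, j, thr, sign)
    return best

def fit_adaboost(X, y, n_rounds=5):
    """Fit n_rounds weighted stumps, re-weighting rows after each round."""
    n = len(X)
    w = [1.0 / n] * n
    model = []
    for _ in range(n_rounds):
        err, j, thr, sign = best_stump(X, y, w)
        err = min(max(err, 1e-10), 1 - 1e-10)  # keep the log finite
        alpha = 0.5 * math.log((1 - err) / err)
        model.append((alpha, j, thr, sign))
        # Misclassified rows gain weight, correct rows lose weight.
        w = [wi * math.exp(-alpha * yi * stump_predict(x, j, thr, sign))
             for wi, x, yi in zip(w, X, y)]
        total = sum(w)
        w = [wi / total for wi in w]
    return model

def predict(model, x):
    """Weighted vote of all stumps; True means 'non-answer'."""
    return sum(a * stump_predict(x, j, thr, sign) for a, j, thr, sign in model) > 0

# Made-up rows; only the column names echo the real tables.
rows = [
    {"file_name": "a_T", "answer_nums": "{1}", "is_nonans": False, "fog": 9.0, "uncertainty": 0},
    {"file_name": "b_T", "answer_nums": "{2}", "is_nonans": True, "fog": 15.0, "uncertainty": 3},
    {"file_name": "c_T", "answer_nums": "{3}", "is_nonans": False, "fog": 10.5, "uncertainty": 1},
    {"file_name": "d_T", "answer_nums": "{4}", "is_nonans": True, "fog": 14.0, "uncertainty": 4},
]
feature_names, X, y = split_label_features(rows)
model = fit_adaboost(X, y)
predictions = [predict(model, x) for x in X]
```

In practice we would use an existing implementation rather than this one; the sketch is only meant to show which column is the target and how the remaining columns feed the classifier.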

iangow commented 5 years ago

Fog is done:

igow@igow-z640:~/git/se_features$ fog_measure/fog_run.py
n_files: 231968
igow@igow-z640:~/git/se_features$ fog_measure/fog_run.py
n_files: 0
igow@igow-z640:~/git/se_features$
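
The second run reporting n_files: 0 suggests the script only picks up files that have not been processed yet. I am not reproducing fog_run.py here; the following is a hypothetical sketch of that pattern, with invented names (pending_files, run, compute) and an in-memory dict standing in for the results table.

```python
def pending_files(all_files, processed):
    """Return the files with no stored result yet, in a stable order."""
    return sorted(set(all_files) - set(processed))

def run(all_files, processed, compute):
    """Process only the pending files and report how many there were."""
    todo = pending_files(all_files, processed)
    for file_name in todo:
        processed[file_name] = compute(file_name)
    print(f"n_files: {len(todo)}")
    return len(todo)

# Made-up file names; len() stands in for the real fog computation.
all_files = ["1_T", "2_T", "3_T"]
processed = {}
first = run(all_files, processed, compute=len)
second = run(all_files, processed, compute=len)
```

Because the second call finds nothing pending, re-running after an interruption is cheap and safe under these assumptions.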

iangow commented 5 years ago

@danielacarrasco

I tweaked the code to include tone in the training data. There is one other set of features that we need to include (I will add these, and the code for them, soon). In the meantime, we seem to have NA (None or NULL) values for some calls. From the following, it seems that the data aren't yet there for tone_measure.

Sys.setenv(PGHOST = "10.101.13.99", PGDATABASE="crsp")
library(dplyr, warn.conflicts = FALSE)
library(DBI)

pg <- dbConnect(RPostgres::Postgres(), bigint = "integer")
rs <- dbExecute(pg, "SET search_path TO non_answer, se_features, public")

training_data <- tbl(pg, "training_data")
liwc <- tbl(pg, "liwc")
gold_standard <- tbl(pg, "gold_standard")
tone_measure <- tbl(pg, "tone_measure")

training_data %>%
    filter(is.na(achieve)) %>% 
    distinct(file_name) %>% 
    inner_join(liwc)
#> Joining, by = "file_name"
#> # Source:   lazy query [?? x 52]
#> # Database: postgres [igow@10.101.13.99:5432/crsp]
#>    file_name last_update         speaker_name context section
#>    <chr>     <dttm>              <chr>        <chr>     <int>
#>  1 2585926_T 2010-06-23 05:13:05 Operator     qa            1
#>  2 2585926_T 2010-06-23 05:13:05 David Amy    qa            1
#>  3 2585926_T 2010-06-23 05:13:05 Operator     qa            1
#>  4 2585926_T 2010-06-23 05:13:05 Harry DeMott qa            1
#>  5 2585926_T 2010-06-23 05:13:05 David Amy    qa            1
#>  6 2585926_T 2010-06-23 05:13:05 Harry DeMott qa            1
#>  7 2585926_T 2010-06-23 05:13:05 Operator     qa            1
#>  8 2585926_T 2010-06-23 05:13:05 Marci Ryvic… qa            1
#>  9 2585926_T 2010-06-23 05:13:05 David Amy    qa            1
#> 10 2585926_T 2010-06-23 05:13:05 Steve Marks  qa            1
#> # … with more rows, and 47 more variables: speaker_number <int>,
#> #   achieve <int>, adverb <int>, affect <int>, anger <int>, anx <int>,
#> #   article <int>, assent <int>, cause <int>, certain <int>,
#> #   cogmech <int>, conj <int>, discrep <int>, excl <int>, future <int>,
#> #   generalisations_gklz <int>, genknlref_lz <int>, hesit_lz <int>,
#> #   i <int>, incl <int>, inhib <int>, insight <int>, ipron <int>,
#> #   money <int>, negate <int>, negemo <int>, negemoextr_lz <int>,
#> #   negemone_lz <int>, past <int>, percept <int>, posemo <int>,
#> #   posemoextr_lz <int>, posemone_lz <int>, ppron <int>, present <int>,
#> #   pronoun <int>, qualifiers_gklz <int>, quant <int>, sad <int>,
#> #   swear <int>, tentat <int>, thanks <int>, thirdpron_gklz <int>,
#> #   vague_quantifiers <int>, value_gklz <int>, we <int>, you <int>

training_data %>%
    filter(file_name == "2585926_T")
#> # Source:   lazy query [?? x 75]
#> # Database: postgres [igow@10.101.13.99:5432/crsp]
#>   file_name answer_nums is_nonans achieve adverb affect anger   anx article
#>   <chr>     <chr>       <lgl>       <dbl>  <dbl>  <dbl> <dbl> <dbl>   <dbl>
#> 1 2585926_T {54}        FALSE          NA     NA     NA    NA    NA      NA
#> # … with 66 more variables: assent <dbl>, cause <dbl>, certain <dbl>,
#> #   cogmech <dbl>, conj <dbl>, discrep <dbl>, excl <dbl>, future <dbl>,
#> #   generalisations_gklz <dbl>, genknlref_lz <dbl>, hesit_lz <dbl>,
#> #   i <dbl>, incl <dbl>, inhib <dbl>, insight <dbl>, ipron <dbl>,
#> #   money <dbl>, negate <dbl>, negemo <dbl>, negemoextr_lz <dbl>,
#> #   negemone_lz <dbl>, past <dbl>, percept <dbl>, posemo <dbl>,
#> #   posemoextr_lz <dbl>, posemone_lz <dbl>, ppron <dbl>, present <dbl>,
#> #   pronoun <dbl>, qualifiers_gklz <dbl>, quant <dbl>, sad <dbl>,
#> #   swear <dbl>, tentat <dbl>, thanks <dbl>, thirdpron_gklz <dbl>,
#> #   vague_quantifiers <dbl>, value_gklz <dbl>, we <dbl>, you <dbl>,
#> #   count <int>, sum <int>, sent_count <int>, sum_6 <int>, sum_num <int>,
#> #   fog <dbl>, complex_words <dbl>, fog_words <dbl>, fog_sents <dbl>,
#> #   positive <int>, negative <int>, uncertainty <int>, litigious <int>,
#> #   modal_strong <int>, modal_weak <int>, regex_00 <lgl>, regex_01 <lgl>,
#> #   regex_02 <lgl>, regex_03 <lgl>, regex_04 <lgl>, regex_05 <lgl>,
#> #   regex_06 <lgl>, regex_07 <lgl>, regex_08 <lgl>, regex_09 <lgl>,
#> #   regex_10 <lgl>

liwc %>%
    filter(file_name == "2585926_T") %>%
    filter(speaker_number == 56L)
#> # Source:   lazy query [?? x 52]
#> # Database: postgres [igow@10.101.13.99:5432/crsp]
#>   file_name last_update         speaker_name context section speaker_number
#>   <chr>     <dttm>              <chr>        <chr>     <int>          <int>
#> 1 2585926_T 2010-06-23 05:13:05 Lucy Rutish… qa            1             56
#> # … with 46 more variables: achieve <int>, adverb <int>, affect <int>,
#> #   anger <int>, anx <int>, article <int>, assent <int>, cause <int>,
#> #   certain <int>, cogmech <int>, conj <int>, discrep <int>, excl <int>,
#> #   future <int>, generalisations_gklz <int>, genknlref_lz <int>,
#> #   hesit_lz <int>, i <int>, incl <int>, inhib <int>, insight <int>,
#> #   ipron <int>, money <int>, negate <int>, negemo <int>,
#> #   negemoextr_lz <int>, negemone_lz <int>, past <int>, percept <int>,
#> #   posemo <int>, posemoextr_lz <int>, posemone_lz <int>, ppron <int>,
#> #   present <int>, pronoun <int>, qualifiers_gklz <int>, quant <int>,
#> #   sad <int>, swear <int>, tentat <int>, thanks <int>,
#> #   thirdpron_gklz <int>, vague_quantifiers <int>, value_gklz <int>,
#> #   we <int>, you <int>

gold_standard %>%
    filter(file_name == "2585926_T") 
#> # Source:   lazy query [?? x 7]
#> # Database: postgres [igow@10.101.13.99:5432/crsp]
#>   file_name answer_nums obs_type is_unable is_refuse is_after_call
#>   <chr>     <pq__int4>  <chr>    <lgl>     <lgl>     <lgl>        
#> 1 2585926_T {54}        train    FALSE     FALSE     FALSE        
#> # … with 1 more variable: is_nonans <lgl>

tone_measure %>%
    filter(file_name == "2585926_T") 
#> # Source:   lazy query [?? x 11]
#> # Database: postgres [igow@10.101.13.99:5432/crsp]
#> # … with 11 variables: file_name <chr>, last_update <dttm>,
#> #   speaker_number <int>, context <chr>, section <int>, positive <int>,
#> #   negative <int>, uncertainty <int>, litigious <int>,
#> #   modal_strong <int>, modal_weak <int>

Created on 2019-06-03 by the reprex package (v0.3.0)
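
The check above can be generalised: list every labelled call that has no rows at all in a feature table. A rough sketch follows, using an in-memory SQLite database as a stand-in for the PostgreSQL tables. The table contents are invented (only 2585926_T comes from the output above), and it assumes the training data is built by left-joining feature tables onto gold_standard, which create_training_data.R may or may not do.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE gold_standard (file_name TEXT, is_nonans INTEGER);
CREATE TABLE tone_measure (file_name TEXT, positive INTEGER);
-- '1111111_T' is a made-up call that does have tone data.
INSERT INTO gold_standard VALUES ('2585926_T', 0), ('1111111_T', 1);
INSERT INTO tone_measure VALUES ('1111111_T', 4);
""")

# A left join keeps every labelled call; calls without tone rows get NULL,
# which is what shows up as NA on the R side.
joined = con.execute("""
    SELECT g.file_name, t.positive
    FROM gold_standard AS g
    LEFT JOIN tone_measure AS t ON g.file_name = t.file_name
    ORDER BY g.file_name
""").fetchall()

# An anti-join lists exactly the calls that still need tone data.
missing = [row[0] for row in con.execute("""
    SELECT g.file_name
    FROM gold_standard AS g
    LEFT JOIN tone_measure AS t ON g.file_name = t.file_name
    WHERE t.file_name IS NULL
    ORDER BY g.file_name
""")]
```

In dplyr the second query would correspond to anti_join(gold_standard, tone_measure, by = "file_name").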

iangow commented 5 years ago

Actually, once we have filled in tone_measure, we should be good to go.

danielacarrasco commented 5 years ago

@iangow I am not sure all the tables have been generated. I ran it yesterday from home, but sometimes the VPN loses connection and the run crashes. I can run it again this afternoon to make sure they're all there.

iangow commented 5 years ago

When on the VPN, I often run code from RStudio Server, as that runs even if I get disconnected. I am running it now, so no need for you to do so:

iangow/se_features#10: Pull together training data set