Bug in covars_make()

The tests below work as expected until we get to pr_noun:


test_that("pr_noun computed the same in predict v component function", {
    txt <- c(test1 = "One two cat.  One two cat.  Always eat apples.")
    frompredict <- as.data.frame(sophistication:::get_covars_from_newdata.character(txt))

    # should be: (1 + 1 + 1) / 9 = 0.33333
    # doc_id sentence_id token_id  token  lemma   pos     entity
    # 1   test1           1        1    One    one   NUM CARDINAL_B
    # 2   test1           1        2    two    two   NUM CARDINAL_B
    # 3   test1           1        3    cat    cat  NOUN           
    # 4   test1           1        4      .      . PUNCT           
    # 5   test1           1        5               SPACE           
    # 6   test1           2        1    One    one   NUM CARDINAL_B
    # 7   test1           2        2    two    two   NUM CARDINAL_B
    # 8   test1           2        3    cat    cat  NOUN           
    # 9   test1           2        4      .      . PUNCT           
    # 10  test1           2        5               SPACE           
    # 11  test1           3        1 Always always   ADV           
    # 12  test1           3        2    eat    eat  VERB           
    # 13  test1           3        3 apples  apple  NOUN           
    # 14  test1           3        4      .      . PUNCT  

    expect_equal(covars_make_pos(txt)[, c("pr_noun")], 0.333, tolerance = .001)    # fails: returns 0.214
    expect_equal(frompredict[, "pr_noun"], 0.333, tolerance = .001)
})

# Error: covars_make_pos(txt)[, c("pr_noun")] not equal to 0.333.
# 1/1 mismatches
# [1] 0.214 - 0.333 == -0.119

The problem is that covars_make_pos() uses the total token count as the denominator, and that count includes SPACE and PUNCT tokens. get_covars_from_newdata() (used by predict_readability()) computes the proportion correctly, leaving those tokens out. In the example above the denominator is 14 instead of 9, so the result is 3/14 = 0.214 rather than 3/9 = 0.333.
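
To make the arithmetic concrete, here is a minimal base R sketch that reproduces both numbers from the POS tags in the parse shown above. It does not call anything in sophistication; the `pos` vector is typed in by hand from that table, and the variable names are only illustrative.

```r
# POS tags copied by hand from the parse shown in the test above (14 tokens)
pos <- c("NUM", "NUM", "NOUN", "PUNCT", "SPACE",
         "NUM", "NUM", "NOUN", "PUNCT", "SPACE",
         "ADV", "VERB", "NOUN", "PUNCT")

n_noun <- sum(pos == "NOUN")

# Denominator counts every token, including PUNCT and SPACE
pr_noun_all <- n_noun / length(pos)

# Denominator excludes PUNCT and SPACE tokens
pr_noun_words <- n_noun / sum(!pos %in% c("PUNCT", "SPACE"))

round(c(all_tokens = pr_noun_all, words_only = pr_noun_words), 3)
# all_tokens = 0.214, words_only = 0.333
```

If covars_make_pos() is meant to agree with get_covars_from_newdata(), the second form is the one it would need to use.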

kbenoit / sophistication

Bug in covars_make() #9