The tests below pass as expected until we get to pr_noun:
test_that("pr_noun computed the same in predict v component function", {
txt <- c(test1 = "One two cat. One two cat. Always eat apples.")
frompredict <- as.data.frame(sophistication:::get_covars_from_newdata.character(txt))
# should be: (1 + 1 + 1) / 9 = 0.33333
# doc_id sentence_id token_id token lemma pos entity
# 1 test1 1 1 One one NUM CARDINAL_B
# 2 test1 1 2 two two NUM CARDINAL_B
# 3 test1 1 3 cat cat NOUN
# 4 test1 1 4 . . PUNCT
# 5 test1 1 5 SPACE
# 6 test1 2 1 One one NUM CARDINAL_B
# 7 test1 2 2 two two NUM CARDINAL_B
# 8 test1 2 3 cat cat NOUN
# 9 test1 2 4 . . PUNCT
# 10 test1 2 5 SPACE
# 11 test1 3 1 Always always ADV
# 12 test1 3 2 eat eat VERB
# 13 test1 3 3 apples apple NOUN
# 14 test1 3 4 . . PUNCT
expect_equal(covars_make_pos(txt)[, c("pr_noun")], 0.333, tol = .001) # fails: returns 0.214 (3 / 14)
expect_equal(frompredict[, "pr_noun"], 0.333, tol = .001)
})
# Error: covars_make_pos(txt)[, c("pr_noun")] not equal to 0.333.
# 1/1 mismatches
# [1] 0.214 - 0.333 == -0.119
What is happening is that covars_make_pos() uses the total token count as the denominator, and that count includes SPACE and PUNCT tokens. The function get_covars_from_newdata() (used by predict_readability()) computes this correctly, because it counts only the word tokens. In the example above, the difference is a denominator of 9 (words only) versus 14 (all tokens), which gives 3 / 9 = 0.333 versus 3 / 14 = 0.214.
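
The arithmetic can be checked without running the parser. The sketch below rebuilds the parse from the test comments by hand and computes the proportion of nouns with both denominators. The data frame `parsed` and all variable names are illustrative assumptions, not part of the sophistication package, and the whitespace placeholder for the SPACE tokens is a guess at what the parser emits.

```r
# Hand-built copy of the parse shown in the test comments (hypothetical object,
# no parser required). SPACE tokens are represented by a single blank.
parsed <- data.frame(
  token = c("One", "two", "cat", ".", " ",
            "One", "two", "cat", ".", " ",
            "Always", "eat", "apples", "."),
  pos = c("NUM", "NUM", "NOUN", "PUNCT", "SPACE",
          "NUM", "NUM", "NOUN", "PUNCT", "SPACE",
          "ADV", "VERB", "NOUN", "PUNCT"),
  stringsAsFactors = FALSE
)

# Numerator: number of NOUN tokens
n_noun <- sum(parsed$pos == "NOUN")

# Denominator 1: every row of the parse, including PUNCT and SPACE
n_all <- nrow(parsed)

# Denominator 2: word tokens only, excluding PUNCT and SPACE
n_words <- sum(!parsed$pos %in% c("PUNCT", "SPACE"))

pr_noun_all <- n_noun / n_all      # matches the failing value
pr_noun_words <- n_noun / n_words  # matches the expected value

print(round(c(all_tokens = pr_noun_all, words_only = pr_noun_words), 3))
# all_tokens words_only
#      0.214      0.333
```

If this reading is right, filtering out PUNCT and SPACE rows before counting the denominator in covars_make_pos() should bring the two functions into agreement.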