quanteda / quanteda.textmodels

Text scaling and classification models for quanteda

implementation details about textmodel_svm? #47

Open randomgambit opened 3 years ago

randomgambit commented 3 years ago

Hello there!

I hope all is well during these difficult times! I was playing with the great quanteda and discovered the nice textmodel_svm classification model. However, unlike textmodel_nb, which has a little example reproducing Jurafsky's toy case, I cannot find anything similar for textmodel_svm.

Are any additional details available about this function (a quanteda tutorial, a toy example, etc.)? What happens under the hood when textmodel_svm is used with dfms? Can we get back the coefficients for each token?
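
For reference, here is a minimal sketch of the kind of usage I have in mind, patterned on the textmodel_nb toy example (the toy texts, the labels, and the NA convention for the to-be-predicted document are my assumptions, not anything from the textmodel_svm docs):

    library("quanteda")
    library("quanteda.textmodels")

    # the classic IR toy example: four labelled training documents
    # plus one unlabelled document (d5) to classify
    dfmat <- dfm(tokens(c(d1 = "Chinese Beijing Chinese",
                          d2 = "Chinese Chinese Shanghai",
                          d3 = "Chinese Macao",
                          d4 = "Tokyo Japan Chinese",
                          d5 = "Chinese Chinese Chinese Tokyo Japan")))
    y <- factor(c("Y", "Y", "Y", "N", NA), levels = c("Y", "N"))

    tmod <- textmodel_svm(dfmat, y = y)
    coef(tmod)      # one coefficient per token (feature)
    predict(tmod)   # predicted classes, including the unlabelled d5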

Thanks!

randomgambit commented 3 years ago

@kbenoit for instance I see here https://github.com/cran/quanteda.textmodels/blob/a1c52468a8004e9c8a23b67eee9584677f2dab71/tests/testthat/test-textmodel_svm.R that you check that the coefficients should be equal to

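    # tmod is the textmodel_svm object fitted earlier in that test file;
    # coef(tmod) returns the per-feature weights of the fitted model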
    expect_equal(
        coef(tmod)[1, 1:3, drop = FALSE],
        matrix(c(0.5535941, 0.1857624, 0.1857624), nrow = 1,
               dimnames = list(NULL, c("Chinese", "Beijing", "Shanghai"))),
        tol = .0000001
    )

How do you know that? The IR example only deals with Naive Bayes.

Thanks again!

randomgambit commented 3 years ago

actually @kbenoit @koheiw, looking at the manual https://cran.r-project.org/web/packages/LiblineaR/LiblineaR.pdf, it seems that textmodel_svm passes through LiblineaR's default type = 0, which runs a penalized logistic regression, not an SVM. But more generally, I would be interested in finding the reference for the "official" token coefficients shown in the unit test above.

Thanks again! quanteda rocks!

kbenoit commented 3 years ago

Yes, we realised this recently; see #45. It is easily overridden through type, which is passed via ....
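
For example, a sketch reusing the toy dfmat and y from above (type = 1 is LiblineaR's L2-regularized L2-loss support vector classification, per the LiblineaR manual):

    # pass LiblineaR's type through ... to fit an actual linear SVM
    # rather than the default penalized logistic regression (type = 0)
    tmod_svc <- textmodel_svm(dfmat, y = y, type = 1)
    coef(tmod_svc)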

Documentation on how this works (methodologically, I mean) is available in the references listed in textmodel_svm().

randomgambit commented 3 years ago

thanks @kbenoit, I saw the docs, but I was curious to understand where you got the coefficients c(0.5535941, 0.1857624, 0.1857624) in https://github.com/cran/quanteda.textmodels/blob/a1c52468a8004e9c8a23b67eee9584677f2dab71/tests/testthat/test-textmodel_svm.R

Are these values computed in another textbook example, so that you are simply verifying that textmodel_svm reproduces the same correct coefficients? How do you know they are correct?

Thanks!

kbenoit commented 3 years ago

I think they came from running the model outside of the quanteda structure, so we are verifying it against a non-quanteda version of the same model fitted on the same data. Not a very strong test, but it does check whether something went amiss in our wrapper.
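
Something along these lines, i.e. calling LiblineaR directly on the same data; a sketch, assuming the toy dfmat and y from above, and exact agreement also depends on matching textmodel_svm's internal defaults (cost, bias, feature weighting):

    library("LiblineaR")

    # drop the unlabelled document, convert the dfm to a dense matrix,
    # and fit the same LiblineaR model outside quanteda
    train <- !is.na(y)
    mod <- LiblineaR(data = as.matrix(dfmat[train, ]),
                     target = y[train], type = 0)
    mod$W   # per-feature weights (plus a "Bias" column), to compare with coef(tmod)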

We would be delighted to get more critical tests or feedback, if you have any.

randomgambit commented 3 years ago

I am looking for some relevant docs. By the way, out of curiosity, do you know how predict_svm recovers the predicted probabilities, for instance when using penalized logistic classification? Given K classes, does the algorithm fit K one-vs-all classifiers, compute all K probabilities, and then normalize each probability by the sum of these probabilities (so that they indeed sum to one)? What do you think?

kbenoit commented 3 years ago

That's in the paper describing the method, but for multinomial logistic regression (of which the penalised approach is a special version), these are equivalent. The standard way is to compute this as per the last equation in https://en.wikipedia.org/wiki/Multinomial_logistic_regression#As_a_set_of_independent_binary_regressions.
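
In code terms, that amounts to a softmax over the per-class linear scores (in the pivot formulation of the linked equation, the reference class's score is fixed at zero); a minimal sketch with made-up scores standing in for each document's per-class linear predictors:

    # Pr(y_i = k) = exp(s_ik) / sum_j exp(s_ij), computed row-wise
    softmax <- function(scores) {
        escores <- exp(scores - apply(scores, 1, max))  # subtract row max for numerical stability
        escores / rowSums(escores)
    }

    scores <- matrix(c(1.2, 0.3, -0.5, 0.1, 2.0, -1.0), nrow = 2)  # 2 documents, 3 classes
    softmax(scores)   # each row of probabilities sums to one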