stefan-mueller opened this issue 4 years ago

textmodel_svm() does not work when the number of documents used to train the classifier exceeds 66,000 on a MacBook Pro with 32 GB RAM.
Did you try textmodel_svmlin()?

Yes, textmodel_svmlin() works, but since it can only be used for binary classification tasks, it won't fix the problem above (more than two groups).
# use 66,000 documents
dfmat_66000 <- dfmat_sotusent[1:66000, ]
table(dfmat_66000$party)
#>
#> Democratic Democratic-Republican Federalist 
#>      27857                  2959        187 
#> Independent Republican Whig 
#>         437      32623 1937 
dfmat_66000$party_dummy <- ifelse(dfmat_66000$party == "Democratic", "Democratic", "Not Democratic")
# works when recoding the variable to a dummy
tmod_svm_66000 <- textmodel_svmlin(dfmat_66000, y = dfmat_66000$party_dummy)
All SVMs are based on two classes, but the library underlying textmodel_svm() happens to rescale the support vector outputs into probabilities for multiple classes. It would not be too complicated to implement the same for the returns from textmodel_svmlin(). Can you create a new issue for this, and I will try to solve it soon?
textmodel_svmlin() is also much faster. In #14 I'm trying to convince @koheiw to reimplement the C++ code from RSSL without the bloat of the rest of that package; then we can have a single function and make it return rescaled predictions for multiple classes.
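As a side note for readers following the dummy-recoding workaround above: generalising it to more than two groups means one binary fit per class (one-vs-rest). A minimal sketch, assuming predict() on a textmodel_svmlin fit defaults to the training data and returns a class label per document (worth verifying against your installed version):

# hedged sketch: one-vs-rest with the binary textmodel_svmlin(),
# one fit per party
classes <- unique(dfmat_66000$party)
votes <- sapply(classes, function(k) {
  y_bin <- ifelse(dfmat_66000$party == k, k, "other")
  mod <- textmodel_svmlin(dfmat_66000, y = y_bin)
  as.character(predict(mod)) == k  # TRUE where the binary model picks class k
})
# documents picked by zero or several of the binary models are ambiguous;
# breaking those ties is exactly what rescaled scores/probabilities are for
table(rowSums(votes))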
I have also had this error when trying to use the textmodel_svm() function with a large dfm, and I think I have identified where the function fails and why. The dfm I am using is the following:
> tr_tdm
Document-feature matrix of: 804,920 documents, 1,711,423 features (>99.99% sparse) and 1 docvar.
When calling the function with this dfm, I get the Cholmod error 'problem too large':
> model <- textmodel_svm(x=tr_tdm, y=tr_tdm$type, weight="uniform")
Error in asMethod(object) :
Cholmod error 'problem too large' at file ../Core/cholmod_dense.c, line 105
> traceback()
8: asMethod(object)
7: as(x, "matrix")
6: as.matrix.dfm(X)
5: as.matrix(X)
4: apply(x_train, 2, stats::var)
3: which(apply(x_train, 2, stats::var) == 0)
2: textmodel_svm.dfm(x = tr_tdm, y = tr_tdm$type, weight = "uniform")
1: textmodel_svm(x = tr_tdm, y = tr_tdm$type, weight = "uniform")
The traceback shows that the error occurs when calculating which features are constant (zero-variance features) in order to eliminate them from the matrix. To compute the variances, R coerces the sparse dfm into a dense matrix, and during that coercion the Cholmod library checks that the number of cells will not overflow an integer (the leading dimension, which must be >= the number of rows, times the number of columns). Here 804,920 x 1,711,423 is about 1.4e12 cells, far beyond .Machine$integer.max (2,147,483,647), so the coercion fails.
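That dense coercion is avoidable, because column variances can be computed directly on the sparse matrix. A minimal sketch using only the Matrix package (sparse_colvars is an illustrative name, not a package function):

library(Matrix)

# column variances of a sparse matrix without densifying, via the identity
# var(x) = (sum(x^2) - n * mean(x)^2) / (n - 1); note x^2 stays sparse
sparse_colvars <- function(x) {
  n <- nrow(x)
  mu <- Matrix::colMeans(x)
  (Matrix::colSums(x^2) - n * mu^2) / (n - 1)
}

# drop-in replacement for the failing zero-variance check
constant_features <- which(sparse_colvars(x_train) == 0)
if (length(constant_features)) x_train <- x_train[, -constant_features]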
I have also experienced problems converting the dfm to csr format:
#LiblineaR::LiblineaR(as.matrix.csr.dfm(x_train),
# target = y_train, wi = wi, type = type, ...)
> as.matrix.csr <- function(x) {
+ # convert first to column sparse format
+ as.matrix.csr(new("matrix.csc",
+ ra = x@x,
+ ja = x@i + 1L,
+ ia = x@p + 1L,
+ dimension = x@Dim))
+ }
> mtrx <- as.matrix.csr(tr_tdm)
Error: C stack usage 7973588 is too close to the limit
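One thing to note about the wrapper above: it is defined under the same name as SparseM's as.matrix.csr, so the inner call resolves back to the wrapper itself instead of to SparseM. A sketch of the same conversion with the coercion namespace-qualified (dfm2csr is an illustrative name; this mirrors the internal as.matrix.csr.dfm() quoted above and assumes SparseM is installed):

dfm2csr <- function(x) {
  # convert first to column sparse format, then let SparseM coerce to csr
  SparseM::as.matrix.csr(new("matrix.csc",
                             ra = x@x,       # non-zero values
                             ja = x@i + 1L,  # 1-based row indices
                             ia = x@p + 1L,  # 1-based column pointers
                             dimension = x@Dim))
}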
To avoid this problem, I transform this big dfm into an RsparseMatrix from the Matrix package and remove a few lines from the function:
# exclude NA in training labels
#x_train <- suppressWarnings(
# dfm_trim(x[!is.na(y), ], min_termfreq = .0000000001, termfreq_type = "prop")
#)
#y_train <- y[!is.na(y)]
# remove zero-variance features
#constant_features <- which(apply(x_train, 2, stats::var) == 0)
#if (length(constant_features)) x_train <- x_train[, -constant_features]
> x_train <- as(x_train, "RsparseMatrix")
> LiblineaR::LiblineaR(x_train, target = y_train, wi = wi, type = type, ...)
After testing the function locally with the above changes (removing the zero-variance feature calculation and converting the dfm to RsparseMatrix format), I can confirm that it works correctly with the dfm used in this example.
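Pulled together, the local patch amounts to the following (a sketch assuming x is the dfm and y the label vector, mirroring the internals quoted above; whether your installed LiblineaR version accepts an RsparseMatrix directly is worth verifying):

# keep the NA filtering, skip the dense zero-variance check, and hand
# LiblineaR a row-oriented sparse matrix (type = 1 as used in this thread)
x_train <- suppressWarnings(
  dfm_trim(x[!is.na(y), ], min_termfreq = .0000000001, termfreq_type = "prop")
)
y_train <- y[!is.na(y)]
x_train <- as(x_train, "RsparseMatrix")
mod <- LiblineaR::LiblineaR(x_train, target = y_train, type = 1)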
To reproduce the error, I uploaded a large dfm to Google Drive:
library("quanteda")
library("quanteda.textmodels")
library("googledrive")
temp_rds_file <- tempfile(fileext = ".rds")
d_rds_file <- drive_download(
as_id("1E4ZPUbR98vLW5hmL0GYYQ-GYPDP7vriR"), path = temp_r_file, overwrite = TRUE)
tdm <- readRDS(d_rds_file$local_path)
tmod <- textmodel_svm(tdm, y = tdm$type, weight = "uniform", verbose = TRUE)
Hi! Just wanted to comment that I too had issues with textmodel_svm(), although in my case it was just ballooning memory use and an unceremonious crash of R each time I ran it with my full data set. Cutting the size of my data set led to no crashes. Creating my own duplicate of the textmodel_svm function with the variance-check lines commented out led to the modified function working correctly on my full data set, with an order of magnitude less memory used (maybe around 150 MB, as opposed to 4 GB). I didn't need to make any further changes, like the previous poster did.
Is that constant variance check line somehow un-sparsifying the matrix?
In a sense, it's easy to work around by just converting the dfm itself into matrix.csr format and using LiblineaR directly, but it's nice to present code to students that involves fewer packages and conversion steps, so it would be nice if the quanteda command worked without error. Thanks for all your work on this package!
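On the un-sparsifying question: the traceback earlier in the thread shows apply() going through as.matrix.dfm(), so the variance check does densify the dfm. A quick way to see the size difference on a small built-in corpus (a sketch; exact sizes will vary by quanteda version):

library(quanteda)

# compare a small dfm's sparse size with its dense copy; apply(x, 2, var)
# triggers exactly this as.matrix() coercion
dfmat <- dfm(tokens(data_corpus_inaugural))
format(object.size(dfmat), units = "MB")             # sparse representation
format(object.size(as.matrix(dfmat)), units = "MB")  # dense copy, far larger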
Thanks @thatchermo - we are still working out kinks in this function when it comes to larger matrix sizes.
When you say "with the variance check lines commented out", do you mean these lines? https://github.com/quanteda/quanteda.textmodels/blob/17f1c84f6b5fd3c063c2389e3259e907bacc3957/R/textmodel_svm.R#L63-L65
It would be great to have a reproducible example for what you describe, so we can run more tests. If the dataset cannot be posted here, feel free to email me outside of GitHub.
@kbenoit proxyC::colSds() == 0 would be better for large sparse matrices.
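That would replace the failing check with something like this (a sketch; proxyC::colSds() computes column standard deviations directly on the sparse matrix, so nothing is densified):

constant_features <- which(proxyC::colSds(x_train) == 0)
if (length(constant_features)) x_train <- x_train[, -constant_features]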
Here's an example and data (it came from Kaggle originally). Note that on a beefier machine (16 GB RAM), each try did run, but the unmodified textmodel_svm resulted in 11.5 GB of memory use, most of which disappeared when I ran gc(). I didn't realize at first that the low memory use I saw from LiblineaR used directly seemed to be from using type=2 rather than type=1. But it's still the case that commenting out the variance-checking lines resulted in less memory use, even with type=1 in LiblineaR. svm_example.zip