quanteda / quanteda.textmodels

Text scaling and classification models for quanteda

textmodel_svm() fails when number of documents identified to train exceeds 66,000 #23

Open stefan-mueller opened 4 years ago

stefan-mueller commented 4 years ago

textmodel_svm() does not work when the number of documents used to train the classifier exceeds 66,000, on a MacBook Pro with 32 GB of RAM.

library(quanteda)
#> Package version: 2.0.1
#> Parallel computing: 2 of 12 threads used.
#> See https://quanteda.io for tutorials and examples.
#> 
#> Attaching package: 'quanteda'
#> The following object is masked from 'package:utils':
#> 
#>     View
library(quanteda.textmodels)
library(quanteda.corpora) # use SOTU corpus

# convert corpus to level of sentences to increase number of documents
corp_sotusent <- corpus_reshape(data_corpus_sotu, to = "sentences")

# create a dfm
dfmat_sotusent <- corp_sotusent %>% 
    dfm(remove_punct = TRUE, remove = stopwords("en"),
        remove_numbers = TRUE) %>% 
    dfm_trim(min_docfreq = 5, min_termfreq = 5)

# use 65,500 documents
dfmat_65500 <- dfmat_sotusent[1:65500, ]

# works
tmod_svm_65500 <- textmodel_svm(dfmat_65500, y = dfmat_65500$party)
tmod_svm_65500
#> 
#> Call:
#> textmodel_svm.dfm(x = dfmat_65500, y = dfmat_65500$party)
#> 
#> 65,500 training documents; 69,018 fitted features.
#> Method: L2-regularized logistic regression primal (L2R_LR)

# use 66,000 documents
dfmat_66000 <- dfmat_sotusent[1:66000, ]

tmod_svm_66000 <- textmodel_svm(dfmat_66000, y = dfmat_66000$party)
#> Error in asMethod(object) : 
#>  Cholmod error 'problem too large' at file ../Core/cholmod_dense.c, line 105
kbenoit commented 4 years ago

Did you try textmodel_svmlin()?

stefan-mueller commented 4 years ago

Yes, textmodel_svmlin() works, but since it can only be used for binary classification tasks, it won't solve the problem above (more than two groups).


# use 66,000 documents
dfmat_66000 <- dfmat_sotusent[1:66000, ]

table(dfmat_66000$party)
#> 
#>            Democratic Democratic-Republican            Federalist 
#>                 27857                  2959                   187 
#>           Independent            Republican                  Whig 
#>                   437                 32623                  1937

dfmat_66000$party_dummy <- ifelse(dfmat_66000$party == "Democratic", "Democratic", "Not Democratic")

# works when recoding the variable to a dummy
tmod_svm_66000 <- textmodel_svmlin(dfmat_66000, y = dfmat_66000$party_dummy)
kbenoit commented 4 years ago

All SVMs are based on two classes, but the library underlying textmodel_svm() happens to rescale the support vectors into probabilities for multiple classes. It would not be too complicated to implement the same for the returns from textmodel_svmlin(). Can you create a new issue for this, and I will try to solve it soon?

textmodel_svmlin() is also much faster. In #14 I'm trying to convince @koheiw to reimplement the C++ code from RSSL without the bloat of the rest of that package; then we can have a single function and make it return rescaled predictions for multiple classes.
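
To make the one-vs-rest idea concrete, a minimal sketch (illustrative only, reusing dfmat_66000 from the example above) would fit one binary textmodel_svmlin() per class; combining the per-class fits into multiclass probabilities is the rescaling step that is not yet implemented:

# fit one binary "class vs. rest" model per party (illustrative sketch)
classes <- unique(as.character(dfmat_66000$party))
tmod_ovr <- lapply(classes, function(cl) {
    y_binary <- ifelse(dfmat_66000$party == cl, cl, "rest")
    textmodel_svmlin(dfmat_66000, y = y_binary)
})
names(tmod_ovr) <- classes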

Fcabla commented 3 years ago

I have also run into this error when trying to use the textmodel_svm() function with a large dfm, and I think I have identified where and why the function fails. The dfm I am using is the following:

> tr_tdm
Document-feature matrix of: 804,920 documents, 1,711,423 features (>99.99% sparse) and 1 docvar.

When calling the function with this dfm, I get the Cholmod 'problem too large' error:

> model <- textmodel_svm(x=tr_tdm, y=tr_tdm$type, weight="uniform")
Error in asMethod(object) : 
  Cholmod error 'problem too large' at file ../Core/cholmod_dense.c, line 105

> traceback()
8: asMethod(object)
7: as(x, "matrix")
6: as.matrix.dfm(X)
5: as.matrix(X)
4: apply(x_train, 2, stats::var)
3: which(apply(x_train, 2, stats::var) == 0)
2: textmodel_svm.dfm(x = tr_tdm, y = tr_tdm$type, weight = "uniform")
1: textmodel_svm(x = tr_tdm, y = tr_tdm$type, weight = "uniform")

The traceback shows that the error occurs when determining which features are constant (zero-variance) in order to remove them from the matrix. To compute the variances, R converts the dfm into a dense matrix, and during that conversion the Cholmod library checks that the number of columns times the leading dimension (leading dimension >= number of rows) will not cause an integer overflow.
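
One way to keep that check sparse (a sketch, not the package code; it assumes x_train is the dfm being fitted and computes the column-wise sample variance from colSums(x) and colSums(x * x)):

# zero-variance check without converting the dfm to a dense matrix
m <- as(x_train, "dgCMatrix")            # stay in sparse format
n <- nrow(m)
col_mean <- Matrix::colSums(m) / n
col_meansq <- Matrix::colSums(m * m) / n
col_var <- (col_meansq - col_mean ^ 2) * n / (n - 1)
# small tolerance guards against floating-point cancellation
constant_features <- which(col_var < .Machine$double.eps)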

I have also experienced problems converting the dfm to csr format:

#LiblineaR::LiblineaR(as.matrix.csr.dfm(x_train),
#                                         target = y_train, wi = wi, type = type, ...)
> as.matrix.csr <- function(x) {
+     # convert first to column sparse format
+     as.matrix.csr(new("matrix.csc",
+                       ra = x@x,
+                       ja = x@i + 1L,
+                       ia = x@p + 1L,
+                       dimension = x@Dim))
+ }
> mtrx <- as.matrix.csr(tr_tdm)
Error: C stack usage  7973588 is too close to the limit

To avoid this problem, I convert this big dfm into an RsparseMatrix from the Matrix package and remove a few lines from the function:

  # exclude NA in training labels
  #x_train <- suppressWarnings(
  #  dfm_trim(x[!is.na(y), ], min_termfreq = .0000000001, termfreq_type = "prop")
  #)
  #y_train <- y[!is.na(y)]

  # remove zero-variance features
  #constant_features <- which(apply(x_train, 2, stats::var) == 0)
  #if (length(constant_features)) x_train <- x_train[, -constant_features]
> x_train <- as(x_train, "RsparseMatrix")
> LiblineaR::LiblineaR(x_train, target = y_train, wi = wi, type = type, ...)

After testing the function locally with the above changes (removing the constant-feature variance calculation and converting the dfm to RsparseMatrix format), I can confirm that the function works correctly with the dfm used in this example.
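
Put together, the workaround described above looks roughly like this (a sketch rather than the package code; x and y stand for the dfm and its labels, and it assumes, as reported above, that the installed LiblineaR version accepts an RsparseMatrix):

library(Matrix)
library(LiblineaR)

# drop documents with missing labels
x_train <- x[!is.na(y), ]
y_train <- y[!is.na(y)]

# row-compressed sparse format; no dense copy, no dense variance check
x_train <- as(x_train, "RsparseMatrix")
# type = 1 is L2-regularized L2-loss SVC (dual); choose the solver you need
model <- LiblineaR(data = x_train, target = y_train, type = 1)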

To reproduce the error, I uploaded a large dfm to Google Drive:

library("quanteda")
library("quanteda.textmodels")
library(googledrive)
temp_rds_file <- tempfile(fileext = ".rds")
d_rds_file <- drive_download(
  as_id("1E4ZPUbR98vLW5hmL0GYYQ-GYPDP7vriR"), path = temp_r_file, overwrite = TRUE)
tdm <- readRDS(d_rds_file$local_path)
tmod <- textmodel_svm(tdm, y = tdm$type, weight = "uniform", verbose=TRUE)
thatchermo commented 2 years ago

Hi! Just wanted to comment that I too had issues with textmodel_svm(), although in my case it was just ballooning memory use and an unceremonious crash of R each time I ran it with my full data set. Cutting the size of my data set led to no crashes. Creating my own duplicate of the textmodel_svm() function with the variance-check lines commented out led to the modified function working correctly on my full data set, with an order of magnitude less memory used (maybe around 150 MB, as opposed to 4 GB). I didn't need to make any of the further changes that the previous poster did.

Is that constant variance check line somehow un-sparsifying the matrix?

In a sense, it's easy to work around by just converting the dfm itself into matrix.csr format and using LiblineaR directly, but it's nice to present students with code that involves fewer packages and conversion steps, so it would be nice if the quanteda command worked without error. Thanks for all your work on this package!
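
For reference, that direct route might look like this (a sketch; dfmat and labels are placeholder names, and the helper calls SparseM::as.matrix.csr explicitly under its own name, which avoids the accidental self-recursion in the snippet earlier in the thread):

library(SparseM)
library(LiblineaR)

# convert a dfm (column-compressed sparse) to SparseM's row-compressed csr format
dfm_to_csr <- function(x) {
    SparseM::as.matrix.csr(new("matrix.csc",
                               ra = x@x,
                               ja = x@i + 1L,
                               ia = x@p + 1L,
                               dimension = x@Dim))
}

mod <- LiblineaR(data = dfm_to_csr(dfmat), target = labels, type = 1)
pred <- predict(mod, newx = dfm_to_csr(dfmat))$predictions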

kbenoit commented 2 years ago

Thanks @thatchermo - we are still working out kinks in this function when it comes to larger matrix sizes.

When you say "with the variance check lines commented out" do you mean https://github.com/quanteda/quanteda.textmodels/blob/17f1c84f6b5fd3c063c2389e3259e907bacc3957/R/textmodel_svm.R#L63-L65

It would be great to have a reproducible example for what you describe, so we can run more tests. If the dataset cannot be posted here, feel free to email me outside of GitHub.

koheiw commented 2 years ago

@kbenoit proxyC::colSds() == 0 would be better for large sparse matrices.
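
That would amount to replacing the dense variance check with something like this (a sketch; it assumes a proxyC version that provides colSds() for sparse matrices):

constant_features <- which(proxyC::colSds(x_train) == 0)
if (length(constant_features)) x_train <- x_train[, -constant_features]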

thatchermo commented 2 years ago

Here's an example and data (originally from Kaggle). Note that on a beefier machine (16 GB RAM), each try did run, but the unmodified textmodel_svm() resulted in 11.5 GB of memory use, most of which disappeared when I ran gc(). I didn't realize at first that the low memory use I saw when using LiblineaR directly came from using type = 2 rather than type = 1. But it's still the case that commenting out the variance-checking lines resulted in less memory use, even with type = 1 in LiblineaR. svm_example.zip