saraswatmks / superml

Build machine learning models in R like using python's scikit-learn library
https://saraswatmks.github.io/superml/
GNU General Public License v3.0

TfIdfVectorizer Transform Problem #32

Closed: jonyclee closed this issue 4 years ago

jonyclee commented 4 years ago

Hi Administrator,

I was using the superml package. I trained the TfIdfVectorizer and was able to transform the training set with no problem; however, when I transform the testing set, I get the same TfIdf matrix as for the training set.

In addition to this error, I ran into a bug trying to install the latest version on Windows. When I try to install the package, I get the error below. However, it installs perfectly fine on a Mac.

C:/Rtools/mingw_64/x86_64-w64-mingw32/include/c++/bits/c++0x_warning.h:32:2: error: #error This file requires compiler and library support for the ISO C++ 2011 standard. This support is currently experimental, and must be enabled with the -std=c++11 or -std=gnu++11 compiler options.
 #error This file requires compiler and library support for the \
  ^
utils.cpp: In function 'Rcpp::CharacterVector superSplit(std::string, char)':
utils.cpp:15:25: error: 'move' is not a member of 'std'
     elems.push_back(std::move(item));
     ^
utils.cpp: In function 'std::vector<std::basic_string > superNgrams(std::string, Rcpp::NumericVector, char)':
utils.cpp:45:68: error: '>>' should be '> >' within a nested template argument list
     std::vector rx = as<std::vector>(r);
     ^
utils.cpp: In function 'std::vector<std::basic_string > superTokenizer(std::vector<std::basic_string >)':
utils.cpp:63:9: warning: 'auto' changes meaning in C++11; please remove it [-Wc++0x-compat]
     for(auto i: string){
     ^
utils.cpp:63:14: error: 'i' does not name a type
     for(auto i: string){
     ^
utils.cpp:72:5: error: expected ';' before 'return'
     return output;
     ^
utils.cpp:72:5: error: expected primary-expression before 'return'
utils.cpp:72:5: error: expected ';' before 'return'
utils.cpp:72:5: error: expected primary-expression before 'return'
utils.cpp:72:5: error: expected ')' before 'return'
utils.cpp: In function 'Rcpp::NumericMatrix superCountMatrix(std::vector<std::basic_string >, std::vector<std::basic_string >)':
utils.cpp:82:20: warning: comparison between signed and unsigned integer expressions [-Wsign-compare]
     for(int i=0; i < sent.size(); i++){
     ^
utils.cpp:85:24: warning: comparison between signed and unsigned integer expressions [-Wsign-compare]
     for(int j=0; j < tokens.size(); j++){
     ^
utils.cpp:86:13: error: 'regex' was not declared in this scope
     regex e = std::regex("\b" + tokens[j] + "\b");
     ^
utils.cpp:87:42: error: 'e' was not declared in this scope
     string m = regex_replace (s, e, "");
     ^
utils.cpp:87:47: error: 'regex_replace' was not declared in this scope
     string m = regex_replace (s, e, "");
     ^
utils.cpp: In function 'std::vector<std::basic_string > superTokenizer(std::vector<std::basic_string >)':
utils.cpp:74:1: warning: control reaches end of non-void function [-Wreturn-type]
 }
 ^
make: *** [C:/PROGRA~1/MICROS~3/ROPEN~1/R-35~1.3/etc/x64/Makeconf:215: utils.o] Error 1
ERROR: compilation failed for package 'superml'

Can you please help? Thanks.

Best, jonyclee
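A note for anyone reading the log above: every construct the compiler rejects (`std::move`, the range-based `for` with `auto`, `>>` closing a nested template, `std::regex`) is a C++11 feature, which matches the first `#error` line asking for `-std=c++11` or `-std=gnu++11`. For an R package that usually means requesting C++11 in the build configuration (for example `CXX_STD = CXX11` in `src/Makevars`), though whether that is the right fix for this particular toolchain is an assumption. The sketch below is not superml's `utils.cpp`; `split_on` and `count_matrix` are hypothetical helpers that only exercise the same features, so they should compile with `-std=c++11` and fail in similar ways on a compiler that defaults to C++98.

```cpp
// Hypothetical helpers (not superml's code) that use the C++11 features
// named in the log. Compile with, e.g.: g++ -std=c++11 -c demo.cpp
#include <cassert>
#include <cstddef>
#include <iterator>
#include <regex>
#include <sstream>
#include <string>
#include <utility>
#include <vector>

// Split `text` on `sep`, moving each piece into the result (std::move).
std::vector<std::string> split_on(const std::string& text, char sep) {
    std::vector<std::string> elems;
    std::istringstream stream(text);
    std::string item;
    while (std::getline(stream, item, sep)) {
        elems.push_back(std::move(item));
        item.clear();  // put the moved-from string back in a known state
    }
    return elems;
}

// Count whole-word matches of each token in each sentence.
// Rows are sentences, columns are tokens. Uses std::regex, a range-based
// for with auto, and ">>" closing a nested template argument list.
// Assumes tokens are non-empty and contain no regex metacharacters.
std::vector<std::vector<std::size_t>> count_matrix(
        const std::vector<std::string>& sentences,
        const std::vector<std::string>& tokens) {
    std::vector<std::vector<std::size_t>> counts;
    for (const auto& sentence : sentences) {
        std::vector<std::size_t> row;
        for (const auto& token : tokens) {
            assert(!token.empty());
            const std::regex pattern("\\b" + token + "\\b");
            auto first = std::sregex_iterator(sentence.begin(), sentence.end(), pattern);
            auto last = std::sregex_iterator();
            row.push_back(static_cast<std::size_t>(std::distance(first, last)));
        }
        counts.push_back(row);
    }
    return counts;
}
```

Calling `count_matrix({"the cat sat", "the the dog"}, {"the", "cat"})` counts only whole words, so a sentence containing "other" would not count as a match for "the".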

saraswatmks commented 4 years ago

Hi @jonyclee, thanks for the issue. A few things:

  1. Which superml version are you using?
  2. Please provide sessionInfo() output.

TfIdfVectorizer shouldn't give the same output for both sets. I fixed that in a recent version, although I have since found another bug that could have affected the output, and I've fixed that too. For now, you can use the dev version of superml. You can simply do:

devtools::install_github("saraswatmks/superml")
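
To make concrete why the two matrices should differ: fitting learns a vocabulary and IDF weights from the training documents only, and transform then scores whichever documents it is given against that fixed vocabulary. The sketch below is a minimal illustration and not superml's implementation; the `TfIdfSketch` struct, the whitespace tokenisation, the raw term counts, and the smoothed formula idf = ln((1 + n) / (1 + df)) + 1 are all assumptions chosen for brevity.

```cpp
// Hypothetical fit/transform sketch (not superml's code).
#include <cassert>
#include <cmath>
#include <cstddef>
#include <map>
#include <set>
#include <sstream>
#include <string>
#include <vector>

struct TfIdfSketch {
    std::map<std::string, std::size_t> vocab;  // term -> column index
    std::vector<double> idf;                   // one weight per column

    // Split a document on whitespace.
    static std::vector<std::string> tokenize(const std::string& doc) {
        std::istringstream in(doc);
        std::vector<std::string> tokens;
        std::string tok;
        while (in >> tok) tokens.push_back(tok);
        return tokens;
    }

    // Learn the vocabulary and IDF weights from the training documents only.
    void fit(const std::vector<std::string>& docs) {
        vocab.clear();
        idf.clear();
        std::map<std::string, std::size_t> doc_freq;  // documents containing each term
        for (const auto& doc : docs) {
            const auto tokens = tokenize(doc);
            const std::set<std::string> unique(tokens.begin(), tokens.end());
            for (const auto& term : unique) ++doc_freq[term];
        }
        const double n = static_cast<double>(docs.size());
        for (const auto& entry : doc_freq) {
            const double df = static_cast<double>(entry.second);
            vocab[entry.first] = idf.size();
            idf.push_back(std::log((1.0 + n) / (1.0 + df)) + 1.0);
        }
    }

    // Score the given documents against the fitted vocabulary.
    // Terms that were not seen during fit are ignored.
    std::vector<std::vector<double>> transform(const std::vector<std::string>& docs) const {
        assert(idf.size() == vocab.size());
        std::vector<std::vector<double>> rows;
        for (const auto& doc : docs) {
            std::vector<double> row(idf.size(), 0.0);
            for (const auto& term : tokenize(doc)) {
                const auto it = vocab.find(term);
                if (it != vocab.end()) row[it->second] += idf[it->second];
            }
            rows.push_back(row);
        }
        return rows;
    }
};
```

Under these assumptions, fitting on "apple banana" and "apple cherry" and then transforming the single test document "cherry cherry durian" yields one row with three columns: the column count comes from the training vocabulary, while the row count and the values come from the test set. A transform that returns the training matrix for test input is therefore likely ignoring its argument.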
jonyclee commented 4 years ago

Hi @saraswatmks ,

Thank you for the quick reply. I am using superml v0.5.2. In addition, as I was playing around with the program yesterday, I noticed that GridSearchCV does not allow for LMTrainer. I was just wondering why? Once again, thank you so much for your time and help.

My sessionInfo output below.

R version 3.5.3 (2019-03-11)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 18362)

Matrix products: default

locale:
[1] LC_COLLATE=English_United States.1252 LC_CTYPE=English_United States.1252 LC_MONETARY=English_United States.1252 LC_NUMERIC=C
[5] LC_TIME=English_United States.1252

attached base packages:
[1] stats graphics grDevices utils datasets methods base

other attached packages:
[1] e1071_1.7-1 nnet_7.3-12 superml_0.5.2 R6_2.3.0 readxl_1.3.1 usethis_1.5.0 devtools_2.0.2 emcresearch_1.3.1
[9] dplyr_0.8.0.1 haven_2.1.0 readr_1.3.1 forcats_0.4.0 openxlsx_4.1.0 data.table_1.12.2 anesrake_0.80 weights_1.0
[17] mice_3.4.0 gdata_2.18.0 Hmisc_4.2-0 ggplot2_3.1.1 Formula_1.2-3 survival_2.43-3 lattice_0.20-38 RevoUtils_11.0.3
[25] RevoUtilsMath_11.0.0

loaded via a namespace (and not attached):
[1] nlme_3.1-137 fs_1.2.7 doParallel_1.0.14 RColorBrewer_1.1-2 rprojroot_1.3-2 tools_3.5.3 backports_1.1.4 rpart_4.1-13
[9] lazyeval_0.2.2 colorspace_1.4-1 jomo_2.6-7 withr_2.1.2 tidyselect_0.2.5 gridExtra_2.3 prettyunits_1.0.2 processx_3.3.0
[17] curl_3.3 compiler_3.5.3 cli_1.1.0 htmlTable_1.13.1 desc_1.2.0 scales_1.0.0 checkmate_1.9.1 callr_3.2.0
[25] stringr_1.4.0 digest_0.6.18 foreign_0.8-71 minqa_1.2.4 base64enc_0.1-3 pkgconfig_2.0.2 htmltools_0.3.6 lme4_1.1-21
[33] sessioninfo_1.1.1 htmlwidgets_1.3 rlang_0.4.0 rstudioapi_0.10 generics_0.0.2 gtools_3.8.1 acepack_1.4.1 zip_2.0.1
[41] magrittr_1.5 Matrix_1.2-15 Rcpp_1.0.1 munsell_0.5.0 Metrics_0.1.4 stringi_1.4.3 MASS_7.3-51.1 pkgbuild_1.0.3
[49] plyr_1.8.4 grid_3.5.3 parallel_3.5.3 mitml_0.3-7 crayon_1.3.4 splines_3.5.3 hms_0.4.2 knitr_1.22
[57] ps_1.3.0 pillar_1.3.1 boot_1.3-20 codetools_0.2-16 pkgload_1.0.2 pan_1.6 glue_1.3.1 latticeExtra_0.6-28
[65] remotes_2.0.4 foreach_1.5.1 nloptr_1.2.1 cellranger_1.1.0 testthat_2.0.1 gtable_0.3.0 purrr_0.3.2 tidyr_0.8.3
[73] assertthat_0.2.1 xfun_0.6 broom_0.5.2 class_7.3-15 tibble_2.1.1 iterators_1.0.11 memoise_1.1.0 cluster_2.0.7-1

Best, jonyclee