quanteda / quanteda.classifiers

quanteda textmodel extensions for classifying documents

Unable to transform Object to Python #13

Closed msaeltzer closed 4 years ago

msaeltzer commented 5 years ago

Hey quanteda team, awesome work you are doing! I was playing around with your package after seeing the presentation at QTA Dublin, but noticed that the keras implementation of nnseq doesn't work (the example is not reproducible). I assume this is because keras does not accept sparse matrices at this point. I worked around it by converting the dfm into a regular matrix inside a debugged function, but I would hope to keep the high-speed utility of quanteda dfms. I'm aware this package is not finished, but I already find it very convenient to stay in the quanteda environment. I would be very interested to see how you implement this and look forward to your work.

x1 <- convert(x, to = 'matrix')
history <- fit(model, x1, y2, ...)

https://github.com/quanteda/quanteda.classifiers/blob/6bf814ce7775c77cda2a2aa733e78a03fe49c689/R/textmodel_nnseq.R#L83

Best regards, Marius
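The dense-matrix workaround described above can be sketched in isolation as follows. This is a minimal illustration under stated assumptions, not the package's implementation: the two toy sentences and the names `dfmat_toy`, `x_dense`, and `dense_bytes` are made up for the example, and `as.matrix()` is used as an equivalent of `convert(x, to = "matrix")`. The point is the memory cost: a dense copy stores every zero cell, so a 7,322 x 14,555 dfm like the one in the example would need roughly 0.85 GB for the numeric cells alone.

```r
library(quanteda)

# Toy dfm standing in for the tf-idf-weighted dfm from the example
# (illustrative data, not taken from the thread).
toks <- tokens(c(d1 = "immigration controls are essential",
                 d2 = "we will reform current immigration laws"))
dfmat_toy <- dfm(toks)

# Dense copy: every zero cell is now stored explicitly.
x_dense <- as.matrix(dfmat_toy)

# Lower bound on the size of the dense copy: rows * columns * 8 bytes
# per double. Worth estimating before converting a large dfm.
dense_bytes <- as.numeric(ndoc(dfmat_toy)) * nfeat(dfmat_toy) * 8
dense_bytes
```

Whether the conversion is needed at all seems to depend on the local setup, since the same example runs unmodified on other machines later in this thread.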

JBGruber commented 5 years ago

FYI, I don't have any issues with the example for textmodel_nnseq:

library(quanteda)
#> Package version: 1.5.0
#> Parallel computing: 2 of 4 threads used.
#> See https://quanteda.io for tutorials and examples.
#> 
#> Attaching package: 'quanteda'
#> The following object is masked from 'package:utils':
#> 
#>     View
library(quanteda.corpora)
library(quanteda.classifiers)

# create a dataset with evenly balanced coded and uncoded immigration sentences
corpcoded <- corpus_subset(data_corpus_manifestosentsUK, !is.na(crowd_immigration_label))
corpuncoded <- data_corpus_manifestosentsUK %>%
  corpus_subset(is.na(crowd_immigration_label) & year > 1980) %>%
  corpus_sample(size = ndoc(corpcoded))
corp <- corpcoded + corpuncoded

# form a tf-idf-weighted dfm
dfmat <- dfm(corp) %>%
  dfm_tfidf()

set.seed(1000)
tmod <- textmodel_nnseq(dfmat, y = docvars(dfmat, "crowd_immigration_label"),
                        epochs = 5, verbose = 1)
pred <- predict(tmod, newdata = dfm_subset(dfmat, is.na(crowd_immigration_label)))
table(pred)
#> pred
#>     Immigration Not immigration 
#>              77            7245
tail(texts(corpuncoded)[pred == "Immigration"], 10)
#>                                                                                                                                                                                                                                                                                                                      Con_1987_779 
#>                                                                                                                                                                                                                              "Firm but fair immigration controls are essential for harmonious and improving community relations." 
#>                                                                                                                                                                                                                                                                                                                      BNP_2005_897 
#>                                                                                    "At present rates, immigration requires the equivalent of a city the size of Birmingham to be built every five years, and implies that by 2050, Britain will have a population of 90 million people, reducing our country to a tarmac desert." 
#>                                                                                                                                                                                                                                                                                                                      Con_1992_933 
#>                                                                                                                                    "This will include a workable appeal system for applicants under which those with manifestly unfounded claims will be returned quickly to their own country or to the country they came from." 
#>                                                                                                                                                                                                                                                                                                                      SNP_2017_803 
#>                                                                                                                                                                                                                                          "We need to fundamentally change the UK government's system for housing asylum seekers." 
#>                                                                                                                                                                                                                                                                                                                      BNP_2005_967 
#>                                                                                                    "The ‘Clash of Civilisations’ The BNP is widely known as the only British political party warning of the danger posed to our democracy, traditions and freedoms by the creeping Islamification and dhimmitude of Britain." 
#>                                                                                                                                                                                                                                                                                                                      Con_2017_757 
#>                           "A COUNTRY THAT COMES TOGETHER  Controlling immigration Britain is an open economy and a welcoming society and we will always ensure that our British businesses can recruit the brightest and best from around  the world and Britain's world-class universities can attract international  students." 
#>                                                                                                                                                                                                                                                                                                                       Gr_2005_422 
#>                                                                                                                                                                                             "Policy Policy community: Brian Heatley, Danny Bates, Hugo Charlton, Jonathan Dixon, Alan Francis, Molly Scott-Cato, John Whitelegg." 
#>                                                                                                                                                                                                                                                                                                                      Con_2015_367 
#>                                                                                                                "To help communities experiencing high and unexpected volumes of immigration, we will introduce a new Controlling Migration Fund to ease pressures on services and to pay for additional immigration enforcement." 
#>                                                                                                                                                                                                                                                                                                                       LD_1987_115 
#> "There should be effective rights of appeal against refusal of citizenship and referral to an independent body in cases of deportation, and immigration procedures should be revised so as to promote family unity without significantly affecting immigration totals, which remain lower than rates of emigration from Britain." 
#>                                                                                                                                                                                                                                                                                                                       LD_2001_845 
#>                                                                                                                                                                                                                                                       "We will reform current immigration laws so that families are not divided."

Maybe we are using different versions of keras?

keras:::keras_version()
#> [1] '2.2.4'

Created on 2019-07-05 by the reprex package (v0.3.0)

Session info

``` r
devtools::session_info()
#> ─ Session info ──────────────────────────────────────────────────────────
#>  setting  value
#>  version  R version 3.6.0 (2019-04-26)
#>  os       Ubuntu 18.04.2 LTS
#>  system   x86_64, linux-gnu
#>  ui       X11
#>  language en_GB:en
#>  collate  en_GB.UTF-8
#>  ctype    en_GB.UTF-8
#>  tz       Europe/Berlin
#>  date     2019-07-05
#>
#> ─ Packages ──────────────────────────────────────────────────────────────
#>  package              * version  date       lib source
#>  assertthat             0.2.1    2019-03-21 [1] CRAN (R 3.6.0)
#>  backports              1.1.4    2019-04-10 [1] CRAN (R 3.6.0)
#>  base64enc              0.1-3    2015-07-28 [1] CRAN (R 3.6.0)
#>  callr                  3.3.0    2019-07-04 [1] CRAN (R 3.6.0)
#>  cli                    1.1.0    2019-03-19 [1] CRAN (R 3.6.0)
#>  cluster                2.1.0    2019-06-19 [4] CRAN (R 3.6.0)
#>  colorspace             1.4-1    2019-03-18 [1] CRAN (R 3.6.0)
#>  crayon                 1.3.4    2017-09-16 [1] CRAN (R 3.6.0)
#>  data.table             1.12.2   2019-04-07 [1] CRAN (R 3.6.0)
#>  desc                   1.2.0    2018-05-01 [1] CRAN (R 3.6.0)
#>  devtools               2.0.2    2019-04-08 [1] CRAN (R 3.6.0)
#>  digest                 0.6.20   2019-07-04 [1] CRAN (R 3.6.0)
#>  dplyr                  0.8.3    2019-07-04 [1] CRAN (R 3.6.0)
#>  evaluate               0.14     2019-05-28 [1] CRAN (R 3.6.0)
#>  fastmatch              1.1-0    2017-01-28 [1] CRAN (R 3.6.0)
#>  fs                     1.3.1    2019-05-06 [1] CRAN (R 3.6.0)
#>  generics               0.0.2    2018-11-29 [1] CRAN (R 3.6.0)
#>  ggplot2                3.2.0    2019-06-16 [1] CRAN (R 3.6.0)
#>  glue                   1.3.1    2019-03-12 [1] CRAN (R 3.6.0)
#>  gtable                 0.3.0    2019-03-25 [1] CRAN (R 3.6.0)
#>  highr                  0.8      2019-03-20 [1] CRAN (R 3.6.0)
#>  htmltools              0.3.6    2017-04-28 [1] CRAN (R 3.6.0)
#>  jsonlite               1.6      2018-12-07 [1] CRAN (R 3.6.0)
#>  keras                  2.2.4.1  2019-04-05 [1] CRAN (R 3.6.0)
#>  kernlab                0.9-27   2018-08-10 [1] CRAN (R 3.6.0)
#>  knitr                  1.23     2019-05-18 [1] CRAN (R 3.6.0)
#>  lattice                0.20-38  2018-11-04 [4] CRAN (R 3.5.1)
#>  lazyeval               0.2.2    2019-03-15 [1] CRAN (R 3.6.0)
#>  LiblineaR              2.10-8   2017-02-13 [1] CRAN (R 3.6.0)
#>  lubridate              1.7.4    2018-04-11 [1] CRAN (R 3.6.0)
#>  magrittr               1.5      2014-11-22 [1] CRAN (R 3.6.0)
#>  MASS                   7.3-51.4 2019-04-26 [1] CRAN (R 3.6.0)
#>  Matrix                 1.2-17   2019-03-22 [4] CRAN (R 3.5.3)
#>  memoise                1.1.0    2017-04-21 [1] CRAN (R 3.6.0)
#>  munsell                0.5.0    2018-06-12 [1] CRAN (R 3.6.0)
#>  pillar                 1.4.2    2019-06-29 [1] CRAN (R 3.6.0)
#>  pkgbuild               1.0.3    2019-03-20 [1] CRAN (R 3.6.0)
#>  pkgconfig              2.0.2    2018-08-16 [1] CRAN (R 3.6.0)
#>  pkgload                1.0.2    2018-10-29 [1] CRAN (R 3.6.0)
#>  plyr                   1.8.4    2016-06-08 [1] CRAN (R 3.6.0)
#>  prettyunits            1.0.2    2015-07-13 [1] CRAN (R 3.6.0)
#>  processx               3.4.0    2019-07-03 [1] CRAN (R 3.6.0)
#>  ps                     1.3.0    2018-12-21 [1] CRAN (R 3.6.0)
#>  purrr                  0.3.2    2019-03-15 [1] CRAN (R 3.6.0)
#>  quadprog               1.5-7    2019-05-06 [1] CRAN (R 3.6.0)
#>  quanteda             * 1.5.0    2019-07-04 [1] CRAN (R 3.6.0)
#>  quanteda.classifiers * 0.1      2019-06-15 [1] Github (quanteda/quanteda.classifiers@6bf814c)
#>  quanteda.corpora     * 0.87     2019-04-28 [1] Github (quanteda/quanteda.corpora@5933cc8)
#>  R6                     2.4.0    2019-02-14 [1] CRAN (R 3.6.0)
#>  Rcpp                   1.0.1    2019-03-17 [1] CRAN (R 3.6.0)
#>  RcppParallel           4.4.3    2019-05-22 [1] CRAN (R 3.6.0)
#>  remotes                2.1.0    2019-06-24 [1] CRAN (R 3.6.0)
#>  reshape2               1.4.3    2017-12-11 [1] CRAN (R 3.6.0)
#>  reticulate             1.12     2019-04-12 [1] CRAN (R 3.6.0)
#>  rlang                  0.4.0    2019-06-25 [1] CRAN (R 3.6.0)
#>  rmarkdown              1.13.1   2019-05-27 [1] Github (rstudio/rmarkdown@5409172)
#>  rprojroot              1.3-2    2018-01-03 [1] CRAN (R 3.6.0)
#>  RSSL                   0.8      2019-03-08 [1] CRAN (R 3.6.0)
#>  scales                 1.0.0    2018-08-09 [1] CRAN (R 3.6.0)
#>  sessioninfo            1.1.1    2018-11-05 [1] CRAN (R 3.6.0)
#>  spacyr                 1.2      2019-07-04 [1] CRAN (R 3.6.0)
#>  SparseM                1.77     2017-04-23 [1] CRAN (R 3.6.0)
#>  stopwords              0.9.0    2017-12-14 [1] CRAN (R 3.6.0)
#>  stringi                1.4.3    2019-03-12 [1] CRAN (R 3.6.0)
#>  stringr                1.4.0    2019-02-10 [1] CRAN (R 3.6.0)
#>  tensorflow             1.13.1   2019-04-05 [1] CRAN (R 3.6.0)
#>  testthat               2.1.1    2019-04-23 [1] CRAN (R 3.6.0)
#>  tfruns                 1.4      2018-08-25 [1] CRAN (R 3.6.0)
#>  tibble                 2.1.3    2019-06-06 [1] CRAN (R 3.6.0)
#>  tidyselect             0.2.5    2018-10-11 [1] CRAN (R 3.6.0)
#>  usethis                1.5.1    2019-07-04 [1] CRAN (R 3.6.0)
#>  whisker                0.3-2    2013-04-28 [1] CRAN (R 3.6.0)
#>  withr                  2.1.2    2018-03-15 [1] CRAN (R 3.6.0)
#>  xfun                   0.8      2019-06-25 [1] CRAN (R 3.6.0)
#>  yaml                   2.2.0    2018-07-25 [1] CRAN (R 3.6.0)
#>  zeallot                0.1.0    2018-01-28 [1] CRAN (R 3.6.0)
#>
#> [1] /home/johannes/R/x86_64-pc-linux-gnu-library/3.6
#> [2] /usr/local/lib/R/site-library
#> [3] /usr/lib/R/site-library
#> [4] /usr/lib/R/library
```

msaeltzer commented 5 years ago

Also 2.2.4. I keep getting the error that the object can't be converted to a Python object. The same happens with my own data, which I generated with quanteda. Can Keras interpret quanteda dfms?

stefan-mueller commented 5 years ago

@msaeltzer, could you please paste the output of sessionInfo() into a separate comment after you have loaded quanteda.classifiers? The example in ?textmodel_nnseq works fine on @pchest's and my machine too.

msaeltzer commented 5 years ago

> sessionInfo()
R version 3.5.1 (2018-07-02)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows >= 8 x64 (build 9200)

Matrix products: default

locale:
[1] LC_COLLATE=German_Germany.1252  LC_CTYPE=German_Germany.1252   
[3] LC_MONETARY=German_Germany.1252 LC_NUMERIC=C                   
[5] LC_TIME=German_Germany.1252    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] quanteda.classifiers_0.1 quanteda_1.5.0           keras_2.2.4.1           

loaded via a namespace (and not attached):
 [1] reticulate_1.12.0-9003 tidyselect_0.2.5       purrr_0.3.2           
 [4] reshape2_1.4.3         kernlab_0.9-27         lattice_0.20-35       
 [7] colorspace_1.4-1       generics_0.0.2         LiblineaR_2.10-8      
[10] yaml_2.2.0             base64enc_0.1-3        rlang_0.4.0           
[13] pillar_1.4.2           glue_1.3.1             semver_0.2.0          
[16] plyr_1.8.4             tensorflow_1.13.1.9000 stringr_1.4.0         
[19] munsell_0.5.0          binman_0.1.1           gtable_0.3.0          
[22] SparseM_1.77           wdman_0.2.4            tfruns_1.4            
[25] Rcpp_1.0.1             spacyr_1.2             scales_1.0.0          
[28] RcppParallel_4.4.3     jsonlite_1.6           fastmatch_1.1-0       
[31] stopwords_0.9.0        ggplot2_3.2.0          stringi_1.4.3         
[34] dplyr_0.8.3            grid_3.5.1             quadprog_1.5-7        
[37] tools_3.5.1            magrittr_1.5           lazyeval_0.2.2        
[40] tibble_2.1.3           cluster_2.0.7-1        crayon_1.3.4          
[43] whisker_0.3-2          pkgconfig_2.0.2        zeallot_0.1.0         
[46] MASS_7.3-50            Matrix_1.2-14          data.table_1.12.2     
[49] RSSL_0.8               lubridate_1.7.4        assertthat_0.2.1      
[52] rstudioapi_0.10        R6_2.4.0               compiler_3.5.1

msaeltzer commented 5 years ago

I have been using keras 2.2.4.1 since yesterday. I updated it for a different reason, but the example still does not work with this version.

Best regards, Marius

Error message:

> tmod <- textmodel_nnseq(dfmat, y = docvars(dfmat, "crowd_immigration_label"),
+                         epochs = 5, verbose = 1)
Document-feature matrix of: 7,322 documents, 14,555 features (99.9% sparse).
Error in py_call_impl(callable, dots$args, dots$keywords) : 
  Evaluation error: Unable to convert R object to Python type.

The error traces back to the line I highlighted. I think it could be the reticulate version I am using.
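To compare setups, it may help to print the versions of the packages involved in the R-to-Python bridge side by side. The two session listings in this thread differ in several places (reticulate 1.12 vs. 1.12.0-9003, tensorflow 1.13.1 vs. 1.13.1.9000, R 3.6.0 vs. 3.5.1), although the thread does not establish which difference, if any, causes the error. The helper name `installed_version` below is made up for this sketch; it relies only on base R and the utils package.

```r
# Packages relevant to the keras backend used by textmodel_nnseq().
pkgs <- c("keras", "reticulate", "tensorflow",
          "quanteda", "quanteda.classifiers")

# Return the installed version as a string, or NA if the package is absent.
# system.file() is used so that no package namespace gets loaded.
installed_version <- function(pkg) {
  if (nzchar(system.file(package = pkg))) {
    as.character(utils::packageVersion(pkg))
  } else {
    NA_character_
  }
}

versions <- vapply(pkgs, installed_version, character(1))
print(versions)
```

Posting this output together with `reticulate::py_config()` gives a compact basis for comparing a failing machine against a working one.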

msaeltzer commented 5 years ago

Also:

reticulate::py_config()

python:         C:\Users\admin\Anaconda3\envs\r-tensorflow\python.exe
libpython:      C:/Users/admin/Anaconda3/envs/r-tensorflow/python36.dll
pythonhome:     C:\Users\admin\ANACON~1\envs\R-TENS~1
version:        3.6.8 |Anaconda, Inc.| (default, Feb 21 2019, 18:30:04) [MSC v.1916 64 bit (AMD64)]
Architecture:   64bit
numpy:          C:\Users\admin\ANACON~1\envs\R-TENS~1\lib\site-packages\numpy
numpy_version:  1.16.4
tensorflow:     C:\Users\admin\ANACON~1\envs\R-TENS~1\lib\site-packages\tensorflow\__init__.p

python versions found: 
 C:\Users\admin\Anaconda3\envs\r-tensorflow\python.exe
 C:\Users\admin\AppData\Local\Programs\Python\Python37\\python.exe
 C:\Users\admin\ANACON~1\python.exe
 C:\Users\admin\Anaconda3\python.exe

kbenoit commented 4 years ago

@msaeltzer We've finally returned to this package, and it's now working with version 2.0 of TensorFlow. I'm going to close this issue, but if you experience the problem again, please let us know.