quanteda / quanteda.textplots

Plotting and visualisation for quanteda
GNU General Public License v3.0

combine with textplot #11

Closed jwijffels closed 3 years ago

jwijffels commented 3 years ago

Now that you are factoring quanteda out into different R packages to modularize it, why not consider putting all plotting functionality in one package? I've put the package textplot on CRAN: https://github.com/bnosac/textplot & https://CRAN.R-project.org/package=textplot, with example visualisations at https://cran.r-project.org/web/packages/textplot/vignettes/textplot-examples.pdf. The main reason for it was to avoid having the dependency chain in other, more core R packages which I tend to use (https://cran.r-project.org/web/packages/udpipe/vignettes/udpipe-universe.html).

The package philosophy is to put in Imports only base packages or recommended packages, plus data.table. All other packages go in Suggests. That way people can choose whichever data preparation functionality they want to use (whether base / tidyverse / quanteda / text2vec / tm / udpipe / qdap / tidytext / torch / ...), and if they need a specific plotting function, they will have to load the packages it requires. I'm not sure whether this is an approach quanteda is looking into, given that it requires users to understand that they must also install other packages which are not installed automatically with the plotting package. It sure is a lot of work given the dependency chain you have in the Imports part of the DESCRIPTION, but maybe you are open to this idea.
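An Imports/Suggests split like this is usually paired with a run-time check inside each plotting function. Below is a minimal sketch, not code from textplot or quanteda: `require_suggested()` and `textplot_example()` are hypothetical names, and the column names `term` and `freq` are assumed. Only `requireNamespace()` is base R.

```r
# Hypothetical helper: stop with an actionable message if a package that is
# only listed under Suggests is not installed.
require_suggested <- function(pkg, caller) {
  if (!requireNamespace(pkg, quietly = TRUE)) {
    stop(sprintf("%s() needs the '%s' package; install it with install.packages(\"%s\")",
                 caller, pkg, pkg), call. = FALSE)
  }
  invisible(TRUE)
}

# Hypothetical plotting function: ggplot2 is only needed once this is called,
# so it can live in Suggests rather than Imports.
# Assumes x is a data.frame with columns `term` and `freq`.
textplot_example <- function(x) {
  require_suggested("ggplot2", "textplot_example")
  ggplot2::ggplot(x, ggplot2::aes(x = term, y = freq)) +
    ggplot2::geom_col()
}
```

With this pattern, installing the plotting package pulls in nothing heavy. The cost is that users only learn about a missing package when they first call the function that needs it.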

kbenoit commented 3 years ago

That's a great idea and part of the motivation for modularisation. We have a lot of very efficient methods in quanteda.textmodels as well, for instance, that work with quanteda dfm objects but would also work with simpler classes of sparse Matrix objects. Once the package modularisation is complete, we could further modularise those and the plotting methods to be generics that work with plain objects as well as having methods for quanteda classes of objects. Right now, our quanteda.* packages require quanteda as an Import, but that does not mean they could not also Import the core machinery defined as generics (and being more "generic" in the general meaning of the term) from a separate package that makes this functionality available generally.

In fact part of this was motivated by seeing your textplot package and thinking exactly along the lines of what you suggest here. So I'd say that trying this first for the plotting functions ought to be first on the list.

jwijffels commented 3 years ago

The drawback of this last approach is that if I want to use, say, quanteda.textplots:::textplot_xyz without the dependency chain of quanteda, that is not possible. And quanteda already has a dependency chain of 81 packages.

tools::package_dependencies("quanteda", recursive = TRUE, which = c("Depends", "Imports"))
$quanteda
 [1] "methods"        "data.table"     "extrafont"      "fastmatch"      "ggplot2"        "ggrepel"       
 [7] "jsonlite"       "magrittr"       "Matrix"         "network"        "Rcpp"           "RcppParallel"  
[13] "sna"            "SnowballC"      "stopwords"      "stringi"        "xml2"           "yaml"          
[19] "proxyC"         "digest"         "utils"          "extrafontdb"    "grDevices"      "Rttf2pt1"      
[25] "glue"           "grid"           "gtable"         "isoband"        "MASS"           "mgcv"          
[31] "rlang"          "scales"         "stats"          "tibble"         "withr"          "graphics"      
[37] "lattice"        "statnet.common" "ISOcodes"       "usethis"        "desc"           "tools"         
[43] "assertthat"     "R6"             "crayon"         "rprojroot"      "nlme"           "splines"       
[49] "farver"         "labeling"       "lifecycle"      "munsell"        "RColorBrewer"   "viridisLite"   
[55] "coda"           "parallel"       "rle"            "cli"            "ellipsis"       "fansi"         
[61] "pillar"         "pkgconfig"      "vctrs"          "clipr"          "curl"           "fs"            
[67] "gh"             "git2r"          "purrr"          "rematch2"       "rstudioapi"     "whisker"       
[73] "gitcreds"       "httr"           "ini"            "colorspace"     "utf8"           "mime"          
[79] "openssl"        "askpass"        "sys"

My point is that a lot of plotting functionality could be reduced to a core, and this core could be put into one overall main text plotting package which depends only on plotting packages. The data preparation needed to feed that core plotter can then be chosen freely (whether base / tidyverse / quanteda / text2vec / tm / udpipe / qdap / tidytext / torch / ...) and is left to the user.

jwijffels commented 3 years ago

A typical example could be the keyness function in quanteda.textstats now. This computes e.g. a chi-square statistic and orders the words. Fine. I think the core textplot package should contain something like textplot_xyz, taking as input word / information metric of importance / group. quanteda.textplots can then just provide textplot_xyz.dfm, textplot_xyz.tokens or textplot_xyz.textstatabc, while another package can do exactly the same and define textplot_xyz.word2vecabc or textplot_xyz.ihaveanotherstatistic, all still using the same plotting backend. Or they could even decide to put these functions inside the textplot package.
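As a rough sketch of that layering with plain S3 dispatch: everything here is hypothetical, including the generic `textplot_xyz`, the class `textstatabc`, and the column names `term`, `score`, `group`, `feature` and `chi2`. None of it is taken from textplot or quanteda.textstats.

```r
# Core generic, as it might live in a shared plotting package (hypothetical).
textplot_xyz <- function(x, ...) UseMethod("textplot_xyz")

# Backend method on plain data: one row per word, with an importance score
# and a group label. Column names are assumptions for this sketch.
textplot_xyz.data.frame <- function(x, n = 10, plot = TRUE, ...) {
  stopifnot(all(c("term", "score", "group") %in% names(x)))
  x <- x[order(-abs(x$score)), , drop = FALSE]  # strongest scores first
  x <- utils::head(x, n)
  if (plot) {
    graphics::barplot(rev(x$score), names.arg = rev(x$term), horiz = TRUE,
                      las = 1, col = as.integer(factor(rev(x$group))) + 1L)
  }
  invisible(x)
}

# Method another package could add for its own result class (hypothetical
# class "textstatabc"): reshape, then hand over to the shared backend.
textplot_xyz.textstatabc <- function(x, ...) {
  df <- data.frame(term = x$feature, score = x$chi2,
                   group = ifelse(x$chi2 >= 0, "target", "reference"),
                   stringsAsFactors = FALSE)
  textplot_xyz(df, ...)
}

# Toy keyness-like result, made up for illustration.
stat <- structure(list(feature = c("tax", "war", "jobs"),
                       chi2 = c(12.1, -8.4, 3.3)),
                  class = "textstatabc")
top <- textplot_xyz(stat, n = 2, plot = FALSE)
```

The downstream method does data preparation only. All drawing stays in the `data.frame` method, so a second package with a different statistic would reuse it unchanged.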

kbenoit commented 3 years ago

Well the modularisation is a step to paring down that dependency chain...

What I meant was that we would have the following (where x <--- y means y Imports x):

quanteda  <---  quanteda.textplots   
                quanteda.textplots  --->  textplot

                 <other text pkg>   --->  textplot

where quanteda.textplots Imports textplot and defines methods for dfm, fcm, etc. classes of (quanteda) objects for e.g. textplot_cooccurrence(). That would even work with the existing textplot, since you defined the functions as generics, so in this sense it would require no changes to textplot to implement your second point above. We'd just put a wrapper around it to make it work nicely with the quanteda native objects.

So textplot would add no new dependencies, while quanteda.textplots would continue to have quanteda as a dependency. Anyone wishing to add functionality for text plotting to their own framework package would follow the quanteda.textplots model and Import textplot for the plotting machinery, but define their interface via their own package.

This still makes complete sense from the "modularizing quanteda" standpoint since the goal there is to pare quanteda itself down to a core set of textual data handling functions that can be used without the modelling functions or the plotting functions.

jwijffels commented 3 years ago

Yes, that completely makes sense from a modularizing quanteda perspective. In that case nothing needs to change indeed at the textplot package.

kbenoit commented 3 years ago

It would suggest collaboration on textplot for adding and updating functionality but I assume a willingness for that motivated this issue to begin with. :smile:

We use an approach across quanteda where we define every function as a generic, even if it has only one method -- same as in textplot -- but define the .default as producing a friendly error message explaining the object classes that are legal inputs. I noticed that your .default methods handle the expected input object, rather than, say, defining a textplot_cooccurrence.data.frame(). No biggie though, as long as the methods we extend will have an additional class extending data.frame, if that's what the object is built on.
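A minimal sketch of that convention: `plot_cooc()` and the class `my_cooc` are hypothetical stand-ins, the column names `term1`, `term2` and `cooc` are assumed, and this is not the actual code of quanteda or of textplot's textplot_cooccurrence().

```r
# Hypothetical generic standing in for a plotting function.
plot_cooc <- function(x, ...) UseMethod("plot_cooc")

# The .default method does no work; it only reports which inputs are legal.
plot_cooc.default <- function(x, ...) {
  stop("plot_cooc() only works on data.frame objects (or classes extending ",
       "data.frame), not on objects of class ",
       paste(class(x), collapse = "/"), call. = FALSE)
}

# The expected input gets its own explicit method.
plot_cooc.data.frame <- function(x, ...) {
  stopifnot(all(c("term1", "term2", "cooc") %in% names(x)))
  x[order(-x$cooc), , drop = FALSE]  # stand-in for the real plotting code
}

# An object with an additional class on top of data.frame still reaches the
# data.frame method through S3 inheritance.
cooc <- data.frame(term1 = c("a", "b"), term2 = c("b", "c"),
                   cooc = c(2, 5), stringsAsFactors = FALSE)
class(cooc) <- c("my_cooc", class(cooc))
res <- plot_cooc(cooc)

# A wrong input type yields the friendly message instead of an obscure error.
msg <- tryCatch(plot_cooc("some text"), error = function(e) conditionMessage(e))
```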

kbenoit commented 3 years ago

> A typical example could be the keyness function in quanteda.textstats now. This computes e.g. a chi-square statistic and orders the words. Fine. I think the core textplot package should contain something like textplot_xyz, taking as input word / information metric of importance / group. quanteda.textplots can then just provide textplot_xyz.dfm, textplot_xyz.tokens or textplot_xyz.textstatabc, while another package can do exactly the same and define textplot_xyz.word2vecabc or textplot_xyz.ihaveanotherstatistic, all still using the same plotting backend. Or they could even decide to put these functions inside the textplot package.

exactly

jwijffels commented 3 years ago

> It would suggest collaboration on textplot for adding and updating functionality but I assume a willingness for that motivated this issue to begin with.

Yes, don't repeat yourself while plotting seems to be the first step instead of creating different NLP plotting backends again :). And nudging everyone in that direction was the intent. And seeing some level of agreement on this was the hope.

kbenoit commented 3 years ago

We'll aim to complete the quanteda modularisation to keep the functionality as is, but separated. There are a lot of revdeps and other issues we have to deal with transitionally for CRAN compatibility, and we want to avoid breaking too much user code out there. Then we can happily start working on textplot integration and collaboration. It's not just DRY: more inputs also tend to produce better products on focused efforts like this.

jwijffels commented 3 years ago

Ok. Let me close this. I'll open an issue in the textplot pkg to give a friendly error msg in case of wrong data input.