ropensci / software-review

rOpenSci Software Peer Review.

presubmission inquiry for pangoling: Access to word predictability using large language (transformer) models. #573

Closed bnicenboim closed 1 year ago

bnicenboim commented 1 year ago

Submitting Author Name: Bruno Nicenboim
Submitting Author Github Handle: @bnicenboim
Repository: https://github.com/bnicenboim/pangoling
Submission type: Pre-submission
Language: en


Package: pangoling
Type: Package
Title: Access to Large Language Model Predictions
Version: 0.0.0.9002
Authors@R: c(
    person("Bruno", "Nicenboim",
    email = "bruno.nicenboim@gmail.com",
    role = c( "aut","cre"),
    comment = c(ORCID = "0000-0002-5176-3943")),
    person("Chris", "Emmerly", role = "ctb"),
    person("Giovanni", "Cassani", role = "ctb"))
Description: Access to word predictability using large language (transformer) models.
URL: https://bruno.nicenboim.me/pangoling, https://github.com/bnicenboim/pangoling
BugReports: https://github.com/bnicenboim/pangoling/issues
License: MIT + file LICENSE
Encoding: UTF-8
LazyData: false
Config/reticulate:
  list(
    packages = list(
      list(package = "torch"),
      list(package = "transformers")
    )
  )
Imports: 
    data.table,
    memoise,
    reticulate,
    tidyselect,
    tidytable (>= 0.7.2),
    utils,
    cachem
Suggests: 
    rmarkdown,
    knitr,
    testthat (>= 3.0.0),
    tictoc,
    covr,
    spelling
Config/testthat/edition: 3
RoxygenNote: 7.2.3
Roxygen: list(markdown = TRUE)
Depends: 
    R (>= 2.10)
VignetteBuilder: knitr
StagedInstall: yes
Language: en-US

Scope

The package is a wrapper around the transformers Python package; it can tokenize text, extract word predictability, and calculate perplexity, which falls under text analysis.

NA

This is mostly for psycho-/neuro-/linguists who use word predictability as a predictor in their research, for example in ERP and reading studies.

Another R package that acts as a wrapper for transformers is text. However, text is more general, and its focus is on Natural Language Processing and Machine Learning. pangoling is much more specific: its focus is on measures used as predictors in analyses of data from experiments, rather than on NLP.

Yes, the output of pkgcheck fails only because of the use of <<-. But this is done in order to use memoise, as recommended on its page. The <<- in the package appears only inside .onLoad:

.onLoad <- function(libname, pkgname) {
  # caching: memoise the loaders so that repeated calls with the same
  # arguments are served from an in-memory cache
  tokenizer <<- memoise::memoise(tokenizer)
  lang_model <<- memoise::memoise(lang_model)
  transformer_vocab <<- memoise::memoise(transformer_vocab)
}

The pkgcheck output is the following:

── pangoling 0.0.0.9002 ────────────────────────────

✔ Package name is available
✔ has a 'codemeta.json' file.
✔ has a 'contributing' file.
✔ uses 'roxygen2'.
✔ 'DESCRIPTION' has a URL field.
✔ 'DESCRIPTION' has a BugReports field.
✔ Package has at least one HTML vignette
✔ All functions have examples.
✖ Package uses global assignment operator ('<<-').
✔ Package has continuous integration checks.
✔ Package coverage is 94.4%.
✔ R CMD check found no errors.
✔ R CMD check found no warnings.

ℹ Current status: ✖ This package is not ready to be submitted.

maurolepore commented 1 year ago

Thanks @bnicenboim for your pre-submission. I'll come back to you ASAP. Thanks also for explaining your use of <<-.

maurolepore commented 1 year ago

Dear @bnicenboim,

It's my first rotation as EiC and I'm still learning the nuances of assessing the eligibility of each submission. I need to show other editors evidence of how pangoling meets our criteria for fit and overlap. Can you please help me by answering the following questions as succinctly and precisely as possible? I numbered the actionable items to help you respond to them specifically.

Package categories

scientific software wrappers: Packages that wrap non-R utility programs used for scientific research.

What research field is pangoling specific to? Is it psycho-/neuro-/linguists?

Can you please expand on what value pangoling adds? That is, beyond a simple system() call or bindings, whether in parsing inputs and outputs, data handling, etc. An improved installation process, or extension of compatibility to more platforms, may constitute added value if installation is complex.

Other scope considerations

The 'pangoling' package states that the overlapping package 'text' is more general. Can you please argue how, despite this, pangoling still meets rOpenSci's guidelines for fit and overlap?

Package overlap

Can you please explain whether 'pangoling' duplicates functionality in 'text' or another R package, and if it does, how 'pangoling' represents a significant improvement (see our guide for details on what we mean by 'significant improvement')?

Thanks for your patience :-)

bnicenboim commented 1 year ago

Ok, sure, I'll answer inline.

Dear @bnicenboim,

It's my first rotation as EiC and I'm still learning the nuances of assessing the eligibility of each submission. I need to show other editors evidence of how pangoling meets our criteria for fit and overlap. Can you please help me by answering the following questions as succinctly and precisely as possible? I numbered the actionable items to help you respond to them specifically.

Package categories

scientific software wrappers: Packages that wrap non-R utility programs used for scientific research.

* [x]  ml01. Must be specific to research fields, not general computing utilities.

What research field is pangoling specific to? Is it psycho-/neuro-/linguists?

Yes, it's common to use word predictability as a predictor in models, and pangoling extracts predictability from transformer models.

* [ ]  ml02. Must be non-trivial.

Can you please expand on what value pangoling adds? That is, beyond a simple system() call or bindings, whether in parsing inputs and outputs, data handling, etc. An improved installation process, or extension of compatibility to more platforms, may constitute added value if installation is complex.

Transformer models are "meant" to be used for computational linguistic tasks. For example, GPT-like models produce a (random) continuation given a context. That's trivial to get, since there is a shortcut called pipeline() in Python that does exactly that. The thing is that one can also get the probability of each word in a given text without generating anything; that is less trivial to obtain, but it's very useful in (psycho/neuro) linguistics. It's less trivial because one needs to know how to set up the language model, then one obtains a huge tensor (which is not trivial to manipulate for most R users), and finally one needs to take care of the mapping between the words and phrases (the important thing in linguistics) and the tokens (which is how the model encodes the words); this correspondence might be one to one or one to many. For BERT-like models the challenges are similar. Crucially, one needs to understand how these large language transformer models work. The package has two contributors, and that was their role: explaining to me how these models work so that I could figure out which Python functions I needed :)
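
To give an idea of the route pangoling hides from the user, here is a rough sketch via reticulate; it assumes a working Python environment with torch and transformers installed, uses "gpt2" purely as an example model, and leaves out most of the real work:

library(reticulate)
transformers <- import("transformers")
torch <- import("torch")

tkzr  <- transformers$AutoTokenizer$from_pretrained("gpt2")
model <- transformers$AutoModelForCausalLM$from_pretrained("gpt2")

enc <- tkzr("The apple doesn't fall far from the tree.", return_tensors = "pt")
out <- model(enc$input_ids)                      # forward pass, no text generation
lp  <- torch$log_softmax(out$logits, dim = -1L)  # tensor of shape [1, n_tokens, vocab_size]

# From here one still has to shift positions by one (each position predicts the
# *next* token), pick out the log-probability of each observed token id, and map
# the subword tokens back to the words shown to participants.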

Also, the point of using memoise is that the package is not object-oriented (like R6 packages, which are more confusing for most basic R users); it is completely function-based. It just remembers the last type of language model that was used.
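
As a toy illustration of what memoise does (not pangoling's actual loaders, just a stand-in function):

slow_square <- function(x) {
  Sys.sleep(2)  # stands in for loading a large language model
  x^2
}
fast_square <- memoise::memoise(slow_square)
fast_square(4)  # first call with this argument: slow, result is cached
fast_square(4)  # second call: returned instantly from the cache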

I hope it's clearer, but feel free to ask!

Other scope considerations

* [ ]  ml03. Should be general in the sense that they should solve a problem as broadly as possible while maintaining a coherent user interface and code base. For instance, if several data sources use an identical API, we prefer a package that provides access to all the data sources, rather than just one.

The 'pangoling' package states that the overlapping package 'text' is more general. Can you please argue how, despite this, pangoling still meets rOpenSci's guidelines for fit and overlap?

text brings the transformers Python package to R, and adds some machine learning functionality. I would say that here too the overlap is that text is more general, and it doesn't allow generating pangoling's output in a straightforward way. In fact, I'm not sure it's even possible, since text seems more limited than transformers. I would say that the users of pangoling would be mostly psycho-/neurolinguists, while the users of text are computational linguists.

Package overlap

* [ ]  ml04. Avoids duplication of functionality of existing R packages in any repo without significant improvements.

Can you please explain whether 'pangoling' duplicates functionality in 'text' or another R package, and if it does, how 'pangoling' represents a significant improvement (see our guide for details on what we mean by 'significant improvement')?

I think I answered this in the previous point. I'm not even sure that you can get the output of pangoling using just text. The overlap is that they are both wrappers of transformers.

* [ ]  ml05. Also, as it becomes increasingly easy to call python packages from R, can you please explain how straightforward it would be to access the underlying python functionality without 'pangoling'?

I'm not sure I understand this. One would need to set up the models in Python, then extract the tensors and manipulate them. Finally, one needs to take care of the mapping between words and tokens, although Python is not needed for that last step.
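
As a toy illustration of that last step (pure R, with hypothetical tokens and log-probabilities): a word's log-probability is just the sum of the log-probabilities of its tokens.

tokens   <- c("The", " apple", " doesn", "'t", " fall")      # hypothetical tokenization
token_lp <- c(-3.2, -7.5, -6.1, -0.2, -4.9)                  # hypothetical token log-probabilities
word_of  <- c("The", "apple", "doesn't", "doesn't", "fall")  # word each token belongs to
tapply(token_lp, factor(word_of, levels = unique(word_of)), sum)
#>     The   apple doesn't    fall
#>    -3.2    -7.5    -6.3    -4.9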

Thanks for your patience :-)

Ok, there was a lot of overlap in my answers, so feel free to ask more specific questions if something is not clear.

maurolepore commented 1 year ago

@bnicenboim, I'm still discussing the scope with other editors.

bnicenboim commented 1 year ago


Ok, sure no problem.

Sorry, why as a stats package? It doesn't do any statistics. I guess it's an NLP package, if I'm forced to put it in a category.


maurolepore commented 1 year ago

Thanks.

I ask because the category "text analysis" in the standard-package guide states:

Machine-learning and packages implementing NLP analysis algorithms should be submitted under statistical software peer review.

Knowing that you at least considered it, I can now be sure the standard-package review is your informed decision.

bnicenboim commented 1 year ago

Thanks for the clarification. I found this under statistical software:

  1. Bayesian and Monte Carlo Routines
  2. Dimensionality Reduction, Clustering, and Unsupervised Learning
  3. Machine Learning
  4. Regression and Supervised Learning
  5. Exploratory Data Analysis (EDA) and Summary Statistics
  6. Spatial Analyses
  7. Time Series Analyses

And the package doesn't fall into any of those; it's not doing machine learning either. So I think it's fine under the general track.


maurolepore commented 1 year ago

@bnicenboim

I now have enough opinions from the editorial team to consider this package in scope. Please go ahead with a full submission.

Thanks for your patience.

bnicenboim commented 1 year ago

Thanks. Should I do something about "Package uses global assignment operator ('<<-')"? If I add pkgcheck to GitHub Actions, it will just fail.

maurolepore commented 1 year ago

Please use the same justification you wrote here.

mpadge commented 1 year ago

@bnicenboim @maurolepore I've just updated pkgcheck via the issue linked above to allow <<- in an .onLoad function for use of memoise. The {pangoling} package still fails because it also has two entries in .onLoad that use <<- for reticulate::import. This is also recommended in the reticulate package. Although not yet permitted in pkgcheck, this use for reticulate imports will also be permitted soon, and the pangoling package will then pass all tests. In the meantime, @bnicenboim, please simply add an explanatory note, with links to this comment or the pkgcheck issue as you see fit. Thanks.
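
For context, the reticulate pattern in question looks roughly like this (a sketch of the recommendation in reticulate's packaging vignette, not necessarily pangoling's exact .onLoad):

# placeholders at the top level of the package namespace
torch <- NULL
transformers <- NULL

.onLoad <- function(libname, pkgname) {
  # delay_load = TRUE: the Python modules are imported only when first used
  torch <<- reticulate::import("torch", delay_load = TRUE)
  transformers <<- reticulate::import("transformers", delay_load = TRUE)
}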

maurolepore commented 1 year ago

Closing because there is now a full submission at #575