Closed: @bnicenboim closed this issue 1 year ago.
Thanks @bnicenboim for your pre-submission. I'll come back to you ASAP.
Thanks also for explaining your use of <<-.
Dear @bnicenboim,
It's my first rotation as EiC and I'm still learning the nuances of assessing the eligibility of each submission. I need to show the other editors evidence of how pangoling meets our criteria for fit and overlap. Can you please help me by answering the following questions as succinctly and precisely as possible? I numbered the actionable items to help you respond to them specifically.
scientific software wrappers: Packages that wrap non-R utility programs used for scientific research.
What research field is pangoling specific to? Is it psycho-/neurolinguistics?
Can you please expand on what value pangoling adds? That is, above a simple system() call or bindings, whether in parsing inputs and outputs, data handling, etc. Improved installation process, or extension of compatibility to more platforms, may constitute added value if installation is complex.
The 'pangoling' package states that the overlapping package 'text' is more general. Can you please argue how, despite this, pangoling still meets rOpenSci's guidelines for fit and overlap?
Can you please explain whether 'pangoling' duplicates functionality in 'text' or another R package, and if it does, how 'pangoling' represents a significant improvement (see our guide for details on what we mean by 'significant improvement')?
Thanks for your patience :-)
Ok, sure, I'll answer inline.
> Dear @bnicenboim,
> It's my first rotation as EiC and I'm still learning the nuances of assessing the eligibility of each submission. I need to show the other editors evidence of how pangoling meets our criteria for fit and overlap. Can you please help me by answering the following questions as succinctly and precisely as possible? I numbered the actionable items to help you respond to them specifically.
> Package categories
> scientific software wrappers: Packages that wrap non-R utility programs used for scientific research.
> * [x] ml01. Must be specific to research fields, not general computing utilities.
> What research field is pangoling specific to? Is it psycho-/neurolinguistics?
Yes. It's common in psycho-/neurolinguistics to use word predictability as a predictor in models, and pangoling extracts predictability from transformer models.
> * [ ] ml02. Must be non-trivial.
> Can you please expand on what value pangoling adds? That is, above a simple system() call or bindings, whether in parsing inputs and outputs, data handling, etc. Improved installation process, or extension of compatibility to more platforms, may constitute added value if installation is complex.
Transformer models are "meant" to be used for computational-linguistic tasks. For example, GPT-like models produce a (random) continuation given a context. That's trivial to get, since there is a shortcut call, pipeline(), in Python that does exactly that. The thing is that one can also get the probability of each word in a given text without generating anything; that's less trivial to obtain, but it's very useful in psycho-/neurolinguistics. It's less trivial because one needs to know how to set up the language model; then one obtains a huge tensor (which is not trivial to manipulate for most R users); and finally one needs to take care of the mapping between words and phrases (the important units in linguistics) and the tokens (which are how the model encodes the words), and this correspondence might be one-to-one or one-to-many. For BERT-like models the challenges are similar. Crucially, one needs to understand how these large transformer language models work. The package has two contributors, and that was their role: explaining to me how these models work so that I could figure out which Python functions I needed :)
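To illustrate the word-to-token mapping issue, here is a minimal plain-Python sketch (not pangoling's actual code): the tokenization and the per-token log-probabilities are made up, standing in for what a real model would return, and a word's log-probability is the sum over the subword tokens it maps to.

```python
# Made-up subword tokenization with made-up per-token log-probabilities;
# a real model would return these inside a large tensor.
tokens = ["The", "un", "believ", "able", "story"]
token_logprobs = [-1.2, -3.0, -0.5, -0.125, -2.5]

# Mapping from words back to token indices: one-to-one or one-to-many.
word_to_tokens = {"The": [0], "unbelievable": [1, 2, 3], "story": [4]}

def word_logprob(word):
    """Sum the log-probabilities of the subword tokens that form `word`."""
    return sum(token_logprobs[i] for i in word_to_tokens[word])

print(word_logprob("unbelievable"))  # -3.625
```

The one-to-many case ("unbelievable" spanning three tokens) is exactly where a naive user can go wrong if they read off per-token values directly.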
Also, the point of using memoise is to keep the package completely function-based rather than object-oriented (like R6), which is more confusing for most basic R users. The memoised function just remembers which type of language model was last used.
I hope it's clearer, but feel free to ask!
> Other scope considerations
> * [ ] ml03. Should be general in the sense that they should solve a problem as broadly as possible while maintaining a coherent user interface and code base. For instance, if several data sources use an identical API, we prefer a package that provides access to all the data sources, rather than just one.
> The 'pangoling' package states that the overlapping package 'text' is more general. Can you please argue how, despite this, pangoling still meets rOpenSci's guidelines for fit and overlap?
text brings the transformers Python package to R and adds some machine-learning functionality. I would say that here, too, the overlap is that text is more general, and it doesn't allow generating pangoling's output in a straightforward way. In fact, I'm not sure it's even possible, since text seems more limited than transformers.
I would say that the users of pangoling would be mostly psycho- and neurolinguists, while the users of text are computational linguists.
> Package overlap
> * [ ] ml04. Avoids duplication of functionality of existing R packages in any repo without significant improvements.
> Can you please explain whether 'pangoling' duplicates functionality in 'text' or another R package, and if it does, how 'pangoling' represents a significant improvement (see our guide for details on what we mean by 'significant improvement')?
I think I answered this in the previous point. I'm not even sure that you can get the output of pangoling using only text. I think the overlap is that they are both wrappers of transformers.
> * [ ] ml05. Also, as it becomes increasingly easy to call Python packages from R, can you please explain how straightforward it would be to access the underlying Python functionality without 'pangoling'?
I'm not sure I understand this. One would need to set up the models in Python, then extract the tensors and manipulate them. Finally, one needs to take care of the mapping between words and tokens, although Python is not needed for that last step.
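The "manipulate the tensors" step can be sketched in plain Python (a toy four-token vocabulary with made-up logits; a real model returns a logits tensor of shape sequence-length by vocabulary-size, typically with tens of thousands of columns):

```python
import math

def log_softmax(logits):
    # Numerically stable conversion of one row of raw logits
    # into log-probabilities over the vocabulary.
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]

# Toy logits for one position over a 4-token vocabulary.
logits = [2.0, 1.0, 0.5, -1.0]
logprobs = log_softmax(logits)

# The log-probability the model assigns to, say, token id 2:
print(logprobs[2])
```

Doing this by hand for every position, and then aligning the results with the words of interest, is the part that pangoling wraps for the user.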
Thanks for your patience :-)
OK, there was a lot of overlap in my answers, so feel free to ask me more specific questions if something is not clear.
@bnicenboim, I'm still discussing the scope with other editors.
Ok, sure no problem.
- ml06. Did you consider submitting pangoling as a stats package (https://stats-devguide.ropensci.org/)? If so, what convinced you to submit it as a general package?
Sorry, why as a stats package? It doesn't do any statistics. I guess it's an NLP package, if I'm forced to put it in a category.
Thanks.
I ask because the category "text analysis" in the standard-package guide states:
> Machine-learning and packages implementing NLP analysis algorithms should be submitted under statistical software peer review.
Knowing that you at least considered it, I can now be sure the standard-package review is your informed decision.
Thanks for the clarification, I found this under statistical software:
And the package doesn't fall into any of those categories; it's not doing machine learning either. So I think it's fine under general.
@bnicenboim
I now have enough opinions from the editorial team to consider this package in scope. Please go ahead with a full submission.
Thanks for your patience.
Thanks. Should I do something about "✖ Package uses global assignment operator ('<<-')."? If I add pkgcheck in GitHub Actions, it will just fail.
Please use the same justification you wrote here.
@bnicenboim @maurolepore I've just updated pkgcheck via the issue linked above to allow <<- in an .onLoad function for use of memoise. The {pangoling} package still fails because it also has two .onLoad entries for <<- reticulate::import. This is also recommended by the reticulate package. Although not yet permitted in pkgcheck, this use for reticulate imports will also be permitted soon, and the pangoling package will then pass all tests. In the meantime, @bnicenboim, please simply add an explanatory note, and link to this comment or the pkgcheck issue as you see fit. Thanks.
Closing because there is now a full submission at #575
Submitting Author Name: Bruno Nicenboim
Submitting Author Github Handle: @bnicenboim
Repository: https://github.com/bnicenboim/pangoling
Submission type: Pre-submission
Language: en
Scope
Please indicate which category or categories from our package fit policies or statistical package categories this package falls under. (Please check an appropriate box below):
Data Lifecycle Packages
[ ] data retrieval
[ ] data extraction
[ ] data munging
[ ] data deposition
[ ] data validation and testing
[ ] workflow automation
[ ] version control
[ ] citation management and bibliometrics
[x] scientific software wrappers
[ ] field and lab reproducibility tools
[ ] database software bindings
[ ] geospatial data
[x] text analysis
Statistical Packages
[ ] Bayesian and Monte Carlo Routines
[ ] Dimensionality Reduction, Clustering, and Unsupervised Learning
[ ] Machine Learning
[ ] Regression and Supervised Learning
[ ] Exploratory Data Analysis (EDA) and Summary Statistics
[ ] Spatial Analyses
[ ] Time Series Analyses
Explain how and why the package falls under these categories (briefly, 1-2 sentences). Please note any areas you are unsure of:
The package is a wrapper around the transformers Python package, and it can tokenize text, get word predictability, and calculate perplexity, which is text analysis.
This is mostly for psycho-/neurolinguists who use word predictability as a predictor in their research, such as in ERP and reading studies.
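For illustration, the perplexity mentioned above can be computed from per-token log-probabilities in a few lines of plain Python (the log-probabilities here are made up, standing in for real model output):

```python
import math

def perplexity(token_logprobs):
    # Perplexity is the exponential of the negative mean
    # per-token log-probability of the text.
    return math.exp(-sum(token_logprobs) / len(token_logprobs))

# Made-up log-probabilities for a 4-token sentence:
print(perplexity([-1.0, -2.0, -0.5, -1.5]))  # exp(1.25), about 3.49
```

Lower perplexity means the model found the text more predictable.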
Another R package that acts as a wrapper for transformers is text. However, text is more general, and its focus is on Natural Language Processing and Machine Learning. pangoling is much more specific, and its focus is on measures used as predictors in analyses of data from experiments, rather than on NLP.
(If applicable) Does your package comply with our guidance around Ethics, Data Privacy and Human Subjects Research?
Any other questions or issues we should be aware of?:
Yes, the output of pkgcheck fails only because of the use of <<-. But this is done in order to use memoise, as recommended in its documentation. The <<- in the package appears inside .onLoad <- function(libname, pkgname) { ... }.
The pkgcheck output is the following:
── pangoling 0.0.0.9002 ────────────────────────────
✔ Package name is available
✔ has a 'codemeta.json' file.
✔ has a 'contributing' file.
✔ uses 'roxygen2'.
✔ 'DESCRIPTION' has a URL field.
✔ 'DESCRIPTION' has a BugReports field.
✔ Package has at least one HTML vignette
✔ All functions have examples.
✖ Package uses global assignment operator ('<<-').
✔ Package has continuous integration checks.
✔ Package coverage is 94.4%.
✔ R CMD check found no errors.
✔ R CMD check found no warnings.
ℹ Current status: ✖ This package is not ready to be submitted.