
Pre-submission inquiry for {kgrams}: Classical k-gram Language Models #452

Closed: vgherard closed this issue 2 years ago

vgherard commented 3 years ago

Submitting Author: Valerio Gherardi (@vgherard)
Repository: https://github.com/vgherard/kgrams
Submission type: Pre-submission


Package: kgrams
Title: Classical k-gram Language Models
Version: 0.1.0.9000
Authors@R: 
    person(given = "Valerio",
           family = "Gherardi",
           role = c("aut", "cre"),
           email = "vgherard@sissa.it",
           comment = c(ORCID = "0000-0002-8215-3013"))
Description: 
        Tools for training and evaluating k-gram language models in R, 
        supporting several probability smoothing techniques, 
        perplexity computations, random text generation and more.
License: GPL (>= 3)
Encoding: UTF-8
LazyData: true
Roxygen: list(markdown = TRUE)
RoxygenNote: 7.1.1
SystemRequirements: C++11
LinkingTo: 
    Rcpp, RcppProgress
Imports: 
    Rcpp, rlang, methods, utils, RcppProgress (>= 0.1), Rdpack
Depends: 
    R (>= 3.5)
Suggests: 
    testthat (>= 3.0.0),
    covr,
    knitr,
    rmarkdown
Config/testthat/edition: 3
RdMacros: Rdpack
VignetteBuilder: knitr
URL: https://vgherard.github.io/kgrams/,
    https://github.com/vgherard/kgrams
BugReports: https://github.com/vgherard/kgrams/issues

Scope

This package implements classical k-gram language model algorithms, including utilities for training, evaluation and text prediction. Language models are a cornerstone of Natural Language Processing applications, and the conceptual simplicity of k-gram models makes them a good baseline, with pedagogical value as well.
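For concreteness, a typical workflow looks roughly as follows (a minimal sketch: function names follow the package documentation, but the smoother choice and argument details should be treated as illustrative):

```r
library(kgrams)

# Toy corpus; in practice one would train on a much larger text source.
corpus <- c("a b a b", "a b b a")

# Extract k-gram frequency counts of all orders up to N = 2.
freqs <- kgram_freqs(corpus, N = 2)

# Build a bigram model with add-k smoothing; see ?language_model and
# smoothers() for the available smoothing techniques and their parameters.
model <- language_model(freqs, smoother = "add_k", k = 1)

# Evaluate perplexity on some text and generate random sentences.
perplexity(corpus, model)
sample_sentences(model, n = 3, max_length = 10)
```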

k-gram models are a simple form of machine learning applied to text data; as such, machine learning is definitely the most appropriate category among the ones above. I would be inclined to classify this as an "unsupervised" learning problem, since the target function being learned (the language's probability distribution over sentences) is clearly not explicit in the training data - but I have never seen this particular qualification in the literature.
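For context, this target function is approximated through the standard Markov factorization of sentence probabilities (the usual textbook formulation, stated here for clarity):

$$P(w_1, w_2, \dots, w_n) \approx \prod_{i = 1}^{n} P(w_i \mid w_{i-k+1}, \dots, w_{i-1}),$$

where indices below one refer to begin-of-sentence padding tokens. Training amounts to estimating the conditional probabilities on the right-hand side from observed k-gram counts, with smoothing to redistribute probability mass towards unseen k-grams.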

Not yet (NB: this is a pre-submission inquiry).

The package can be useful to students and researchers performing small-scale Natural Language Processing experiments. In addition, it might be helpful when building more complex language models, as a tool for quick baseline modeling.

I am not aware of any R package with the same purpose and functionality as kgrams. The CRAN package ngram has some overlap in scope, in that it provides k-gram tokenization algorithms and random text generation, but it offers no support for language model algorithms.

Not applicable.

  1. The package was accepted by CRAN some months ago.
  2. Despite the "lifecycle: experimental" badge and the development version number, I haven't made significant API changes or added features to this package in a long time. With this submission, I'm hoping to receive your feedback on possible improvements and on whether the package is mature enough to be considered stable.

noamross commented 3 years ago

Thank you for our first statistical package pre-submission, @vgherard! I believe this clearly falls in scope, and I look forward to a full submission once you have incorporated the srr standards component. I am querying the editorial board for an opinion on whether this package should also apply standards from the Supervised or Unsupervised learning categories.

vgherard commented 3 years ago

Thanks @noamross, great :) I will start looking into the srr standards, then. It may take me some time, but I'm up for it. Earlier I did a quick check with autotest, and it seems there's some trouble parsing some of my examples; let's see if I can get it to work quickly.
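(For reference, the quick check amounts to running something like the following from the package root; autotest_package() is the entry point documented by {autotest}, though the exact arguments and output format may differ:)

```r
# Run the full autotest suite on the package in the current directory;
# the result flags one potentially problematic behaviour per row,
# e.g. an example that could not be parsed.
res <- autotest::autotest_package()
res
```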

noamross commented 3 years ago

Please ping me and @mpadge here with any questions. We know we are still working out the kinks in the new system and are eager to help make the process better!

vgherard commented 3 years ago

Thanks @noamross (@mpadge), I've filed an issue at ropensci-review-tools/autotest#49

noamross commented 2 years ago

Hello @vgherard! We're going back through some in-progress submissions that got stuck in an ambiguous state. Sorry that we haven't reached out in a while. I just wanted to check whether rOpenSci peer review is something you are still interested in pursuing.

vgherard commented 2 years ago

Dear @noamross, thanks for checking in, and sorry for the long silence; I had totally forgotten that this process was still open.

Sadly, right now I'm too short on time for a relatively demanding submission like this... Apart from that, over time I've become a bit dissatisfied with certain aspects of this package, which I'd at least try to improve before submitting.

I'll close this, in the hope of coming back to it in the not-too-distant future :-)

Thanks!

mpadge commented 1 year ago

@vgherard Any updates on the status of your package? We'd still be very interested in receiving a full submission :+1:

vgherard commented 1 year ago

Dear all, thanks for keeping in touch.

I had a look at the requirements I would need to meet in order to submit {kgrams}, and again, I'm sorry, but this is too much for me right now.

The output of pkgcheck() alone looks intimidating: function names, usage of <<-, usage of sapply(), etc. Also, I imagine that passing the autotest and srr checks would be much more demanding.
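(To illustrate the kind of change involved: e.g. replacing sapply() with the type-stable vapply(). A sketch on a hypothetical character vector, not actual package code:)

```r
sentences <- c("a b c", "d e")  # hypothetical input

# Before: sapply() does not guarantee the type of its return value.
lens <- sapply(strsplit(sentences, " "), length)

# After: vapply() fixes the type and length of each result element.
lens <- vapply(strsplit(sentences, " "), length, FUN.VALUE = integer(1))
```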

Individually these are quick fixes, but with a package the size of {kgrams} it would take a good amount of effort to finally get the green light - an effort I'm not really interested in, since the only thing I'm doing with the package at the moment is keeping it alive on CRAN :')

To be clear, when I say "too much" I'm referring only to my individual case - I think the work you're doing in setting up this review process is awesome.

For future package ideas I will definitely consider implementing the rOpenSci standards from the outset!

mpadge commented 1 year ago

Thanks @vgherard, I definitely understand. It's a shame, but you are probably right that preparing it would not be a trivial amount of work. Thanks for considering it, and for the kind words; we look forward to future submissions at any time.