Pre-submission inquiry for {kgrams}: Classical k-gram Language Models

Submitting Author: Valerio Gherardi (@vgherard)
Repository: https://github.com/vgherard/kgrams
Submission type: Pre-submission

Paste the full DESCRIPTION file inside a code block below:

Package: kgrams
Title: Classical k-gram Language Models
Version: 0.1.0.9000
Authors@R: 
    person(given = "Valerio",
           family = "Gherardi",
           role = c("aut", "cre"),
           email = "vgherard@sissa.it",
           comment = c(ORCID = "0000-0002-8215-3013"))
Description: 
        Tools for training and evaluating k-gram language models in R, 
        supporting several probability smoothing techniques, 
        perplexity computations, random text generation and more.
License: GPL (>= 3)
Encoding: UTF-8
LazyData: true
Roxygen: list(markdown = TRUE)
RoxygenNote: 7.1.1
SystemRequirements: C++11
LinkingTo: 
    Rcpp, RcppProgress
Imports: 
    Rcpp, rlang, methods, utils,  RcppProgress (>= 0.1), Rdpack
Depends: 
    R (>= 3.5)
Suggests: 
    testthat (>= 3.0.0),
    covr,
    knitr,
    rmarkdown
Config/testthat/edition: 3
RdMacros: Rdpack
VignetteBuilder: knitr
URL: https://vgherard.github.io/kgrams/,
    https://github.com/vgherard/kgrams
BugReports: https://github.com/vgherard/kgrams/issues

Scope

Please indicate which category or categories from our package fit policies this package falls under. (Please check an appropriate box below.)
- [ ] data retrieval
- [ ] data extraction
- [ ] database access
- [ ] data munging
- [ ] data deposition
- [ ] workflow automation
- [ ] version control
- [ ] citation management and bibliometrics
- [ ] scientific software wrappers
- [ ] database software bindings
- [ ] geospatial data
- [x] text analysis

Explain how and why the package falls under these categories (briefly, 1-2 sentences). Please note any areas you are unsure of:

This package implements classical k-gram language model algorithms, including utilities for training, evaluation and text prediction. Language models are a cornerstone of Natural Language Processing applications, and the conceptual simplicity of k-gram models makes them a good baseline with pedagogical value as well.

Who is the target audience and what are scientific applications of this package?

The package can be useful to students and researchers for small-scale Natural Language Processing experiments. It may also help when building more complex language models, by providing a quick baseline.

Are there other R packages that accomplish the same thing? If so, how does yours differ or meet our criteria for best-in-category?

I am not aware of any R package with the same purpose and functionality as kgrams. The CRAN package ngram partially overlaps in scope, in that it provides k-gram tokenization algorithms, but it offers no support for language model algorithms.

(If applicable) Does your package comply with our guidance around Ethics, Data Privacy and Human Subjects Research?

Not applicable

Any other questions or issues we should be aware of?:

The package was accepted by CRAN some months ago.
Despite the "lifecycle:experimental" badge and the development version number, I am not currently planning any major API changes or additional features for this package (except, of course, for those prompted by feedback or suggestions from an rOpenSci review).

ropensci / software-review

Pre-submission inquiry for {kgrams}: Classical k-gram Language Models #450