wordpredictor: Develop Text Prediction Models Based on N-Grams

Submitting Author: Nadir Latif (@pakjiddat)
Repository: https://github.com/pakjiddat/word-predictor
Submission type: Pre-submission

Paste the full DESCRIPTION file inside a code block below:

Package: wordpredictor
Title: Develop Text Prediction Models Based on N-Grams
Version: 0.0.2
URL: https://github.com/pakjiddat/word-predictor,
    https://pakjiddat.github.io/word-predictor/
BugReports: https://github.com/pakjiddat/word-predictor/issues
Authors@R:
    person(given = "Nadir",
           family = "Latif",
           role = c("aut", "cre"),
           email = "pakjiddat@gmail.com",
           comment = c(ORCID = "0000-0002-7543-7405"))
Description: A framework for developing n-gram models for text prediction.
    It provides data cleaning, data sampling, extracting tokens from text,
    model generation, model evaluation and word prediction. For information
    on how n-gram models work we referred to: "Speech and Language
    Processing" https://web.stanford.edu/~jurafsky/slp3/3.pdf. For
    optimizing R code and using R6 classes we referred to "Advanced R"
    https://adv-r.hadley.nz/r6.html. For writing R extensions we referred to
    "R Packages", https://r-pkgs.org/index.html.
License: MIT + file LICENSE
Encoding: UTF-8
Roxygen: list(markdown = TRUE)
RoxygenNote: 7.1.1
Imports: digest, ggplot2, patchwork, stringr, dplyr, SnowballC
Suggests: testthat, covr, knitr, rmarkdown, markdown
VignetteBuilder: knitr
Language: en-US

Scope

Please indicate which category or categories from our package fit policies this package falls under (please check an appropriate box below):
- [ ] data retrieval
- [ ] data extraction
- [ ] database access
- [ ] data munging
- [ ] data deposition
- [ ] workflow automation
- [ ] version control
- [ ] citation management and bibliometrics
- [ ] scientific software wrappers
- [ ] database software bindings
- [ ] geospatial data
- [X] text analysis
Explain how and why the package falls under these categories (briefly, 1-2 sentences). Please note any areas you are unsure of:

The package generates n-gram models from plain text files. It supports text file analysis, data cleaning, tokenization, generation and evaluation of n-gram models, and word prediction. "Text analysis" seems to be the most suitable category for the package.
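
To illustrate the general idea of n-gram word prediction described above, here is a minimal, language-neutral sketch in Python. It is not the wordpredictor API: the function names `tokenize`, `build_bigram_model`, and `predict_next`, as well as the toy corpus, are hypothetical and chosen only for this example. It assumes the simplest case, a bigram model that predicts the next word from raw follower counts, with no smoothing or back-off.

```python
from collections import Counter, defaultdict
import re


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def build_bigram_model(lines):
    """Count, for every word, how often each other word follows it."""
    model = defaultdict(Counter)
    for line in lines:
        tokens = tokenize(line)
        for prev, nxt in zip(tokens, tokens[1:]):
            model[prev][nxt] += 1
    return model


def predict_next(model, word, k=3):
    """Return up to k of the most frequent followers of `word`."""
    followers = model.get(word.lower(), Counter())
    return [w for w, _ in followers.most_common(k)]


# Hypothetical toy corpus, for illustration only.
corpus = [
    "the cat sat on the mat",
    "the cat ate the fish",
    "the dog sat on the rug",
]
model = build_bigram_model(corpus)
print(predict_next(model, "the"))    # most frequent follower comes first
print(predict_next(model, "sat"))    # ['on']
print(predict_next(model, "zebra"))  # [] because the word was never seen
```

A practical model would typically use higher-order n-grams with back-off to shorter contexts when a context is unseen.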
Who is the target audience and what are scientific applications of this package?

The target audience is users who need to analyse text using n-gram models. The package may be used in applications that require word prediction, spell checking, autocompletion, or search.
Are there other R packages that accomplish the same thing? If so, how does yours differ or meet our criteria for best-in-category?

The quanteda and tm packages also support text analysis. These packages are quite advanced and are widely used for word frequency analysis.

The wordpredictor package differs from tm and quanteda in that it allows generating self-contained n-gram models. It also allows evaluating model performance using both intrinsic and extrinsic evaluation.
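
As a rough illustration of the two kinds of evaluation, the sketch below assumes that intrinsic evaluation is measured as perplexity on held-out text and extrinsic evaluation as next-word prediction accuracy. These are common choices for n-gram models, but they are assumptions here, not a description of how wordpredictor implements evaluation. The code is Python rather than R, and every name in it (`train`, `prob`, `perplexity`, `accuracy`) is hypothetical. It uses a bigram model with add-one smoothing.

```python
import math
from collections import Counter, defaultdict


def train(sentences):
    """Collect bigram counts, context-word counts, and the vocabulary."""
    bigrams, contexts, vocab = defaultdict(Counter), Counter(), set()
    for sentence in sentences:
        tokens = sentence.lower().split()
        vocab.update(tokens)
        for prev, nxt in zip(tokens, tokens[1:]):
            bigrams[prev][nxt] += 1
            contexts[prev] += 1
    return bigrams, contexts, vocab


def prob(model, prev, nxt):
    """Add-one (Laplace) smoothed estimate of P(nxt | prev)."""
    bigrams, contexts, vocab = model
    count = bigrams[prev][nxt] if prev in bigrams else 0
    return (count + 1) / (contexts[prev] + len(vocab))


def perplexity(model, sentences):
    """Intrinsic measure: exp of the mean negative log-probability per bigram."""
    log_sum, n = 0.0, 0
    for sentence in sentences:
        tokens = sentence.lower().split()
        for prev, nxt in zip(tokens, tokens[1:]):
            log_sum += math.log(prob(model, prev, nxt))
            n += 1
    return math.exp(-log_sum / n)


def accuracy(model, sentences):
    """Extrinsic measure: share of bigrams whose second word is the top prediction."""
    bigrams = model[0]
    hits, n = 0, 0
    for sentence in sentences:
        tokens = sentence.lower().split()
        for prev, nxt in zip(tokens, tokens[1:]):
            followers = bigrams.get(prev)
            top = followers.most_common(1)[0][0] if followers else None
            hits += (top == nxt)
            n += 1
    return hits / n


# Hypothetical toy data, for illustration only.
train_set = [
    "the cat sat on the mat",
    "the cat ate the fish",
    "the dog sat on the rug",
]
held_out = ["the cat sat on the rug"]
scrambled = ["rug the on sat cat"]  # contains only unseen bigrams

model = train(train_set)
print(perplexity(model, held_out), accuracy(model, held_out))
print(perplexity(model, scrambled), accuracy(model, scrambled))
```

Lower perplexity and higher accuracy both indicate a better fit; on this toy data the scrambled sentence scores worse on both measures than the held-out sentence.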
(If applicable) Does your package comply with our guidance around Ethics, Data Privacy and Human Subjects Research?

I think the package complies with the guidance around Ethics, Data Privacy and Human Subjects Research.
Any other questions or issues we should be aware of?:

I developed the wordpredictor package as part of the Data Science Capstone course, in order to fulfill the project requirements.

The main project requirement was to develop an application for predicting words that functions like the Microsoft SwiftKey application. See this online presentation for details of the project.

The main functionality provided by the wordpredictor package is word prediction. Here is an online demo showing a possible use for the package.

I would like to improve the wordpredictor package so others find it useful.

ropensci / software-review

wordpredictor: Develop Text Prediction Models Based on N-Grams #448
