trinker closed this issue 9 years ago
@FabrizioMaccallini Thanks for the issue. It's not a bug per se, because qdap has a strict end-mark philosophy, but it did reveal some pretty poor coding choices. Bottom line: I altered the behavior to be more specific. You can install the development version via:
if (!require("pacman")) install.packages("pacman"); library(pacman)
pacman::p_load_gh("trinker/qdap")
This eliminates the issue, but you will continue to run into problems with other qdap functions. Here's the longer explanation:
This behavior occurs because qdap assumes that every line of text ends in punctuation. If the text is an incomplete sentence, it should end with |, which results in the desired output:
> text <- "This in a|"
>
> replace_contraction(text)
[1] "This in a|"
> replace_abbreviation(text)
[1] "This in a|"
> scrubber(text)
[1] "This in a|"
The analysis of punctuation is integral to the qdap package. Many functions rely on end marks in their algorithms and will typically throw a warning when they are missing. If this text had been processed by sentSplit, we'd see:
> sentSplit(data.frame(text = "This in a", stringsAsFactors = FALSE), "text")
  tot text
1 1.1 <NA>
Warning message:
In sentSplit(data.frame(text = "This in a", stringsAsFactors = FALSE), :
The following problems were detected:
missing ending punctuation
*Consider running `check_text`
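A quick way to spot such lines before handing text to qdap is sketched below in base R. It is only an illustration, not qdap's own check: the helper name `has_endmark` is hypothetical, and it assumes the end marks of interest are `.`, `?`, `!`, and `|`.

```r
# Hypothetical helper (not part of qdap): TRUE when a string ends in one of
# the assumed end marks (. ? ! |), optionally followed by whitespace.
has_endmark <- function(x) {
  grepl("[.?!|]\\s*$", x)
}

lines <- c("This in a", "This in a|", "Done. ")
has_endmark(lines)
```

Lines that come back FALSE are the ones that would need an end mark (for example `|` for an incomplete sentence) added.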
That being said, the following line is the culprit:
if (fix.space) {
    x <- paste0(Trim(substring(x, 1, ncx - 1)), substring(x, ncx))
}
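This makes some pretty big assumptions and is not very controlled: it trims everything before position ncx and pastes the remainder back on, whether or not that remainder is punctuation. This line has been replaced with a regex as follows:
if (fix.space) {
    x <- Trim(gsub("(\\s+)([.?|!,]+)$", "\\2", x))
}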
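The difference between the two versions can be reproduced without qdap installed. The sketch below rests on two assumptions: base R's `trimws()` stands in for qdap's `Trim()`, and `ncx` is taken to be `nchar(x)`, which the snippet above does not show. The function names `old_fix_space` and `new_fix_space` are made up for the comparison.

```r
# Assumption: trimws() approximates qdap's Trim(); ncx is assumed to be nchar(x).
old_fix_space <- function(x) {
  ncx <- nchar(x)
  # Trim everything before the last character, then glue the last character back on.
  paste0(trimws(substring(x, 1, ncx - 1)), substring(x, ncx))
}

new_fix_space <- function(x) {
  # Drop whitespace only when it sits directly before trailing punctuation.
  trimws(gsub("(\\s+)([.?|!,]+)$", "\\2", x))
}

x <- c("This in a", "This in a .", "This in a |")
old_fix_space(x)
new_fix_space(x)
```

Under these assumptions, the old version turns "This in a" into "This ina" because the final letter is treated as if it were an end mark, while the regex version leaves that string untouched and only closes the gap before real trailing punctuation.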
This is a more satisfying approach. Again, thank you for raising this issue. I have credited you in the NEWS.md file: https://github.com/trinker/qdap/blob/master/NEWS.md
Sent as an email by @FabrizioMaccallini