trinker / qdap

Quantitative Discourse Analysis Package: Bridging the gap between qualitative data and quantitative analysis
http://cran.us.r-project.org/web/packages/qdap/index.html

scrubber pastes single last letter to previous text. #207

Closed: trinker closed this issue 9 years ago

trinker commented 9 years ago

Sent as an email by @FabrizioMaccallini

Great package, thank you! I found something while using it in R that may be a bug. Given a string:

text <- "This in a"

# each of the following functions:
text <- replace_contraction(text)
text <- replace_abbreviation(text)
text <- scrubber(text)

# are going to return:
"This ina"

It seems that any standalone letter at the end is pasted onto the previous word. I only used these three functions, but this may affect other functions as well.

trinker commented 9 years ago

@FabrizioMaccallini Thanks for the issue. It's not a bug per se, because qdap has a strict endmark philosophy, but it did reveal some pretty poor coding choices. Bottom line: I altered the behavior to be more specific. You can install the development version via:

if (!require("pacman")) install.packages("pacman"); library(pacman)
pacman::p_load_gh("trinker/qdap")

This eliminates the issue, but you will continue to run into problems with other qdap functions. Here's the longer explanation:

This behavior arises because qdap assumes every line of text ends with punctuation (an endmark). For an incomplete sentence, use | as the endmark, which gives the desired output:

> text <- "This in a|"
> 
> replace_contraction(text)
[1] "This in a|"
> replace_abbreviation(text)
[1] "This in a|"
> scrubber(text)
[1] "This in a|"

The analysis of punctuation is integral to the qdap package. Many functions rely on ending punctuation in their algorithms and will typically throw a warning when it is missing. If this text had been processed by sentSplit we'd see:

sentSplit(data.frame(text = "This in a", stringsAsFactors = FALSE), "text")
  tot text
1 1.1 <NA>
Warning message:
In sentSplit(data.frame(text = "This in a", stringsAsFactors = FALSE),  :
  The following problems were detected:
missing ending punctuation

*Consider running `check_text`
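
As a rough pre-check before handing text to qdap, lines without ending punctuation can be flagged in base R. This is only a sketch: the helper name has_endmark is hypothetical (it is not part of qdap), and the endmark set . ? ! | is an assumption for illustration.

```r
# Hypothetical helper (not a qdap function): TRUE when a string ends in one
# of the assumed endmarks (. ? ! |), ignoring any trailing whitespace.
has_endmark <- function(x) {
    grepl("[.?!|]\\s*$", x)
}

text <- c("This in a", "This in a|", "This is a sentence.", "Really? ")
has_endmark(text)
# [1] FALSE  TRUE  TRUE  TRUE

# Append | (the incomplete-sentence endmark) to lines that lack an endmark
fixed <- ifelse(has_endmark(text), text, paste0(trimws(text), "|"))
fixed[1]
# [1] "This in a|"
```

For real data, check_text (as the warning suggests) is the more thorough option; the sketch only illustrates what "missing ending punctuation" means.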

That being said, the following line is the culprit:

    if (fix.space) {
        x <- paste0(Trim(substring(x, 1, ncx - 1)), substring(x, ncx))
    }

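This makes some pretty big assumptions and is not very controlled: it trims the whitespace in front of the final character no matter what that character is, so a trailing letter gets pasted onto the previous word. This line has been replaced with a regex that only removes whitespace sitting directly before trailing punctuation: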

    if (fix.space) {
        x <- Trim(gsub("(\\s+)([.?|!,]+)$", "\\2", x))
    }

This is a more satisfying approach. Again, thank you for raising this issue. I have credited you in the NEWS.md file: https://github.com/trinker/qdap/blob/master/NEWS.md
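
To compare the two versions of the fix.space step outside of qdap, here is a minimal base R sketch. The function names fix_space_old and fix_space_new are hypothetical wrappers, Trim is a stand-in for qdap's own Trim, and ncx is assumed to be the number of characters in x.

```r
# Stand-in for qdap's Trim(): strip leading and trailing whitespace
Trim <- function(x) gsub("^\\s+|\\s+$", "", x)

# Old approach: trim everything before the last character, then re-attach
# the last character, whatever it happens to be
fix_space_old <- function(x) {
    ncx <- nchar(x)  # assumption: ncx is the character count of x
    paste0(Trim(substring(x, 1, ncx - 1)), substring(x, ncx))
}

# New approach: drop whitespace only when it directly precedes a trailing
# run of punctuation (. ? | ! ,)
fix_space_new <- function(x) {
    Trim(gsub("(\\s+)([.?|!,]+)$", "\\2", x))
}

text <- c("This in a", "This in a .", "This in a |")

fix_space_old(text)
# [1] "This ina"   "This in a." "This in a|"

fix_space_new(text)
# [1] "This in a"  "This in a." "This in a|"
```

Both versions close the gap before a trailing endmark, but only the old one alters text that ends in a letter.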