r-devel / r-dev-day

Repo to organize tasks for R Dev Days
7 stars 1 forks source link

tools::toTitleCase() (specific case from #24) #32

Open shannonpileggi opened 1 month ago

shannonpileggi commented 1 month ago

Breaking this out from issue #24

tools::toTitleCase() incorrectly capitalizes conjunctions (e.g. 'and') when using suspensive hyphenation https://bugs.r-project.org/show_bug.cgi?id=18674

shannonpileggi commented 1 month ago

From the bugzilla report:

# this behavior is incorrect
tools::toTitleCase("small- and large-scale analysis")
"Small- And Large-Scale Analysis"

# this behavior is correct
tools::toTitleCase("small-scale and large-scale analysis")
"Small-Scale and Large-Scale Analysis"

# this behavior is correct
tools::toTitleCase("small and large-scale analysis")
"Small and Large-Scale Analysis"

Documenting existing behavior, as existing unit tests are minimal.

tools::toTitleCase("bayesian network modeling and analysis")
"Bayesian Network Modeling and Analysis"

tools::toTitleCase("ensemble tool for predictions from species distribution models")
"Ensemble Tool for Predictions from Species Distribution Models"

tools::toTitleCase("a 'shiny' app for weibull analysis from 'weibullr'")
"a 'shiny' App for Weibull Analysis from 'weibullr'"

tools::toTitleCase("post-estimation functions for generalized linear mixed models")
"Post-Estimation Functions for Generalized Linear Mixed Models"

tools::toTitleCase("post- and post-estimation functions for generalized linear mixed models")
"Post- And Post-Estimation Functions for Generalized Linear Mixed Models"

tools::toTitleCase("general network (http/ftp/...) client interface for r")
"General Network (Http/Ftp/...) Client Interface for r"

tools::toTitleCase("i love r")
"i Love r"

tools::toTitleCase("r and i")
"r and i"

tools::toTitleCase("above the bar")
"above the Bar"

tools::toTitleCase("the bar is above")
"The Bar is above"
shannonpileggi commented 1 month ago

Suggested tests to add with their current behavior

# incorrect
tools::toTitleCase("i love r")
"i Love r"

# incorrect
tools::toTitleCase("r and i")
"r and i"

# incorrect
tools::toTitleCase("a new bug")
"a New Bug"

# correct
tools::toTitleCase("in a minute")
"In a Minute"

# correct
tools::toTitleCase("pre and post estimation")
"Pre and Post Estimation"

# incorrect
tools::toTitleCase("pre- and post estimation")
"Pre- And Post Estimation"
Bisaloo commented 1 month ago

From @shannonpileggi's test cases, we noticed the unexpected behaviour of single letter words (including 'a') not being capitalized, even at the beginning of a sentence.

This is at odds with the comments in the function source code:

https://github.com/r-devel/r-svn/blob/09f3acb92dcc180636a7c4e257491302e9a22ae8/src/library/tools/R/utils.R#L2662-L2663

    ## These should be lower case except at the beginning (and after :)
    lpat <- "^(a|an|and|are|as|at|be|but|by|en|for|if|in|is|nor|not|of|on|or|per|so|the|to|v[.]?|via|vs[.]?|from|into|than|that|with)$"

There is indeed an instruction to capitalize the first word of the sentence:

https://github.com/r-devel/r-svn/blob/09f3acb92dcc180636a7c4e257491302e9a22ae8/src/library/tools/R/utils.R#L2687

        l[1L] <- FALSE

But then a later instruction explicitly excludes single-letter words from the transformation:

https://github.com/r-devel/r-svn/blob/09f3acb92dcc180636a7c4e257491302e9a22ae8/src/library/tools/R/utils.R#L2696

        keep <- havecaps | l | (nchar(xx) == 1L) | alone

A potential solution is to special-case the first word of the sentence later in the logic, so it's not overwritten by following exclusions. We still want it to happen before alone exclusions are applied, to ensure alone words are always left alone, even if they are in first position.

This leads to the following patch:

Index: src/library/tools/R/utils.R
===================================================================
--- src/library/tools/R/utils.R (revision 86796)
+++ src/library/tools/R/utils.R (working copy)
@@ -2688,7 +2688,6 @@
         alone <- alone | grepl("^'.*'$", xx)
         havecaps <- grepl("^[[:alpha:]].*[[:upper:]]+", xx)
         l <- grepl(lpat, xx, ignore.case = TRUE)
-        l[1L] <- FALSE
         ## do not remove capitalization immediately after ": " or "- "
         ind <- grep("[-:]$", xx); ind <- ind[ind + 2L <= length(l)]
         ind <- ind[(xx[ind + 1L] == " ") & grepl("^['[:alnum:]]", xx[ind + 2L])]
@@ -2697,7 +2696,9 @@
         ind <- which(xx == '"'); ind <- ind[ind + 1L <= length(l)]
         l[ind + 1L] <- FALSE
         xx[l] <- tolower(xx[l])
-        keep <- havecaps | l | (nchar(xx) == 1L) | alone
+        keep <- havecaps | l | (nchar(xx) == 1L)
+        keep[1L] <- FALSE
+        keep <- keep | alone
         xx[!keep] <- sapply(xx[!keep], do1)
         paste(xx, collapse = "")
Bisaloo commented 1 month ago

The patch mentioned above doesn't completely harmonize the function behaviour with the comment. 'a' after ':' is still not capitalized:

tools::toTitleCase("Salzburg: a city of music")
#> [1] "Salzburg: a City of Music"

Created on 2024-07-12 with reprex v2.1.1

Bisaloo commented 1 month ago

Patch from @sarahzeller for the original issue:

Index: src/library/tools/R/utils.R
===================================================================
--- src/library/tools/R/utils.R (revision 86796)
+++ src/library/tools/R/utils.R (working copy)
@@ -2692,6 +2692,8 @@
         ## do not remove capitalization immediately after ": " or "- "
         ind <- grep("[-:]$", xx); ind <- ind[ind + 2L <= length(l)]
         ind <- ind[(xx[ind + 1L] == " ") & grepl("^['[:alnum:]]", xx[ind + 2L])]
+        # don't capitalize lpat words after hyphenation
+        ind <- ind[!(xx[ind] == "-" & grepl(lpat, xx[ind + 2L]))]
         l[ind + 2L] <- FALSE
         ## Also after " (e.g. "A Book Title")
         ind <- which(xx == '"'); ind <- ind[ind + 1L <= length(l)]
gmbecker commented 1 month ago

What is R CMD check using to calculate the correct title case. Perhaps we can simply use the same code there in toTitleCase...

Bisaloo commented 1 month ago

R CMD check is using toTitleCase():

https://github.com/r-devel/r-svn/blob/09f3acb92dcc180636a7c4e257491302e9a22ae8/src/library/tools/R/QC.R#L7848-L7871

    ## Check Title field.
    title <- trimws(as.vector(meta["Title"]))
    title <- gsub("[\n\t]", " ", title)
    package <- meta["Package"]
    if (tolower(title) == tolower(package)) {
        out$title_is_name <- TRUE
    } else {
        if(grepl(paste0("^",
                        gsub(".", "[.]", package, fixed = TRUE),
                        "[ :]"), title, ignore.case = TRUE))
            out$title_includes_name <- TRUE
        language <- meta["Language"]
        if(is.na(language) ||
           (language == "en") ||
           startsWith(language, "en-")) {
            title2 <- toTitleCase(title)
            ## Keep single quoted elements unchanged.
            p <- "(^|(?<=[ \t[:punct:]]))'[^']*'($|(?=[ \t[:punct:]]))"
            m <- gregexpr(p, title, perl = TRUE)
            regmatches(title2, m) <- regmatches(title, m)
            if(title != title2)
                out$title_case <- c(title, title2)
        }
    }
gmbecker commented 1 month ago

Didn't the bug report indicate that after using title case R CMD check complained that it wasn't in title case?

sarahzeller commented 1 month ago

The issue we're aiming to solve the following issue: After a hyphen, words that should not be capitalized are being capitalized. Examples include and, or, to, as specified in lpat earlier in the toTitleCase().

In the code, there are already lines checking for words following hyphens, to make sure they are capitalized. We add a check in this part, where we exclude those specific cases of “forbidden” words following a hyphen. This prevents such words from being incorrectly capitalized.

Index: src/library/tools/R/utils.R
===================================================================
--- src/library/tools/R/utils.R (revision 86796)
+++ src/library/tools/R/utils.R (working copy)
@@ -2692,6 +2692,8 @@
         ## do not remove capitalization immediately after ": " or "- "
         ind <- grep("[-:]$", xx); ind <- ind[ind + 2L <= length(l)]
         ind <- ind[(xx[ind + 1L] == " ") & grepl("^['[:alnum:]]", xx[ind + 2L])]
+        # don't capitalize lpat words after hyphenation
+        ind <- ind[!(xx[ind] == "-" & grepl(lpat, xx[ind + 2L]))]
         l[ind + 2L] <- FALSE
         ## Also after " (e.g. "A Book Title")
         ind <- which(xx == '"'); ind <- ind[ind + 1L <= length(l)]
Bisaloo commented 1 month ago

Proposal for a new bug report / discussion on bugzilla:


We noticed some edge cases in toTitleCase() that produced unexpected (to us!) results.

Mismatch between code comment and actual behaviour

'A' is not capitalized at the beginning of a sentence of after a colon.

This is at odds with the comments in the function source code:

https://github.com/r-devel/r-svn/blob/09f3acb92dcc180636a7c4e257491302e9a22ae8/src/library/tools/R/utils.R#L2662-L2663

    ## These should be lower case except at the beginning (and after :)
    lpat <- "^(a|an|and|are|as|at|be|but|by|en|for|if|in|is|nor|not|of|on|or|per|so|the|to|v[.]?|via|vs[.]?|from|into|than|that|with)$"

'A' is not capitalized at the beginning of a sentence

tools::toTitleCase("a new bug")
#> [1] "a New Bug"

There is indeed an instruction to capitalize the first word of the sentence:

https://github.com/r-devel/r-svn/blob/09f3acb92dcc180636a7c4e257491302e9a22ae8/src/library/tools/R/utils.R#L2687

        l[1L] <- FALSE

But then a later instruction explicitly excludes single-letter words from the transformation:

https://github.com/r-devel/r-svn/blob/09f3acb92dcc180636a7c4e257491302e9a22ae8/src/library/tools/R/utils.R#L2696

        keep <- havecaps | l | (nchar(xx) == 1L) | alone

A potential solution is to handle the first word of the sentence later in the logic, so it's not overwritten by following exclusions. We still want it to happen before alone exclusions are applied, to ensure alone words are always left alone, even if they are in first position.

This leads to the following patch:

Index: src/library/tools/R/utils.R
===================================================================
--- src/library/tools/R/utils.R (revision 86796)
+++ src/library/tools/R/utils.R (working copy)
@@ -2688,7 +2688,6 @@
         alone <- alone | grepl("^'.*'$", xx)
         havecaps <- grepl("^[[:alpha:]].*[[:upper:]]+", xx)
         l <- grepl(lpat, xx, ignore.case = TRUE)
-        l[1L] <- FALSE
         ## do not remove capitalization immediately after ": " or "- "
         ind <- grep("[-:]$", xx); ind <- ind[ind + 2L <= length(l)]
         ind <- ind[(xx[ind + 1L] == " ") & grepl("^['[:alnum:]]", xx[ind + 2L])]
@@ -2697,7 +2696,9 @@
         ind <- which(xx == '"'); ind <- ind[ind + 1L <= length(l)]
         l[ind + 1L] <- FALSE
         xx[l] <- tolower(xx[l])
-        keep <- havecaps | l | (nchar(xx) == 1L) | alone
+        keep <- havecaps | l | (nchar(xx) == 1L)
+        keep[1L] <- FALSE
+        keep <- keep | alone
         xx[!keep] <- sapply(xx[!keep], do1)
         paste(xx, collapse = "")

Note that this patch would capitalize all single-letter words at the beginning of a sentence, not just 'A'. We are not completely sure if this is the desired behaviour or not.

'A' is not capitalized after a colon

tools::toTitleCase("Salzburg: a city of music")
#> [1] "Salzburg: a City of Music"

This happens for the same reason, where single-letter words get a special treatment late in the logic. The only difference with the previous section is where the custom logic to capitalize the first word after a colon is applied:

https://github.com/r-devel/r-svn/blob/09f3acb92dcc180636a7c4e257491302e9a22ae8/src/library/tools/R/utils.R#L2689-L2691

        ## do not remove capitalization immediately after ": " or "- "
        ind <- grep("[-:]$", xx); ind <- ind[ind + 2L <= length(l)]
        ind <- ind[(xx[ind + 1L] == " ") & grepl("^['[:alnum:]]", xx[ind + 2L])]
        l[ind + 2L] <- FALSE

Special either exclusion list words are not capitalized at the beginning of a sentence or after a colon

At the moment, words in alone:

https://github.com/r-devel/r-svn/blob/09f3acb92dcc180636a7c4e257491302e9a22ae8/src/library/tools/R/utils.R#L2656-L2661

    ## leave these alone: the internal caps rule would do that
    ## in some cases.  We could insist on this exact capitalization.
    alone <- c("2D", "3D", "AIC", "BayesX", "GoF", "HTML", "LaTeX",
               "MonetDB", "OpenBUGS", "TeX", "U.S.", "U.S.A.", "WinBUGS",
               "aka", "et", "al.", "ggplot2", "i.e.", "jar", "jars",
               "ncdf", "netCDF", "rgl", "rpart", "xls", "xlsx")

and either:

https://github.com/r-devel/r-svn/blob/09f3acb92dcc180636a7c4e257491302e9a22ae8/src/library/tools/R/utils.R#L2665-L2670

    ## These we don't care about
    either <- c("all", "above", "after", "along", "also", "among",
                "any", "both", "can", "few", "it", "less", "log",
                "many", "may", "more", "over", "some", "their",
                "then", "this", "under", "until", "using", "von",
                "when", "where", "which", "will", "without",
                "yet", "you", "your")

are treated the same way:

https://github.com/r-devel/r-svn/blob/09f3acb92dcc180636a7c4e257491302e9a22ae8/src/library/tools/R/utils.R#L2683

        alone <- xx %in% c(alone, either)

But this leads to situations where words in either are not capitalized at the beginning of a sentence or after a colon:

tools::toTitleCase("more than writing code")
#> [1] "more than Writing Code"

We may want to separate the logic for alone and either and ensure special handling happens at the correct location for each of them.

In particular, we may want to ensure either words are capitalized at the start of the sentence, while we probably never want this for alone words.

reikookamoto commented 1 month ago

As mentioned in previous posts, the code has a vector named alone. This vector includes a mix of technical words that are in mixed case and others that are not usually capitalized. However, it is not comprehensive and misses important tools like R. While I am not sure of the best solution, the current setup appears incomplete and difficult to maintain.

Bisaloo commented 1 month ago

Quoted words are not protected from capitalization change when followed by a comma:

tools::toTitleCase("Import and Export 'SPSS', 'Stata' and 'SAS' Files")
#> [1] "Import and Export 'Spss', 'Stata' and 'SAS' Files"

tools::toTitleCase("Import and Export 'SPSS' 'Stata' and 'SAS' Files")
#> [1] "Import and Export 'SPSS' 'Stata' and 'SAS' Files"

Created on 2024-07-12 with reprex v2.1.1

shannonpileggi commented 1 month ago

A PR has been made to r-svn that

1) fixes the intially reported bug 2) fixes a spaced in the help documentation 3) adds two examples to the help documentation 4) adds two tests regarding the bug fixed

https://github.com/r-devel/r-svn/pull/172

shannonpileggi commented 1 month ago

Hi! Martin has committed changes here: https://github.com/r-devel/r-svn/commit/cd556f5b4264b89e8691cf08bd7b8e20d36fe75a

Does anyone want take lead on summarizing other potential bugs and adding that to the bugzilla report for consideration?

Bisaloo commented 1 month ago

We still have https://github.com/r-devel/r-svn/pull/174 on hold on bugzilla and I'm happy to post this comment about capitalization at the start of a string as a new report: https://github.com/r-devel/r-dev-day/issues/32#issuecomment-2225482167

@shannonpileggi, please let me know if the phrasing in https://github.com/r-devel/r-dev-day/issues/32#issuecomment-2225482167 seems good to you or if you'd tweak anything. If it sounds good, a :+1: reaction on the message to confirm your agreement would be great!

shannonpileggi commented 1 month ago

Yes, it is very through and looks great!! Please post @Bisaloo 🙏

Bisaloo commented 1 month ago

Thanks, it is now submitted as https://bugs.r-project.org/show_bug.cgi?id=18767.