karthik / testdat

A package to run unit tests on tabular data

Add function to deal with UTF-8 characters #2

Open karthik opened 10 years ago

hadley commented 10 years ago

Here's something that I wrote a couple of years ago:

non_ascii <- function(x) {
  any(charToRaw(x) > 0x7F)
}

0x7F = 127. Any byte value higher than that implies a non-ASCII character.
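As a rough illustration of the byte threshold described above, here is a minimal sketch that reuses `non_ascii` exactly as posted. The sample strings are made up for the example, and the non-ASCII character is written as a `\u` escape so the snippet itself stays ASCII.

```r
# Copied from the comment above
non_ascii <- function(x) {
  any(charToRaw(x) > 0x7F)
}

# Plain ASCII: every byte is at or below 0x7F
charToRaw("cafe")
non_ascii("cafe")        # FALSE

# "\u00e9" (e with an acute accent) is stored as UTF-8 bytes above 0x7F
charToRaw("caf\u00e9")
non_ascii("caf\u00e9")   # TRUE

# charToRaw() works on a single string, so apply the check element-wise
# when testing a whole character vector
vapply(c("cafe", "caf\u00e9"), non_ascii, logical(1), USE.NAMES = FALSE)
```

Because `charToRaw()` only looks at one string, a vector has to be checked element by element, which is what the `vapply()` call at the end does.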

karthik commented 10 years ago

working on this now

harrysouthworth commented 9 years ago

According to its roxygen documentation:

#' This test will check every column in a data.frame for possible unicode characters.

But doesn't it just test the column names, not the contents of the columns?

ut8 <- simplify2array(lapply(colnames(dat),non_ascii))

I just wrote the following, which I /think/ checks the contents of the columns. (Presumably you want to test the column names as well, though.) It's just a wrapper for Hadley's function.

non_ascii_cols <- function(x) {
  x <- x[, sapply(x, function(X) is.character(X) | is.factor(X))]
  x[, sapply(x, is.factor)] <- apply(x[, sapply(x, is.factor)], 2, as.character)

  res <- sapply(x, function(X) apply(matrix(X, ncol=1), 1, function(Z) non_ascii(Z) ))
  apply(res, 2, function(X) sum(X) > 0)
}
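For comparison, here is a hedged sketch of one way to check both the column names and the column contents in a single pass. The name `non_ascii_report` and the sample data are hypothetical and are not part of testdat. It avoids `x[, ...]` indexing, since that drops to a plain vector when only one column is selected, and it skips `NA` values.

```r
# Copied from Hadley's comment earlier in the thread
non_ascii <- function(x) {
  any(charToRaw(x) > 0x7F)
}

# Hypothetical helper (not part of testdat): reports, per column, whether the
# column name or any value in the column contains a non-ASCII byte
non_ascii_report <- function(dat) {
  flag <- function(v) {
    v <- as.character(v)   # factors become their labels
    v <- v[!is.na(v)]      # ignore missing values
    any(vapply(v, non_ascii, logical(1), USE.NAMES = FALSE))
  }
  # Only character and factor columns can hold text
  is_text <- vapply(dat, function(col) is.character(col) || is.factor(col),
                    logical(1))
  in_contents <- is_text
  in_contents[is_text] <- vapply(dat[is_text], flag, logical(1))
  in_names <- vapply(names(dat), non_ascii, logical(1))
  data.frame(column = names(dat),
             in_names = unname(in_names),
             in_contents = unname(in_contents),
             stringsAsFactors = FALSE)
}

# Made-up example data: one non-ASCII value, plain ASCII column names
dat <- data.frame(name = c("Zoe", "Zo\u00eb"),
                  n = 1:2,
                  city = factor(c("Oslo", "Bergen")),
                  stringsAsFactors = FALSE)
non_ascii_report(dat)
```

In this example only the `name` column should be flagged under `in_contents`, and no column under `in_names`.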