karthik / testdat

A package to run unit tests on tabular data

test_na and fix_na for levels as white space #31

Open jsonbecker opened 10 years ago

jsonbecker commented 10 years ago

One thing I run into a bunch is a blank field (most often containing only white space) used to mark a missing value. This is especially annoying with factors, because the blank then becomes its own level.

Currently, white space alone is not one of the NA_aliases (see here).

Should test_na and fix_na be updated to treat white space as missing, or should there be a new function that tests for empty levels or blank fields, with a matching fix that converts them to NA?
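As a rough sketch of what the second option might look like: the helper names `is_blank` and `blank_to_na` below are hypothetical and not part of testdat, and the sketch only touches character and factor columns.

```r
# Hypothetical helpers for illustration; not part of testdat.

# TRUE where a value is empty or consists only of white space.
is_blank <- function(x) {
  !is.na(x) & grepl("^[[:space:]]*$", as.character(x))
}

# Replace blank values with NA. For factors, also drop the now-unused
# blank levels so they no longer show up in levels() or table().
blank_to_na <- function(x) {
  if (is.factor(x)) {
    x[is_blank(x)] <- NA
    return(droplevels(x))
  }
  if (is.character(x)) {
    x[is_blank(x)] <- NA
  }
  x
}

f <- factor(c("a", " ", "b", ""))
levels(blank_to_na(f))
```

Applied to a whole data frame, something like `df[] <- lapply(df, blank_to_na)` would run it column by column and leave numeric columns unchanged.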

I'm happy to contribute an implementation of either.

karthik commented 10 years ago

Good question, @jasonpbecker. Can you give me an example of when this happens? By default, R should fill in NAs whenever it encounters an empty cell.

x1,x2,x3
4,1,3
5,,
6,3,234

If I read this .csv file into R, it will automatically convert blank fields to NA.

> (x <- read.csv("~/Desktop/temp.csv"))
  x1 x2  x3
1  4  1   3
2  5 NA  NA
3  6  3 234

I would really appreciate an example of this: "This is especially annoying with factors, which then creates a level for the blank space."

jsonbecker commented 10 years ago

So if you read this file:

foo, bar,,,,2014-09-10, 50.00
baz, bat, ,,2014-09-10, 2014-09-09, 105.00
foo, bat,6103914,,,2014-09-10, 5.00

> read.csv('~/Desktop/test.csv', header=FALSE, stringsAsFactors=FALSE)
   V1   V2      V3 V4         V5          V6  V7
1 foo  bar      NA NA             2014-09-10  50
2 baz  bat      NA NA 2014-09-10  2014-09-09 105
3 foo  bat 6103914 NA             2014-09-10   5

Column classes, and the values in V5:

> sapply(read.csv('~/Desktop/test.csv', header=FALSE, stringsAsFactors=FALSE), class)
         V1          V2          V3          V4          V5 
"character" "character"   "integer"   "logical" "character" 
         V6          V7 
"character"   "numeric" 
> table(read.csv('~/Desktop/test.csv', header=FALSE, stringsAsFactors=FALSE)$V5)

           2014-09-10 
         2          1 

If you don't use stringsAsFactors=FALSE, you get a similar result, but the white space is now a level in the factor for V5, etc.
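For anyone who wants to reproduce this without a file on disk, here is a minimal sketch that supplies the same three rows inline through the `text` argument. It also shows one possible read-time workaround using the standard `strip.white` and `na.strings` arguments of `read.csv`. This is only a workaround when reading, not a proposal for how test_na or fix_na should behave.

```r
# Same three rows as the example file, supplied inline.
csv <- "foo, bar,,,,2014-09-10, 50.00
baz, bat, ,,2014-09-10, 2014-09-09, 105.00
foo, bat,6103914,,,2014-09-10, 5.00"

# Reading with factors: the blank field becomes a level of V5.
raw <- read.csv(text = csv, header = FALSE, stringsAsFactors = TRUE)
levels(raw$V5)

# strip.white trims surrounding white space from unquoted fields, and
# adding "" to na.strings turns the fields that are then empty into NA.
clean <- read.csv(text = csv, header = FALSE, stringsAsFactors = TRUE,
                  strip.white = TRUE, na.strings = c("NA", ""))
levels(clean$V5)
sum(is.na(clean$V5))
```

This only helps when you control how the data is read. A data frame that already contains blank strings or blank factor levels would still need a separate test and fix.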