karthik / testdat

A package to run unit tests on tabular data
142 stars 20 forks source link

Potential fix for test_utf8 not working on some data frames #28

Closed christopher-beckham closed 10 years ago

christopher-beckham commented 10 years ago

Hi there,

First Git pull request - hopefully I didn't mess this up. I noticed that test_utf8 failed on some data frames, giving this error:

Error in charToRaw(x) : argument must be a character vector of length 1

The error is pretty intuitive, although I still couldn't find what was causing this. All the column names are strings so I've no idea what's happening there...

In my fork of the repo I added a file in inst/extdata called minimal.csv, if you load this CSV into R and run test_utf8 on it you will get the error. There was also another dataset I used, which worked fine, but when I reduced its dimensions to be the same as minimal.csv, I got the same error.

Despite not quite finding what the problem was, I decided to re-write one of the lines in test_utf8.R. Also no warnings need to be suppressed for this new line. :)

Cheers

karthik commented 10 years ago

Thanks for the PR! I'll review this in the next day or two and get back to you. Cheers.

christopher-beckham commented 10 years ago

Thanks. Whoops, just realized my second commit is also showing up in this pull request, didn't think that would happen. You can choose to ignore that if you want, I added a method to test for duplicate column names, which I assume hasn't been written yet.

karthik commented 10 years ago

@chrispy645 No worries. FYI, if you keep adding commits before a PR is merged, it will be added to the PR.

karthik commented 10 years ago

Hi @chrispy645 Thanks for this PR. I'll accept the first edit (makes sense) but I'll skip the second one since R will automatically deal with duplicate names (or at least much of that functionality already exists in base).

Example: This csv:

x,x,x
1,2,3
1,2,3
1,2,3

when read becomes:

read.csv('test_data.csv')
> data
  x x.1 x.2
1 1   2   3
2 1   2   3
3 1   2   3
christopher-beckham commented 10 years ago

Cool, thanks! Yeah, although it is true R will deal with duplicate column names when reading in a CSV, it's actually possible to change the column names after it's been loaded, e.g:

colnames(data)[1] = "x"
colnames(data)[2] = "x"

And it turns out, you can actually then save this to an .RData file, load it back in, and the duplicate column names will remain (at least when I do this through RStudio, anyway). Hence why I added the test. So do you still think it would be useful, or does base check this somehow?

(I think this very issue was in the second assignment of Jeff's data analysis course, which is where I got the idea from.)

Cheers

karthik commented 10 years ago

it's actually possible to change the column names after it's been loaded, e.g:

Hi @chrispy645, yes I'm aware. The rationale behind this package is that researchers load data from spreadsheets (xls) or csv that don't necessarily check for these issues. Once read into R, a user will fix them and then store the data back as flat files or as temporary Rdata files. Data never begin this way. So we're hoping to help researchers quickly identify these issues as the data are read in from the raw format.

Someone introducing problems back into data before storing as intermediate .Rdata files is an odd use case. So let me mull over it and if it turns out to still be useful, I'll retrieve it from your earlier commit.

Cheers

christopher-beckham commented 10 years ago

No worries. It is indeed an odd use case, so I understand. :)