tidyverse / vroom

Fast reading of delimited files
https://vroom.r-lib.org

Column row/number seems not to correspond to original data #479

Closed jwhendy closed 1 year ago

jwhendy commented 1 year ago

I can't use my original data as it's confidential, but wanted to give a general idea as perhaps someone can clarify if I'm doing something wrong. If not, give me some time to try and reproduce via fake data.

At the moment... this is work related, I'm new to this package, and I'd like to understand if I'm missing something silly.

I was running a script that reads in a CSV, and saw this warning:

Warning message:
One or more parsing issues, call `problems()` on your data frame for details, e.g.:
  dat <- vroom(...)
  problems(dat) 

From the docs, I see that problems() returns:

A data frame with one row for each problem and four columns: row,col - Row and column number that caused the problem, referencing the original input

In my script, I pass col_types, so I recreated the read (the script currently uses readr::read_csv()):

library(vroom)
col_types <- "TciciccccdcicclicdiildicdiTT"
dat <- vroom(file_path, col_types = col_types)

Now I do problems(dat) and see the first row (data obscured):

> problems(dat)
# A tibble: 31 × 5
       row   col expected   actual  file                                                                             
     <int> <int> <chr>      <chr>    <chr>                                                                            
 1 1647868    20 an integer a_string  C:/path_to_file…

However, column 20 is indeed defined as i, and I find:

> dat[, 20] %>% unique()
# A tibble: 2 × 1
   col_name
  <int>
1    NA
2    16

I recognize the format of a_string, which would put it in column 7, not 20. This is a pretty massive file. Is this somehow about delimiters or e.g. newlines causing something to bump onto the next line?
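One way to test the delimiter/newline hypothesis without opening the file in a spreadsheet is to count the fields on every physical line and flag the ones that disagree with the header. This is only a sketch: `find_ragged_lines()` is a hypothetical helper, the temporary file stands in for the real data, and it assumes a comma delimiter with double-quote quoting. It relies on base R's `utils::count.fields()`, which is documented to report `NA` for a line that starts a quoted field containing a newline, so those lines are flagged too.

```r
# Hypothetical helper: list physical lines whose field count differs
# from the first (header) line. Assumes "," as delimiter and '"' as quote.
find_ragged_lines <- function(path, sep = ",") {
  n <- utils::count.fields(
    path,
    sep = sep,
    quote = "\"",
    comment.char = "",         # do not treat '#' as a comment marker
    blank.lines.skip = FALSE   # keep line positions aligned with the file
  )
  expected <- n[1]
  # NA marks a line whose quoted field continues onto the next line
  bad <- which(is.na(n) | n != expected)
  data.frame(
    line = bad,
    fields = n[bad],
    expected = rep(expected, length(bad))
  )
}

# Made-up sample file: line 3 has one field too many.
tmp <- tempfile(fileext = ".csv")
writeLines(c("a,b,c", "1,x,2", "3,y,z,4", "5,w,6"), tmp)

ragged <- find_ragged_lines(tmp)
print(ragged)
```

If this returns no rows for the real file, a stray delimiter or an embedded newline becomes a less likely explanation for the shifted column.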

I read this in again, but used skip=1600000 and then saved it out. The first error is on row 47868, and when I look, everything seems as it should. a_string shown in the output of problems() is in Column G.

Let me know if something stands out I should check into further. It seems like a false alarm, but I wanted to try and investigate to make sure I wasn't missing something which would affect my results. Thanks!

jennybc commented 1 year ago

Hard to say and this doesn't immediately tweak my spidey sense about being connected to some known issue.

Have you looked at that specific line in the raw file or its general neighborhood to see if there's anything "interesting" about it?
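For a file too large to open directly, the neighborhood of a reported row can be printed as raw text. The following is a minimal sketch using base R only: `peek_lines()` is a hypothetical helper, and the sample file is invented. Whether the `row` value from problems() counts the header line is an assumption to verify, so a few lines of context on either side are included. vroom's own `vroom_lines()` (with `skip` and `n_max`) could be used the same way.

```r
# Hypothetical helper: return the raw text of line `row` plus `context`
# lines on either side, named by their physical line numbers.
peek_lines <- function(path, row, context = 2L) {
  first <- max(1L, row - context)
  con <- file(path, open = "r")
  on.exit(close(con))
  # Read and discard everything before the window
  if (first > 1L) readLines(con, n = first - 1L)
  lines <- readLines(con, n = row + context - first + 1L)
  names(lines) <- seq(first, length.out = length(lines))
  lines
}

# Made-up sample file standing in for the real data.
tmp <- tempfile(fileext = ".csv")
writeLines(c("a,b,c", "1,x,2", "3,y,4", "5,w,6", "7,v,8"), tmp)

shown <- peek_lines(tmp, row = 3, context = 1)
print(shown)
```

Looking at the unparsed text this way avoids any re-interpretation by a spreadsheet or by a second read with skip, either of which could hide quoting or line-ending oddities.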

jwhendy commented 1 year ago

@jennybc thanks for the skim, and I figured this would probably be a hard one to say much about. I wish I could just attach my file! I didn't see anything interesting. The file was too big to open directly in LibreOffice, so I had to do the trick mentioned:

I read this in again, but used skip=1600000 and then saved it out. The first error is on row 47868, and when I look, everything seems as it should. a_string shown in the output of problems() is in Column G.

Could we start simple since I'm new to vroom? These are all various ways of asking the same thing, to make sure I actually understand the error output correctly: