tidyverse / vroom

Fast reading of delimited files
https://vroom.r-lib.org

can't handle embedded nulls in multibyte encoding; need for skipNull argument? #340

Closed rgayler closed 3 years ago

rgayler commented 3 years ago

I was trying out vroom v1.4.0.9000 with an 8.8GB tab-delimited text file in UTF-16LE encoding (because of r-lib/vroom@7d54cda ).

vroom() returns about half the expected number of rows, and some of the character fields contain embedded tabs (the field delimiter!). I can read the same file accurately with read.delim(skipNul = TRUE).

FWIW I am using: Ubuntu 20.10; R 4.1.0; RStudio 1.4.1717

I don't have a reprex: nobody wants an 8.8GB file and encodings are a mystery to me, so I don't know how to dummy up some appropriate data.
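For what it's worth, a file of this shape could plausibly be dummied up along these lines (a sketch only; to16() is a throwaway helper, and it assumes the nulls sit between adjacent tabs, as confirmed later in the thread):

# Sketch: fabricate a tiny UTF-16LE tab-delimited file with a stray null
# code point (two zero bytes) sitting between two adjacent tabs.
to16 <- function(x) iconv(x, from = "UTF-8", to = "UTF-16LE", toRaw = TRUE)[[1]]
writeBin(
  c(to16("h1\th2\th3\r\nv1\t"), as.raw(c(0x00, 0x00)), to16("\tv3\r\n")),
  "dummy-utf16le.txt"
)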

Examples:

- read.delim(skipNul = TRUE)

d_rd_skip <- read.delim(
  "data/utf16le.txt", fileEncoding = 'UTF-16LE',
  header = TRUE, sep = "\t", colClasses = "character",
  strip.white = TRUE, na.strings = "",
  skipNul = TRUE,
  quote = "", allowEscapes = FALSE, comment.char = ""
)

dim(d_rd_skip)
#> [1] 9671886      90
# Correct number of rows
# Eyeballing the data frame looks OK

- read.delim(skipNul = FALSE)

d_rd_noskip <- read.delim(
  "data/utf16le.txt", fileEncoding = 'UTF-16LE',
  header = TRUE, sep = "\t", colClasses = "character",
  strip.white = TRUE, na.strings = "",
  skipNul = FALSE,
  quote = "", allowEscapes = FALSE, comment.char = ""
)
#> Warning messages:
#> 1: In read.table(file = file, header = header, sep = sep, quote = quote,  :
#>   line 2 appears to contain embedded nulls
#> 2: In read.table(file = file, header = header, sep = sep, quote = quote,  :
#>   line 3 appears to contain embedded nulls
#> 3: In read.table(file = file, header = header, sep = sep, quote = quote,  :
#>   line 4 appears to contain embedded nulls
#> 4: In read.table(file = file, header = header, sep = sep, quote = quote,  :
#>   line 5 appears to contain embedded nulls
#> 5: In scan(file = file, what = what, sep = sep, quote = quote, dec = dec,  :
#>   embedded nul(s) found in input

dim(d_rd_noskip)
#> [1] 9671886      90
# Correct number of rows
# Eyeballing the data frame shows many missing fields from phone_num onwards

read.delim(skipNul = FALSE) warns about embedded nulls and returns the correct number of rows, but leaves some expected field values missing. So skipNul = TRUE is needed for read.delim() to read this file correctly.

- vroom()

library(vroom) # version 1.4.0.9000

d_vr <- vroom(
  "data/utf16le.txt", locale = locale(encoding = "UTF-16LE"),
  col_names = TRUE, delim = "\t", col_types = cols(.default = col_character()),
  trim_ws = TRUE, na = "",
  quote = "", escape_double = FALSE, escape_backslash = FALSE, comment = ""
)

dim(d_vr)
#> [1] 4834776      90
# About half the number of rows there should be
# Eyeballing the data frame shows values in the wrong fields and character fields with embedded tabs (the field separator)

vroom() gives no warnings, yet it has not split the text correctly into fields.

Many of the arguments to vroom() mirror the arguments to read.delim(), but vroom() has no equivalent of read.delim()'s skipNul argument. Perhaps it should.

Is it possible this issue is related to https://github.com/tidyverse/readr/issues/306 ?

jimhester commented 3 years ago

Sorry, you will have to provide at least a small reproducible example to have any hope of a solution. Can you just post the first 100 rows or something similar?

rgayler commented 3 years ago

can you just post the first 100 rows or something similar?

Good idea. Two files attached: utf-16le.txt and utf-8.txt.

It turns out that the multi-byte encoding is not relevant. I get the same results in UTF-16LE and UTF-8. The files are tab-delimited text and contain embedded nulls in some of the otherwise blank fields. So instead of adjacent tabs, sometimes there is a null between tabs. (No, I don't know why they're like that, but that's how they come from the data source. Sigh)

This appears to confuse vroom() so that it doesn't recognise those tabs as field delimiters and they get included verbatim as part of the value of some field. Consequently it detects many fewer fields than there actually are in the data file.

read.delim(skipNul = TRUE) gets the intended values from the files. read.delim(skipNul = FALSE) gets the correct number of rows and columns but has a lot of missing field values.
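One quick way to check the raw field structure independently of read.delim() and vroom() (a sketch; if every line really has 90 fields, each should contain 89 tabs):

# Count tabs per line; readLines(skipNul = TRUE) drops the nulls on the way in.
lines <- readLines("utf-8.txt", skipNul = TRUE)
table(lengths(gregexpr("\t", lines, fixed = TRUE)))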

What I would like to see is the equivalent behaviour to read.delim(skipNul = TRUE) implemented in vroom().

# The file utf-16le.txt is the first 10 lines (head big-file > utf-16le.txt) of
# https://s3.amazonaws.com/dl.ncsbe.gov/data/Snapshots/VR_Snapshot_20081104.zip
# This data is published by the North Carolina State Board of Elections in accordance with North Carolina law.
# The NCSBE data dictionary says the file is encoded as UTF-16LE

# Make a UTF-8 copy of the file
# Lines beginning with $ are commands run in a Linux terminal
$ iconv -f UTF-16LE -t UTF-8 utf-16le.txt > utf-8.txt
iconv: incomplete character or shift sequence at end of buffer
# Something odd is going on

# Check the encodings
$ uchardet utf*
utf-16le.txt: UTF-16
utf-8.txt: UTF-8

# Check the number of lines
$ wc -l utf*
10 utf-16le.txt
9 utf-8.txt
19 total
# Looks like iconv might have taken the last input line as malformed

# Open utf-16le.txt with the Kate text editor and select UTF-16 encoding
# Has 10 lines of data
# About half the blank fields after zip_code have a non-printable character in them,
# presumably a null

# Open utf-8.txt with the Kate text editor and select UTF-8 encoding
# Has 10 lines of data
# About half the blank fields after zip_code have a non-printable character in them,
# presumably a null

# Open utf-16le.txt with the GHex binary editor
# Last 8 bytes are 00 36 00 35 00 0D 00 0A (6 5 CR LF)
# There are plenty of nulls (00 00) between adjacent tabs (00 09)

# Open utf-8.txt with the GHex binary editor
# Last 3 bytes are 36 35 0D (6 5 CR)
# There are plenty of nulls (00) between adjacent tabs (09)

# So iconv has dropped the final LF, thus mangling the final newline,
# which explains why wc thinks there is one less line in utf-8.txt
# This shouldn't affect any of the issues seen in earlier lines.
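# The same byte-level checks can be done from R without a hex editor
# (an illustrative sketch; counts and positions are file-specific):
b8 <- readBin("utf-8.txt", what = "raw", n = file.size("utf-8.txt"))
tail(b8, 3)           # expect 36 35 0d, i.e. "6" "5" CR with the final LF gone
sum(b8 == as.raw(0))  # how many stray null bytes survive in the UTF-8 copy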

# Read utf-16le.txt with read.delim(skipNul = TRUE)
d_rd16_skip <- read.delim(
  "utf-16le.txt", fileEncoding = 'UTF-16LE', skipNul = TRUE,
  header = TRUE, sep = "\t", colClasses = "character",
  strip.white = TRUE, na.strings = "",
  quote = "", allowEscapes = FALSE, comment.char = ""
)
dim(d_rd16_skip)
#> [1]  9 90

# Read utf-16le.txt with read.delim(skipNul = FALSE)
d_rd16_noskip <- read.delim(
  "utf-16le.txt", fileEncoding = 'UTF-16LE', skipNul = FALSE,
  header = TRUE, sep = "\t", colClasses = "character",
  strip.white = TRUE, na.strings = "",
  quote = "", allowEscapes = FALSE, comment.char = ""
)
#> Warning messages:
#> 1: In read.table(file = file, header = header, sep = sep, quote = quote,  :
#>   line 2 appears to contain embedded nulls
#> 2: In read.table(file = file, header = header, sep = sep, quote = quote,  :
#>   line 3 appears to contain embedded nulls
#> 3: In read.table(file = file, header = header, sep = sep, quote = quote,  :
#>   line 4 appears to contain embedded nulls
#> 4: In read.table(file = file, header = header, sep = sep, quote = quote,  :
#>   line 5 appears to contain embedded nulls
#> 5: In scan(file = file, what = what, sep = sep, quote = quote, dec = dec,  :
#>   embedded nul(s) found in input
dim(d_rd16_noskip)
#> [1]  9 90

# Read utf-16le.txt with vroom()
library(vroom) # version 1.4.0.9000
d_vr16 <- vroom(
  "utf-16le.txt", locale = locale(encoding = "UTF-16LE"),
  col_names = TRUE, delim = "\t", col_types = cols(.default = col_character()),
  trim_ws = TRUE, na = "",
  quote = "", escape_double = FALSE, escape_backslash = FALSE, comment = ""
)
dim(d_vr16)
#> [1]  4 90
problems(d_vr16)
#> # A tibble: 4 x 5
#>     row   col expected     actual        file
#>   <int> <int> <chr>        <chr>         <chr>
#> 1     2    55 "90 columns" 55 columns    ""
#> 2     3    55 "90 columns" 55 columns    ""
#> 3     4    90 ""           embedded null ""
#> 4     4   125 "90 columns" 125 columns   ""
#> Warning message:
#> One or more parsing issues, see `problems()` for details

# Comparing results
# read.delim(skipNul = TRUE) looks right
# read.delim(skipNul = FALSE) has right number of lines, but many values are missing
# install.packages("waldo")
library(waldo)
compare(d_rd16_skip, d_rd16_noskip)
#> `old$phone_num[1:7]`: "5846119" "2262036" "2267322" "2295238" "2274533" "3766297" "2288897"
#> `new$phone_num[1:7]`: NA        NA        NA        NA        "2274533" "3766297" "2288897"
#>
#> `old$race_code[1:7]`: "W" "W" "W" "W" "B" "W" "W"
#> `new$race_code[1:7]`: NA  NA  NA  NA  "B" "W" "W"
#>
#> `old$race_desc[1:7]`: "WHITE" "WHITE" "WHITE" "WHITE" "BLACK or AFRICAN AMERICAN" "WHITE" "WHITE"
#> `new$race_desc[1:7]`: NA      NA      NA      NA      "BLACK or AFRICAN AMERICAN" "WHITE" "WHITE"
#>
#> `old$ethnic_code[1:7]`: "NL" "NL" "NL" "NL" "NL" "NL" "NL"
#> `new$ethnic_code[1:7]`: NA   NA   NA   NA   "NL" "NL" "NL"
#>
#> old$ethnic_desc              | new$ethnic_desc
#> [1] "NOT HISPANIC or NOT LATINO" - NA                           [1]
#> [2] "NOT HISPANIC or NOT LATINO" - NA                           [2]
#> [3] "NOT HISPANIC or NOT LATINO" - NA                           [3]
#> [4] "NOT HISPANIC or NOT LATINO" - NA                           [4]
#> [5] "NOT HISPANIC or NOT LATINO" | "NOT HISPANIC or NOT LATINO" [5]
#> [6] "NOT HISPANIC or NOT LATINO" | "NOT HISPANIC or NOT LATINO" [6]
#> [7] "NOT HISPANIC or NOT LATINO" | "NOT HISPANIC or NOT LATINO" [7]
#>
#> `old$party_cd[1:7]`: "DEM" "REP" "REP" "REP" "DEM" "DEM" "REP"
#> `new$party_cd[1:7]`: NA    NA    NA    NA    "DEM" "DEM" "REP"
#>
#> `old$party_desc[1:7]`: "DEMOCRATIC" "REPUBLICAN" "REPUBLICAN" "REPUBLICAN" "DEMOCRATIC" "DEMOCRATIC" "REPUBLICAN"
#> `new$party_desc[1:7]`: NA           NA           NA           NA           "DEMOCRATIC" "DEMOCRATIC" "REPUBLICAN"
#>
#> `old$sex_code[1:7]`: "M" "F" "M" "F" "M" "M" "M"
#> `new$sex_code[1:7]`: NA  NA  NA  NA  "M" "M" "M"
#>
#> `old$sex[1:7]`: "MALE" "FEMALE" "MALE" "FEMALE" "MALE" "MALE" "MALE"
#> `new$sex[1:7]`: NA     NA       NA     NA       "MALE" "MALE" "MALE"
#>
#> `old$age[1:7]`: "61" "58" "102" "65" "88" "59" "68"
#> `new$age[1:7]`: NA   NA   NA    NA   "88" "59" "68"
#>
#> And 23 more differences ...

# Comparing results
# read.delim(skipNul = TRUE) looks right
# vroom() has fewer rows, and many values are missing or in the wrong fields
compare(d_rd16_skip, d_vr16)
#> `class(old)`:                              "data.frame"
#> `class(new)`: "spec_tbl_df" "tbl_df" "tbl" "data.frame"
#>
#> `attr(old, 'row.names')[2:9]`: 2 3 4 5 6 7 8 9
#> `attr(new, 'row.names')[2:4]`: 2 3 4
#>
#> `attr(old, 'problems')` is absent
#> `attr(new, 'problems')` is a pointer
#>
#> `attr(old, 'spec')` is absent
#> `attr(new, 'spec')` is an S3 object of class <col_spec>, a list
#>
#> `old$snapshot_dt[2:9]`: "2008-11-04" "2008-11-04" "2008-11-04" "2008-11-04" "2008-11-04" "2008-11-04" "2008-11-04" "2008-11-04"
#> `new$snapshot_dt[2:4]`: "2008-11-04" "2008-11-04" "2008-11-04"
#>
#> `old$county_id[2:9]`: "1" "1" "1" "1" "1" "1" "1" "1"
#> `new$county_id[2:4]`: "1" "1" "1"
#>
#> `old$county_desc[2:9]`: "ALAMANCE" "ALAMANCE" "ALAMANCE" "ALAMANCE" "ALAMANCE" "ALAMANCE" "ALAMANCE" "ALAMANCE"
#> `new$county_desc[2:4]`: "ALAMANCE" "ALAMANCE" "ALAMANCE"
#>
#> `old$voter_reg_num`: "000000004700" "000000006900" "000000007550" "000000008800" "000000008925" "000000013200" "000000014000" "000000022550" "000000027525"
#> `new$voter_reg_num`: "000000004700" "000000006900" "000000007550"                               "000000013200"
#>
#> `old$ncid`: "AA2055" "AA2062" "AA2066" "AA2073" "AA2077" "AA2112" "AA2116" "AA2169" "AA2214"
#> `new$ncid`: "AA2055" "AA2062" "AA2066"                   "AA2112"
#>
#> `old$status_cd`: "A" "A" "R" "A" "R" "R" "R" "R" "A"
#> `new$status_cd`: "A" "A" "R"     "R"
#>
#> And 95 more differences ...

# Look at this difference: embedded tabs (field separators) in a field value.
# Many fields have been concatenated into a single field value.
compare(d_rd16_skip["mail_addr3"], d_vr16["mail_addr3"])
#> `class(old)`:                "data.frame"
#> `class(new)`: "tbl_df" "tbl" "data.frame"
#>
#> `attr(old, 'row.names')[2:9]`: 2 3 4 5 6 7 8 9
#> `attr(new, 'row.names')[2:4]`: 2 3 4
#>
#> old$mail_addr3 | new$mail_addr3
#> [1] NA             - "5846119\tW\tWHITE\tNL\tNOT HISPANIC or NOT LATINO\tDEM\tDEMOCRATIC\tM\tMALE\t61\tNC\t1974-10-04\t03C\tCENTRAL BOONE\tBUR\tBURLINGTON" [1]
#> [2] NA             - "2262036\tW\tWHITE\tNL\tNOT HISPANIC or NOT LATINO\tREP\tREPUBLICAN\tF\tFEMALE\t58\tNY\t1973-10-08\t06E\tEAST GRAHAM\tGRA\tGRAHAM"     [2]
#> [3] NA             - "2267322\tW\tWHITE\tNL\tNOT HISPANIC or NOT LATINO\tREP\tREPUBLICAN\tM\tMALE\t102\tNC\t1968-04-20"                                     [3]
#> [4] NA             - "GRAHAM\tNC\t27253\t \t3766297\tW\tWHITE\tNL\tNOT HISPANIC or NOT LATINO\tDEM\tDEMOCRATIC\tM\tMALE\t59\tNC\t1982-04-02"                [4]
#> [5] NA             -
#> [6] NA             -
#> [7] NA             -
#> [8] NA             -
#> [9] NA             -

# Now try reading the UTF-8 version of the input with vroom()
d_vr8 <- vroom(
  "utf-8.txt", locale = locale(encoding = "UTF-8"),
  col_names = TRUE, delim = "\t", col_types = cols(.default = col_character()),
  trim_ws = TRUE, na = "",
  quote = "", escape_double = FALSE, escape_backslash = FALSE, comment = ""
)
dim(d_vr8)
#> [1]  4 90
problems(d_vr8)
#> # A tibble: 4 x 5
#>     row   col expected     actual        file
#>   <int> <int> <chr>        <chr>         <chr>
#> 1     2    55 "90 columns" 55 columns    ""
#> 2     3    55 "90 columns" 55 columns    ""
#> 3     4    90 ""           embedded null ""
#> 4     4   125 "90 columns" 125 columns   ""
#> Warning message:
#> One or more parsing issues, see `problems()` for details

# Compare the UTF-16 and UTF-8 results read with vroom()
compare(d_vr16, d_vr8)
#> ✓ No differences

jimhester commented 3 years ago

The way that vroom handles null fields is basically the same as readr: it truncates the field when a null is encountered and ignores the rest of the field.
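Using the tiny dummy file sketched near the top of this thread, the truncation can be poked at directly (a sketch; output omitted since it will vary with the vroom version):

library(vroom)
d <- vroom("dummy-utf16le.txt", delim = "\t",
           locale = locale(encoding = "UTF-16LE"),
           col_types = cols(.default = col_character()))
problems(d)  # compare against read.delim(skipNul = TRUE) on the same file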

To replicate the read.delim() behaviour I would suggest you strip the UTF-16LE null code points before reading. You can do this in a streaming fashion by passing vroom a pipe() connection that runs a perl one-liner to remove the null code points. After doing this, the results match those of read.delim() on the small example you gave (after a minor change to vroom, 3e39bad854ddb66e5744f8b64f7f6d2f0b8f86a1, to let it support passing a pipe() together with re-encoding). e.g.

f <- "~/data/utf-16le.txt"
if (!file.exists(f)) {
  download.file("https://github.com/r-lib/vroom/files/6567697/utf-16le.txt", f)
}

library(vroom)

d_rd16_skip <- read.delim(
  f, fileEncoding = "UTF-16LE", skipNul = TRUE,
  header = TRUE, sep = "\t", colClasses = "character",
  strip.white = TRUE, na.strings = "",
  quote = "", allowEscapes = FALSE, comment.char = ""
)

con <- pipe(sprintf("perl -pe 's/\\000\\000//g' < %s", f))

d_vr16 <- vroom(
  con, locale = locale(encoding = "UTF-16LE"),
  col_names = TRUE, delim = "\t", col_types = cols(.default = col_character()),
  trim_ws = TRUE, na = "",
  quote = "", escape_double = FALSE, escape_backslash = FALSE, comment = ""
)

testthat::expect_equal(as.data.frame(d_vr16), d_rd16_skip)

Created on 2021-06-01 by the reprex package (v2.0.0)
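Presumably the same trick works for the UTF-8 copy of the file, stripping single null bytes rather than UTF-16LE byte pairs (a variation on the above, not separately verified):

con8 <- pipe("perl -pe 's/\\000//g' < utf-8.txt")

d_vr8_clean <- vroom(
  con8, locale = locale(encoding = "UTF-8"),
  col_names = TRUE, delim = "\t", col_types = cols(.default = col_character()),
  trim_ws = TRUE, na = "",
  quote = "", escape_double = FALSE, escape_backslash = FALSE, comment = ""
)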

jimhester commented 3 years ago

Also note that vroom re-encodes the full file to your temp directory in the background every time you read it. If you are reading this file many times, you may want to re-encode it to UTF-8 (and remove the null code points) upfront yourself, with the iconv command line tool or other means, before reading it with vroom.
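A minimal sketch of that upfront clean-up, assuming a Unix-like system with iconv and tr on the PATH (file paths are placeholders):

# One-off: re-encode UTF-16LE -> UTF-8 and drop every null byte on the way.
# Afterwards vroom can read the clean file without re-encoding each time.
system("iconv -f UTF-16LE -t UTF-8 data/utf16le.txt | tr -d '\\000' > data/utf8-clean.txt")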

rgayler commented 3 years ago

it truncates the field when a null is encountered and ignores the rest of the field.

That said, the fact that it yields about half as many fields as there are tabs in the file, and produces some field values with embedded tabs, suggests it's not doing the truncate-and-ignore quite as I would expect.

In order to replicate the read.delim behavior I would suggest ...

All excellent suggestions. Thanks, @jimhester .

My current project that provoked this issue is an attempt at reproducible science, so I was hoping to do all the processing in R to maximise portability across platforms (including ones I don't have access to). I think I'll just declare the input file to be unpleasant, fix it outside R with Linux shell commands (converting to UTF-8 and deleting all nulls), and leave readers to sort it out for themselves if they need to reproduce this on a non-Linux platform.
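For what it's worth, the null-stripping could also stay inside R, avoiding the shell dependency entirely (a sketch, assuming a well-formed UTF-16LE file with an even byte count):

# Drop UTF-16LE null code points (byte pairs 00 00), working on 2-byte units
# so that legitimate zero bytes inside other code units are left alone.
raw_in <- readBin("utf-16le.txt", what = "raw", n = file.size("utf-16le.txt"))
units  <- matrix(raw_in, nrow = 2)
keep   <- !(units[1, ] == as.raw(0) & units[2, ] == as.raw(0))
tmp    <- tempfile(fileext = ".txt")
writeBin(as.vector(units[, keep]), tmp)
# tmp can now be read with vroom(..., locale = locale(encoding = "UTF-16LE"))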