tidyverse / vroom

Fast reading of delimited files
https://vroom.r-lib.org
Other
621 stars 60 forks source link

unexpected altering of data when guessing column data types on data import with vroom() #502

Closed k-hench closed 1 year ago

k-hench commented 1 year ago

Hello :wave: ,

I stumbled on what I believe to be a bug, or at least quite some dangerous behavior where in the data import with vroom::vroom() the data are altered in unexpected (for me at least) ways. The issue is reminds me a little of the infamous re-formatting of data in the Excell:

In a nutshell, in cells with aggregated data (eg in the form 1,0), vroom() in some cases appears drop the , character and then interpret the concatenated value (10) as numeric. However, this does not appear to happen in a consistent way, such that sometimes 0,1 is actually interpreted as "0,1" (character). I believe that this depends on the leading value being a 0 vs. [1-9].

This behavior can be avoided when specifying the column types (vroom(..., col_types = "<types>")), however I believe that the default guess mode likely used frequently enough to raise the issue.

If the descried behavior is actually just a description of vroom() working as intended and this is just a case of rtfm, then please never mind. (However, having used vroom() for many years and yet being surprised by this myself, I hope that this heads-up from the user-perspective might still be of value.)

Minimal example

library(vroom)
packageVersion("vroom")
#> [1] '1.6.3.9000'

Creating minimal data set

write.table(data.frame(foo = "1,0", bar = "0,1"),
            file = "~/Downloads/wtf1.tsv",
            sep = "\t",
            quote = FALSE,
            row.names = FALSE)

The exported file (wtf1.tsv) should look like this:

foo bar
1,0 0,1

Native R data import works as expected

read.table("~/Downloads/wtf1.tsv", header = TRUE)
#>   foo bar
#> 1 1,0 0,1

On import with vroom::vroom() and automated data type detection, the , character is omitted and the value altered from 1,0 to 10:

vroom("~/Downloads/wtf1.tsv", delim = "\t")
#> Rows: 1 Columns: 2
#> ── Column specification ────────────────────────────────────────────────────────
#> Delimiter: "\t"
#> chr (1): bar
#> num (1): foo
#> 
#> ℹ Use `spec()` to retrieve the full column specification for this data.
#> ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
#> # A tibble: 1 × 2
#>     foo bar  
#>   <dbl> <chr>
#> 1    10 0,1

Importing the data with specified column types ("character") does not alter the data however:

vroom("~/Downloads/wtf1.tsv", delim = "\t", col_types = "c")
#> # A tibble: 1 × 2
#>   foo   bar  
#>   <chr> <chr>
#> 1 1,0   0,1

Minimal 'real-life' example

I stumbled on the behavior while using readr::read_tsv(), which I believe uses vroom() under the hood. The original data is a summary file produced by the genomics software gatk, which is very widely used through genomics community.

A slimmed down version of the original file looks like this (wtf2.tsv):

CHROM   POS TYPE    105388.AD   105391.AD   105407.AD   ES2551.AD   ES2692.AD   ES2697.AD   ES2713.AD   ES2816.AD
NC_072356.1 2392    SNP 0,26    0,21    0,33    21,0    28,0    16,0    22,0    16,0
NC_072356.1 2742    SNP 0,36    0,18    0,39    26,0    33,0    25,0    29,0    25,0

Again, the native R import works as expected:

read.table("~/Downloads/wtf2.tsv", header = TRUE)
#>         CHROM  POS TYPE X105388.AD X105391.AD X105407.AD ES2551.AD ES2692.AD
#> 1 NC_072356.1 2392  SNP       0,26       0,21       0,33      21,0      28,0
#> 2 NC_072356.1 2742  SNP       0,36       0,18       0,39      26,0      33,0
#>   ES2697.AD ES2713.AD ES2816.AD
#> 1      16,0      22,0      16,0
#> 2      25,0      29,0      25,0

However, in the vroom() import, the columns starting with 0,[0-9]* are being parsed as numeric with the individual values being concatenated as the digits in the new value (switch in parsing behavior between columns ES2551.AD and ES2692.AD)

vroom("~/Downloads/wtf2.tsv", delim = "\t")
#> Rows: 2 Columns: 11
#> ── Column specification ────────────────────────────────────────────────────────
#> Delimiter: "\t"
#> chr (5): CHROM, TYPE, 105388.AD, 105391.AD, 105407.AD
#> dbl (1): POS
#> num (5): ES2551.AD, ES2692.AD, ES2697.AD, ES2713.AD, ES2816.AD
#> 
#> ℹ Use `spec()` to retrieve the full column specification for this data.
#> ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
#> # A tibble: 2 × 11
#>   CHROM        POS TYPE  `105388.AD` `105391.AD` `105407.AD` ES2551.AD ES2692.AD
#>   <chr>      <dbl> <chr> <chr>       <chr>       <chr>           <dbl>     <dbl>
#> 1 NC_072356…  2392 SNP   0,26        0,21        0,33              210       280
#> 2 NC_072356…  2742 SNP   0,36        0,18        0,39              260       330
#> # ℹ 3 more variables: ES2697.AD <dbl>, ES2713.AD <dbl>, ES2816.AD <dbl>

Again, specifying the col_types avoids the dropping of the , and the concatenation of the individual values.

vroom("~/Downloads/wtf2.tsv", delim = "\t", col_types = "cdccccccccc")
#> # A tibble: 2 × 11
#>   CHROM        POS TYPE  `105388.AD` `105391.AD` `105407.AD` ES2551.AD ES2692.AD
#>   <chr>      <dbl> <chr> <chr>       <chr>       <chr>       <chr>     <chr>    
#> 1 NC_072356…  2392 SNP   0,26        0,21        0,33        21,0      28,0     
#> 2 NC_072356…  2742 SNP   0,36        0,18        0,39        26,0      33,0     
#> # ℹ 3 more variables: ES2697.AD <chr>, ES2713.AD <chr>, ES2816.AD <chr>

To pinpoint the cause of the parsing behavior switch, I changed a single cell (21,0 <-> 0,21 for ES2551.AD) and created an altered version of the data (wtf3.tsv):

CHROM   POS TYPE    105388.AD   105391.AD   105407.AD   ES2551.AD   ES2692.AD   ES2697.AD   ES2713.AD   ES2816.AD
NC_072356.1 2392    SNP 0,26    0,21    0,33    0,21    28,0    16,0    22,0    16,0
NC_072356.1 2742    SNP 0,36    0,18    0,39    26,0    33,0    25,0    29,0    25,0

Indeed the leading 0 now seems to cause the column ES2551.AD to be parsed as character and to conserve the , also in the default guess mode (columns ES2692.AD:ES2816.AD still exhibit the issue though):

vroom("~/Downloads/wtf3.tsv", delim = "\t")
#> Rows: 2 Columns: 11
#> ── Column specification ────────────────────────────────────────────────────────
#> Delimiter: "\t"
#> chr (6): CHROM, TYPE, 105388.AD, 105391.AD, 105407.AD, ES2551.AD
#> dbl (1): POS
#> num (4): ES2692.AD, ES2697.AD, ES2713.AD, ES2816.AD
#> 
#> ℹ Use `spec()` to retrieve the full column specification for this data.
#> ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
#> # A tibble: 2 × 11
#>   CHROM        POS TYPE  `105388.AD` `105391.AD` `105407.AD` ES2551.AD ES2692.AD
#>   <chr>      <dbl> <chr> <chr>       <chr>       <chr>       <chr>         <dbl>
#> 1 NC_072356…  2392 SNP   0,26        0,21        0,33        0,21            280
#> 2 NC_072356…  2742 SNP   0,36        0,18        0,39        26,0            330
#> # ℹ 3 more variables: ES2697.AD <dbl>, ES2713.AD <dbl>, ES2816.AD <dbl>

And again, the col_types can be used to avoid the behavior:

vroom("~/Downloads/wtf3.tsv", delim = "\t", col_types = "cdccccccccc")
#> # A tibble: 2 × 11
#>   CHROM        POS TYPE  `105388.AD` `105391.AD` `105407.AD` ES2551.AD ES2692.AD
#>   <chr>      <dbl> <chr> <chr>       <chr>       <chr>       <chr>     <chr>    
#> 1 NC_072356…  2392 SNP   0,26        0,21        0,33        0,21      28,0     
#> 2 NC_072356…  2742 SNP   0,36        0,18        0,39        26,0      33,0     
#> # ℹ 3 more variables: ES2697.AD <chr>, ES2713.AD <chr>, ES2816.AD <chr>

Session Info

sessionInfo()
#> R version 4.3.1 (2023-06-16)
#> Platform: x86_64-pc-linux-gnu (64-bit)
#> Running under: Debian GNU/Linux 11 (bullseye)
#> 
#> Matrix products: default
#> BLAS:   /usr/lib/x86_64-linux-gnu/atlas/libblas.so.3.10.3 
#> LAPACK: /usr/lib/x86_64-linux-gnu/atlas/liblapack.so.3.10.3;  LAPACK version 3.9.0
#> 
#> locale:
#>  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
#>  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
#>  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
#>  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
#>  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
#> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
#> 
#> time zone: Europe/Berlin
#> tzcode source: system (glibc)
#> 
#> attached base packages:
#> [1] stats     graphics  grDevices utils     datasets  methods   base     
#> 
#> other attached packages:
#> [1] vroom_1.6.3.9000
#> 
#> loaded via a namespace (and not attached):
#>  [1] crayon_1.5.2      vctrs_0.6.3       cli_3.6.1         knitr_1.42       
#>  [5] rlang_1.1.1       xfun_0.39         purrr_1.0.1       styler_1.10.1    
#>  [9] glue_1.6.2        bit_4.0.5         htmltools_0.5.5   fansi_1.0.4      
#> [13] rmarkdown_2.21    R.cache_0.16.0    tibble_3.2.1      evaluate_0.20    
#> [17] tzdb_0.4.0        fastmap_1.1.1     yaml_2.3.7        lifecycle_1.0.3  
#> [21] compiler_4.3.1    fs_1.6.1          pkgconfig_2.0.3   rstudioapi_0.14  
#> [25] R.oo_1.24.0       R.utils_2.12.2    digest_0.6.31     utf8_1.2.3       
#> [29] reprex_2.0.2.9000 tidyselect_1.2.0  parallel_4.3.1    pillar_1.9.0     
#> [33] magrittr_2.0.3    R.methodsS3_1.8.1 tools_4.3.1       withr_2.5.0      
#> [37] bit64_4.0.5

Created on 2023-06-27 with reprex v2.0.2.9000

jennybc commented 1 year ago

vroom's default locale assumes that . is the decimal mark, which then has implications for type guessing.

If you want type guessing to work on files like this, you need to inform vroom that , is the decimal mark.

library(vroom)
tmp <- tempfile()
vroom_write_lines(c("foo\tbar", "1,0\t0,1"), tmp)
vroom(tmp, delim = "\t", locale = locale(decimal_mark = ","))
#> Rows: 1 Columns: 2
#> ── Column specification ────────────────────────────────────────────────────────
#> Delimiter: "\t"
#> dbl (2): foo, bar
#> 
#> ℹ Use `spec()` to retrieve the full column specification for this data.
#> ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
#> # A tibble: 1 × 2
#>     foo   bar
#>   <dbl> <dbl>
#> 1     1   0.1

Created on 2023-06-27 with reprex v2.0.2.9000

k-hench commented 1 year ago

thank you very much for the quick reply, however my point is not about decimal marks: In the example the 1,0 is a list of values (1 and 0) and NOT 1.0.

The use case here is that the column foo contains two counts that are to be separated:

library(tidyverse)
vroom::vroom("~/Downloads/wtf1.tsv", delim = "\t", col_types = "c") |> 
  separate(foo, into = c("ref", "alt"), sep = ",")
#> # A tibble: 1 × 3                                                                                                                                                           
#>  ref   alt   bar  
#>  <chr> <chr> <chr>
#> 1 1     0     0,1 

This fails if the col_types are not specified as the separator , is removed from the column foo:

 vroom::vroom("~/Downloads/wtf1.tsv", delim = "\t") |> 
+   separate(foo, into = c("ref", "alt"), sep = ",")
#>Rows: 1 Columns: 2                                                                                                                                                          
#>── Column specification ─
#> Delimiter: "\t"
#> chr (1): bar
#> num (1): foo
#>
#> ℹ Use `spec()` to retrieve the full column specification for this data.
#> ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
#> # A tibble: 1 × 3
#>  ref   alt   bar  
#>  <chr> <chr> <chr>
#> 1 10    NA    0,1  
#> Warning message:
#> Expected 2 pieces. Missing pieces filled with `NA` in 1 rows [1]. 
jennybc commented 1 year ago

Ah, I see.

I think you've just bumped up against the hard fact that type guessing is hard and very fraught! I don't see anything that vroom could change that would be a net positive for users, in the large, even if you could imagine tweaks that are advantageous for this particular file.