Closed k-hench closed 1 year ago
vroom's default locale assumes that `.` is the decimal mark, which then has implications for type guessing.
If you want type guessing to work on files like this, you need to inform vroom that `,` is the decimal mark.
library(vroom)
tmp <- tempfile()
vroom_write_lines(c("foo\tbar", "1,0\t0,1"), tmp)
vroom(tmp, delim = "\t", locale = locale(decimal_mark = ","))
#> Rows: 1 Columns: 2
#> ── Column specification ────────────────────────────────────────────────────────
#> Delimiter: "\t"
#> dbl (2): foo, bar
#>
#> ℹ Use `spec()` to retrieve the full column specification for this data.
#> ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
#> # A tibble: 1 × 2
#> foo bar
#> <dbl> <dbl>
#> 1 1 0.1
Created on 2023-06-27 with reprex v2.0.2.9000
Thank you very much for the quick reply. However, my point is not about decimal marks:
In the example, the `1,0` is a list of values (`1` and `0`) and NOT `1.0`.
The use case here is that the column `foo` contains two counts that are to be separated:
library(tidyverse)
vroom::vroom("~/Downloads/wtf1.tsv", delim = "\t", col_types = "c") |>
separate(foo, into = c("ref", "alt"), sep = ",")
#> # A tibble: 1 × 3
#> ref alt bar
#> <chr> <chr> <chr>
#> 1 1 0 0,1
This fails if the `col_types` are not specified, as the separator `,` is removed from the column `foo`:
vroom::vroom("~/Downloads/wtf1.tsv", delim = "\t") |>
separate(foo, into = c("ref", "alt"), sep = ",")
#>Rows: 1 Columns: 2
#>── Column specification ─
#> Delimiter: "\t"
#> chr (1): bar
#> num (1): foo
#>
#> ℹ Use `spec()` to retrieve the full column specification for this data.
#> ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
#> # A tibble: 1 × 3
#> ref alt bar
#> <chr> <chr> <chr>
#> 1 10 NA 0,1
#> Warning message:
#> Expected 2 pieces. Missing pieces filled with `NA` in 1 rows [1].
Ah, I see.
I think you've just bumped up against the hard fact that type guessing is hard and very fraught! I don't see anything that vroom could change that would be a net positive for users, in the large, even if you could imagine tweaks that are advantageous for this particular file.
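One way to sidestep guessing for columns known to hold comma-separated values is to pin only those columns to character and let the rest be guessed. The following is a minimal sketch, assuming the column is named `foo` as in the example above and that `vroom` is installed; `read_keeping_text()` is a hypothetical helper name, not part of vroom.

```r
# Hypothetical helper (not part of vroom): read a TSV while forcing the
# named columns to character; all other columns are still guessed.
read_keeping_text <- function(path, text_cols) {
  col_spec <- rep(list(vroom::col_character()), length(text_cols))
  names(col_spec) <- text_cols
  vroom::vroom(
    path,
    delim = "\t",
    col_types = do.call(vroom::cols, col_spec),
    show_col_types = FALSE
  )
}

# Usage on the two-column example from this thread (runs only if vroom is installed).
if (requireNamespace("vroom", quietly = TRUE)) {
  tmp <- tempfile(fileext = ".tsv")
  writeLines(c("foo\tbar", "1,0\t0,1"), tmp)
  print(read_keeping_text(tmp, "foo"))
}
```

Pinning only the affected columns keeps guessing available for everything else, at the cost of having to know those column names up front.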
Hello :wave:,
I stumbled on what I believe to be a bug, or at least quite dangerous behavior, where data imported with `vroom::vroom()` are altered in ways that are unexpected (for me at least). The issue reminds me a little of the infamous re-formatting of data in Excel.
In a nutshell, in cells with aggregated data (e.g. in the form `1,0`), `vroom()` in some cases appears to drop the `,` character and then interpret the concatenated value (`10`) as `numeric`. However, this does not appear to happen in a consistent way, such that sometimes `0,1` is actually interpreted as `"0,1"` (`character`). I believe that this depends on the leading value being a `0` vs. `[1-9]`.
This behavior can be avoided by specifying the column types (`vroom(..., col_types = "<types>")`); however, I believe that the default guess mode is likely used frequently enough to raise the issue.
If the described behavior is actually just `vroom()` working as intended and this is just a case of rtfm, then please never mind. (However, having used `vroom()` for many years and yet being surprised by this myself, I hope that this heads-up from the user perspective might still be of value.)
Minimal example
Creating minimal data set
The exported file (`wtf1.tsv`) should look like this:
Native `R` data import works as expected
On import with `vroom::vroom()` and automated data type detection, the `,` character is omitted and the value altered from `1,0` to `10`:
Importing the data with specified column types ("character") does not alter the data however:
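The steps above can be sketched as follows. This is a minimal reconstruction under assumptions, not the original reprex: the file is written to a temporary path rather than `wtf1.tsv`, its contents are taken from the `foo`/`bar` example elsewhere in this thread, and the `vroom` calls only run if the package is installed.

```r
# Write a one-row, tab-separated file whose cells hold comma-separated pairs.
tmp <- tempfile(fileext = ".tsv")
writeLines(c("foo\tbar", "1,0\t0,1"), tmp)

# Base R: neither cell is a valid number with dec = ".", so both stay character.
base_df <- read.delim(tmp, stringsAsFactors = FALSE)
str(base_df)

if (requireNamespace("vroom", quietly = TRUE)) {
  # Default guessing: per the report above, `foo` may come back as the number 10.
  guessed <- vroom::vroom(tmp, delim = "\t", show_col_types = FALSE)
  print(guessed)

  # Explicit character columns keep the cells exactly as written.
  as_text <- vroom::vroom(tmp, delim = "\t", col_types = "cc",
                          show_col_types = FALSE)
  print(as_text)
}
```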
Minimal 'real-life' example
I stumbled on the behavior while using `readr::read_tsv()`, which I believe uses `vroom()` under the hood. The original data is a summary file produced by the genomics software `gatk`, which is very widely used throughout the genomics community.
A slimmed-down version of the original file looks like this (`wtf2.tsv`):
Again, the native `R` import works as expected:
However, in the `vroom()` import, the columns starting with `0,[0-9]*` are being parsed as `numeric`, with the individual values being concatenated as the digits in the new value (switch in parsing behavior between columns `ES2551.AD` and `ES2692.AD`).
Again, specifying the `col_types` avoids the dropping of the `,` and the concatenation of the individual values.
To pinpoint the cause of the parsing behavior switch, I changed a single cell (`21,0` <-> `0,21` for `ES2551.AD`) and created an altered version of the data (`wtf3.tsv`):
Indeed, the leading `0` now seems to cause the column `ES2551.AD` to be parsed as `character` and to conserve the `,` also in the default guess mode (columns `ES2692.AD:ES2816.AD` still exhibit the issue though):
And again, the `col_types` can be used to avoid the behavior:
Session Info
Created on 2023-06-27 with reprex v2.0.2.9000
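Once the `.AD` columns are read as character, each comma-separated pair can be split into integer counts without relying on type guessing. The following is a minimal base-R sketch; `split_counts()` is a hypothetical helper, the `ref`/`alt` labels follow the `separate()` call used elsewhere in this thread, and the input values are illustrative rather than taken from the original file.

```r
# Hypothetical helper: split "ref,alt" strings into a two-column integer matrix.
split_counts <- function(x) {
  parts <- strsplit(x, ",", fixed = TRUE)
  # Fail loudly if any cell does not hold exactly two values.
  stopifnot(all(lengths(parts) == 2L))
  out <- matrix(as.integer(unlist(parts)), ncol = 2L, byrow = TRUE)
  colnames(out) <- c("ref", "alt")
  out
}

# Illustrative cells in the style of the `ES2551.AD` column.
counts <- split_counts(c("21,0", "0,21"))
print(counts)
```

The length check is deliberate: a cell whose comma was silently dropped on import (for example `"210"`) would stop here instead of producing a misleading `NA`.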