tidyverse / dplyr

dplyr: A grammar of data manipulation
https://dplyr.tidyverse.org/

`dplyr_reconstruct` can create data.table with corrupted secondary index #7048

Open ZIBOWANGKANGYU opened 4 days ago

ZIBOWANGKANGYU commented 4 days ago

Problem

Thanks @AMDraghici for your suggestions!

When the first input to bind_rows is a data.table, the output can carry a corrupted secondary index. This happens because the underlying dplyr_reconstruct function copies the first input's attributes, including its index, onto the combined result.

Reprex

The example below shows that the index attribute can be incorrect for the output.

> a <- data.table::data.table(cola = c(5, 2:4), colb = runif(4), colc= runif(4), cold = "c") # Create a data.table
> attributes(a)$.internal.selfref <- new("externalptr") # Set pointer to nil. This is necessary for the subset error below to happen in data.table, but it is not necessary to reproduce the corrupted index.
> a[cola == 4] # Give data.table a secondary index ("cola" column) by auto-indexing
    cola      colb      colc   cold
   <num>     <num>     <num> <char>
1:     4 0.1495679 0.6097216      c
> 
> attributes(a)$index # The secondary index is set correctly
integer(0)
attr(,"__cola")
[1] 2 3 4 1
> 
> b <- data.table::data.table(cola = -1, colb = 2, colc=3, cold = "d")
> 
> combined <- dplyr::bind_rows(list(a,b))
> 
> combined # combined is a data.table, with 5 rows
Index: <cola>
    cola      colb      colc   cold
   <num>     <num>     <num> <char>
1:     5 0.5535855 0.6024416      c
2:     2 0.3407051 0.9291365      c
3:     3 0.5007208 0.6823528      c
4:     4 0.1495679 0.6097216      c
5:    -1 2.0000000 3.0000000      d
> 
> attributes(combined)$index # Wrong! length of secondary index is only 4
integer(0)
attr(,"__cola")
[1] 2 3 4 1
> combined[cola==-1]
Empty data.table (0 rows and 4 cols): cola,colb,colc,cold # Wrong! The last row of combined should be returned
> combined
Index: <cola>
    cola      colb      colc   cold
   <num>     <num>     <num> <char>
1:     5 0.5535855 0.6024416      c
2:     2 0.3407051 0.9291365      c
3:     3 0.5007208 0.6823528      c
4:     4 0.1495679 0.6097216      c
5:    -1 2.0000000 3.0000000      d

Cause

In the bind_rows function, dplyr_reconstruct is used to set attributes for the output dataframe. https://github.com/tidyverse/dplyr/blob/be36acf9c86e5d4c3d97f97b8d3999b713123392/R/bind-rows.R#L79

out <- dplyr_reconstruct(out, first)

The dplyr_reconstruct function essentially copies every attribute of template_ other than names and row.names onto data. https://github.com/tidyverse/dplyr/blob/be36acf9c86e5d4c3d97f97b8d3999b713123392/src/reconstruct.cpp#L36

In the case above, every attribute of first (which has four rows), including index, is copied to out, which has five rows. The copied index therefore describes only four of the five rows, which causes the problem.
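
The attribute copy described above can be imitated in base R. This is only a sketch, not dplyr's actual implementation (which lives in src/reconstruct.cpp): `reconstruct_sketch` is a hypothetical helper, and a plain data.frame with a hand-made `index` attribute stands in for the data.table.

```r
# Hypothetical helper: copy every attribute of `template` except
# names and row.names onto `data`, as the issue describes.
reconstruct_sketch <- function(data, template) {
  attrs <- attributes(template)
  keep <- setdiff(names(attrs), c("names", "row.names"))
  for (nm in keep) {
    attr(data, nm) <- attrs[[nm]]
  }
  data
}

# Stand-in for the four-row data.table `a` and its secondary index.
template <- data.frame(cola = c(5, 2, 3, 4))
attr(template, "index") <- structure(integer(0), "__cola" = c(2L, 3L, 4L, 1L))

# Stand-in for the five-row result of binding `a` and `b`.
out <- data.frame(cola = c(5, 2, 3, 4, -1))
out <- reconstruct_sketch(out, template)

n_rows <- nrow(out)
n_index <- length(attr(attr(out, "index"), "__cola"))
```

Here `n_rows` is 5 while `n_index` is still 4: the index was copied verbatim from the four-row template and no longer matches the data it is attached to.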

Impact

Because the data.table produced by bind_rows has a corrupted secondary index, data.table's filtering skips some rows when filtering by the indexed column. This problem is not limited to bind_rows: other dplyr functions that call dplyr_reconstruct can also return data.tables with a corrupted secondary index. For example, full_join can produce unexpected results for the same reason.

> a <- data.table::data.table(cola = c(1:4), colb = runif(4), colc= runif(4), cold = "d")
> 
> attributes(a)$.internal.selfref <- new("externalptr") # Set pointer to nil
> a[cola == 3]
    cola      colb      colc   cold
   <int>     <num>     <num> <char>
1:     3 0.9968646 0.8137836      d
> 
> b <- data.table::data.table(cola = -1, cole = "e")
> 
> combined <- dplyr::full_join(a, b, by = "cola")
> 
> combined[cola==-1]
Empty data.table (0 rows and 5 cols): cola,colb,colc,cold,cole
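
One way to spot an affected object is to compare the length of each stored index vector with the number of rows. The function below is a minimal sketch: `has_stale_index` is a hypothetical name, not part of dplyr or data.table. It assumes the index layout shown in the output above (an `index` attribute whose own attributes hold the ordering vectors) and assumes that a length-0 ordering vector means the data is already sorted by that column.

```r
# Hypothetical diagnostic: TRUE if any stored ordering vector has a length
# that is neither 0 (assumed to mean "already sorted") nor nrow(x).
has_stale_index <- function(x) {
  idx <- attr(x, "index", exact = TRUE)
  if (is.null(idx)) {
    return(FALSE)
  }
  lens <- vapply(attributes(idx), length, integer(1))
  any(lens != 0L & lens != nrow(x))
}

# Base R stand-in for the five-row result carrying a four-row index.
stale <- data.frame(cola = c(1, 2, 3, 4, -1))
attr(stale, "index") <- structure(integer(0), "__cola" = c(1L, 2L, 3L, 4L))

# Same data without any index attribute.
fresh <- data.frame(cola = c(1, 2, 3, 4, -1))

stale_flag <- has_stale_index(stale)
fresh_flag <- has_stale_index(fresh)
```

With these inputs `stale_flag` is TRUE and `fresh_flag` is FALSE. The same check could in principle be run on the `combined` objects from the examples, though that has not been verified here.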

etiennebacher commented 3 days ago

Just putting the same examples with clearer (IMO) formatting:

Example 1

# Create a data.table
a <- data.table::data.table(cola = c(5, 2:4), colb = runif(4), colc= runif(4), cold = "c") 
# Set pointer to nil. This is necessary for the subset error below to happen
# in data.table, but it is not necessary to reproduce the corrupted index.
attributes(a)$.internal.selfref <- new("externalptr") 
# Give data.table a secondary index ("cola" column) by auto-indexing
a[cola == 4] 
#>     cola      colb       colc   cold
#>    <num>     <num>      <num> <char>
#> 1:     4 0.8401062 0.09284545      c

# The secondary index is set correctly
attributes(a)$index 
#> integer(0)
#> attr(,"__cola")
#> [1] 2 3 4 1

b <- data.table::data.table(cola = -1, colb = 2, colc=3, cold = "d")
combined <- dplyr::bind_rows(list(a,b))
# combined is a data.table, with 5 rows
combined 
#> Index: <cola>
#>     cola      colb       colc   cold
#>    <num>     <num>      <num> <char>
#> 1:     5 0.4526811 0.38061661      c
#> 2:     2 0.6131192 0.28859921      c
#> 3:     3 0.7053851 0.85011065      c
#> 4:     4 0.8401062 0.09284545      c
#> 5:    -1 2.0000000 3.00000000      d

# Wrong! length of secondary index is only 4
attributes(combined)$index 
#> integer(0)
#> attr(,"__cola")
#> [1] 2 3 4 1
combined[cola==-1]
#> Error: Internal error: index 'cola' exists but is invalid
combined
#> Index: <cola>
#>     cola      colb       colc   cold
#>    <num>     <num>      <num> <char>
#> 1:     5 0.4526811 0.38061661      c
#> 2:     2 0.6131192 0.28859921      c
#> 3:     3 0.7053851 0.85011065      c
#> 4:     4 0.8401062 0.09284545      c
#> 5:    -1 2.0000000 3.00000000      d

Example 2

a <- data.table::data.table(cola = c(1:4), colb = runif(4), colc= runif(4), cold = "d")
# Set pointer to nil
attributes(a)$.internal.selfref <- new("externalptr") 
a[cola == 3]
#>     cola      colb      colc   cold
#>    <int>     <num>     <num> <char>
#> 1:     3 0.1962404 0.8902132      d

b <- data.table::data.table(cola = -1, cole = "e")
combined <- dplyr::full_join(a, b, by = "cola")
combined
#> Index: <cola>
#>     cola      colb      colc   cold   cole
#>    <num>     <num>     <num> <char> <char>
#> 1:     1 0.1566911 0.6529508      d   <NA>
#> 2:     2 0.7213704 0.9832597      d   <NA>
#> 3:     3 0.1962404 0.8902132      d   <NA>
#> 4:     4 0.5184152 0.3268725      d   <NA>
#> 5:    -1        NA        NA   <NA>      e
combined[cola==-1]
#> Empty data.table (0 rows and 5 cols): cola,colb,colc,cold,cole