Open ZIBOWANGKANGYU opened 4 days ago
Just putting the same examples with clearer (IMO) formatting:
# Create a data.table
a <- data.table::data.table(cola = c(5, 2:4), colb = runif(4), colc= runif(4), cold = "c")
# Set the self-reference pointer to nil. This is needed to trigger the subset
# error below in data.table, but it is not needed to reproduce the corrupted index.
attributes(a)$.internal.selfref <- new("externalptr")
# Give data.table a secondary index ("cola" column) by auto-indexing
a[cola == 4]
#> cola colb colc cold
#> <num> <num> <num> <char>
#> 1: 4 0.8401062 0.09284545 c
# The secondary index is set correctly
attributes(a)$index
#> integer(0)
#> attr(,"__cola")
#> [1] 2 3 4 1
b <- data.table::data.table(cola = -1, colb = 2, colc=3, cold = "d")
combined <- dplyr::bind_rows(list(a,b))
# combined is a data.table, with 5 rows
combined
#> Index: <cola>
#> cola colb colc cold
#> <num> <num> <num> <char>
#> 1: 5 0.4526811 0.38061661 c
#> 2: 2 0.6131192 0.28859921 c
#> 3: 3 0.7053851 0.85011065 c
#> 4: 4 0.8401062 0.09284545 c
#> 5: -1 2.0000000 3.00000000 d
# Wrong! length of secondary index is only 4
attributes(combined)$index
#> integer(0)
#> attr(,"__cola")
#> [1] 2 3 4 1
combined[cola==-1]
#> Error: Internal error: index 'cola' exists but is invalid
combined
#> Index: <cola>
#> cola colb colc cold
#> <num> <num> <num> <char>
#> 1: 5 0.4526811 0.38061661 c
#> 2: 2 0.6131192 0.28859921 c
#> 3: 3 0.7053851 0.85011065 c
#> 4: 4 0.8401062 0.09284545 c
#> 5: -1 2.0000000 3.00000000 d
a <- data.table::data.table(cola = c(1:4), colb = runif(4), colc= runif(4), cold = "d")
# Set pointer to nil
attributes(a)$.internal.selfref <- new("externalptr")
a[cola == 3]
#> cola colb colc cold
#> <int> <num> <num> <char>
#> 1: 3 0.1962404 0.8902132 d
b <- data.table::data.table(cola = -1, cole = "e")
combined <- dplyr::full_join(a, b, by = "cola")
combined
#> Index: <cola>
#> cola colb colc cold cole
#> <num> <num> <num> <char> <char>
#> 1: 1 0.1566911 0.6529508 d <NA>
#> 2: 2 0.7213704 0.9832597 d <NA>
#> 3: 3 0.1962404 0.8902132 d <NA>
#> 4: 4 0.5184152 0.3268725 d <NA>
#> 5: -1 NA NA <NA> e
# Wrong! The row with cola == -1 exists, but the subset returns nothing
combined[cola==-1]
#> Empty data.table (0 rows and 5 cols): cola,colb,colc,cold,cole
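
Both outputs above carry an index that no longer describes the table. A rough way to flag this is sketched below in base R. `index_looks_stale` is a hypothetical helper, not part of data.table or dplyr. It assumes the attribute layout printed above (an `index` attribute holding one `__<column>` order vector per index), and it further assumes that a zero-length order vector means the column was already sorted when the index was built. The two stand-in data frames only mimic the `combined` tables; the empty order vector in the second one is an assumption about what auto-indexing stores for sorted data.

```r
# Hypothetical helper: TRUE if any single-column index on `x` disagrees with
# the data. Assumes the indexed column contains no NAs.
index_looks_stale <- function(x) {
  idx <- attr(x, "index", exact = TRUE)
  if (is.null(idx)) {
    return(FALSE)
  }
  for (nm in names(attributes(idx))) {
    col <- sub("^__", "", nm)
    if (!startsWith(nm, "__") || !col %in% names(x)) {
      next  # skip multi-column indices and unrelated attributes
    }
    ord <- attr(idx, nm, exact = TRUE)
    values <- x[[col]]
    if (length(ord) == 0L) {
      # Assumed meaning: the column was sorted when the index was built
      stale <- is.unsorted(values)
    } else {
      stale <- length(ord) != length(values) || is.unsorted(values[ord])
    }
    if (stale) {
      return(TRUE)
    }
  }
  FALSE
}

# Stand-in for the bind_rows output: 5 rows, order vector of length 4
bind_like <- data.frame(cola = c(5, 2, 3, 4, -1))
idx <- integer(0)
attr(idx, "__cola") <- c(2L, 3L, 4L, 1L)
attr(bind_like, "index") <- idx

# Stand-in for the full_join output: index claims "already sorted" (assumed)
join_like <- data.frame(cola = c(1, 2, 3, 4, -1))
idx <- integer(0)
attr(idx, "__cola") <- integer(0)
attr(join_like, "index") <- idx

index_looks_stale(bind_like)  # TRUE: 4 positions for 5 rows
index_looks_stale(join_like)  # TRUE: the column is no longer sorted
```

The length comparison alone would miss the second case, which is why the sketch also checks that the column is actually ordered.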
Problem
Thanks @AMDraghici for your suggestions!
For example, in `bind_rows`, if the first input is a `data.table`, the output table can have corrupt indexing due to how the underlying `dplyr_reconstruct` function deals with the attributes of the two inputs.
Reprex
The example shows that the `index` attribute can be incorrect for the output.
Cause
In the `bind_rows` function, `dplyr_reconstruct` is used to set attributes for the output dataframe. https://github.com/tidyverse/dplyr/blob/be36acf9c86e5d4c3d97f97b8d3999b713123392/R/bind-rows.R#L79
Looking at the `dplyr_reconstruct` function, it is essentially giving all attributes other than `names` and `row.names` in `template_` to `data`. https://github.com/tidyverse/dplyr/blob/be36acf9c86e5d4c3d97f97b8d3999b713123392/src/reconstruct.cpp#L36
In the case above, all attributes of `first` (which has four rows), including `index`, are given to `out`, which has five rows. This causes the problem.
Impact
Because the `data.table` produced by `bind_rows` has a corrupted secondary index, the filter functionality of `data.table` skips some rows when filtering by the index column. Also, I found that this problem is not limited to `bind_rows`. Other `dplyr` functions that call `dplyr_reconstruct` can result in data.tables with a corrupted secondary index. For example, the `full_join` function can also produce unexpected results due to a corrupted secondary index.
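
The mechanism described under Cause can be imitated in base R without either package. The sketch below is only an illustration: `copy_template_attributes` is a hypothetical stand-in for the attribute copying described in this issue, not dplyr's actual implementation, and the `index` attribute is attached by hand to mimic the layout shown in the reprex output.

```r
# Hypothetical stand-in: copy every attribute of `template` except `names`
# and `row.names` onto `data`.
copy_template_attributes <- function(data, template) {
  template_attrs <- attributes(template)
  to_copy <- setdiff(names(template_attrs), c("names", "row.names"))
  for (nm in to_copy) {
    attr(data, nm) <- template_attrs[[nm]]
  }
  data
}

# Four-row input carrying a hand-made index for `cola`
first <- data.frame(cola = c(5, 2, 3, 4))
idx <- integer(0)
attr(idx, "__cola") <- c(2L, 3L, 4L, 1L)
attr(first, "index") <- idx

# Five-row result, as after appending one row
out <- data.frame(cola = c(5, 2, 3, 4, -1))
out <- copy_template_attributes(out, first)

nrow(out)                                                  # 5
length(attr(attr(out, "index", exact = TRUE), "__cola"))   # 4
```

The copied order vector still has four entries for a five-row result, which is the mismatch reported above.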