joshuaulrich / xts

Extensible time series class that provides uniform handling of many R time series classes by extending zoo.
http://joshuaulrich.github.io/xts/
GNU General Public License v2.0
220 stars 71 forks source link

Wrong results when date-indexed xts with duplicate index values is subset by its own index #275

Closed joshuaulrich closed 5 years ago

joshuaulrich commented 5 years ago

Given a xts object with a date index that contains duplicate dates, incorrect results are returned if the object is subset with its own date index.

Thanks to scs for their Stackoverflow question.

require(xts)
Data <- structure(
  list(
    timestamp = c("2013-03-06 01:00:00", "2014-07-06 21:00:00",
                  "2014-07-31 23:00:00", "2014-08-09 17:00:00",
                  "2014-08-14 20:00:00", "2014-08-14 22:00:00",
                  "2014-08-16 15:00:00", "2014-08-19 02:00:00", 
                  "2014-12-28 18:00:00", "2015-01-17 17:00:00"),
    user = c(1, 2, 2, 3, 3, 3, 3, 3, 4, 4)),
  .Names = c("timestamp", "user"),
  row.names = c("220667", "331481", "422653", "629430", "378111", "646137",
                "558638", "151641", "599370", "482750"),
  class = "data.frame")   

(x <- xts(Data$user, as.Date(Data$timestamp)))
#            [,1]
# 2013-03-06    1
# 2014-07-06    2
# 2014-07-31    2
# 2014-08-09    3
# 2014-08-14    3
# 2014-08-14    3
# 2014-08-16    3
# 2014-08-19    3
# 2014-12-28    4
# 2015-01-17    4
x[index(x)]  # Different (wrong) data. Why?
#            [,1]
# 2014-07-31    2
# 2014-08-09    3
# 2014-08-14    3
# 2014-08-14    3
# 2014-08-14    3
# 2014-08-14    3
# 2014-08-16    3
# 2014-08-19    3
# 2014-12-28    4
# 2015-01-17    4

Session Info

sessionInfo()
R version 3.5.1 (2018-07-02)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 18.04.1 LTS

Matrix products: default
BLAS: /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.7.1
LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.7.1

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] xts_0.11-2 zoo_1.8-3 

loaded via a namespace (and not attached):
[1] compiler_3.5.1  grid_3.5.1      lattice_0.20-35
pverspeelt commented 5 years ago

I did some more checking.

# remove rownames from data
rownames(Data) <- NULL

xts with as.Date test

test_xts_date <- xts(x=Data[, -1], order.by = as.Date(Data$timestamp))

# check if identical xts and xts with index
identical(test_xts_date, test_xts_date[index(test_xts_date)])
[1] FALSE

test_date <- data.frame(timestamp = as.character(index(test_xts_date)), user = coredata(test_xts_date), stringsAsFactors = FALSE)
Data2 <- data.frame(timestamp = as.character(as.Date(as.POSIXct(Data$timestamp))), user = Data$user, stringsAsFactors = FALSE)

# Check if input equals output
identical(Data2, test_date)
[1] TRUE

So the issue is purely when indexing / selecting records within the xts object. Now if you look at the outcome of the xts with index compared to the input data, you can see that the first 2 records disappeared.

test_xts_date[index(test_xts_date)])
           [,1]
2014-07-31    2
2014-08-09    3
2014-08-14    3
2014-08-14    3
2014-08-14    3
2014-08-14    3
2014-08-16    3
2014-08-19    3
2014-12-28    4
2015-01-17    4

But the issue is not as such with the index. If I change the index with the dates from the Data$timestamp you have the same issue.

dates <- as.Date(Data$timestamp)
test_xts_date[dates]
           [,1]
2014-07-31    2
2014-08-09    3
2014-08-14    3
2014-08-14    3
2014-08-14    3
2014-08-14    3
2014-08-16    3
2014-08-19    3
2015-01-17    4
2015-01-17    4

I also checked what happened if the timestamp was not passed as a date, but as Posixct. Then it works correctly, but only if there are no duplicate timestamps. As soon as you introduce a duplicate record, the issue appears.

I will do some more tests later.

pverspeelt commented 5 years ago

I have narrowed the issue down to the function fill_window_dups_rev. When window_idx is called to subset the xts object, all the firstlast objects and the tmp object just before calling fill_window_dups_rev contain 10 records. The res object after calling fill_window_dups_rev suddenly contains 12 records with the first 2 as NA and index 5 and 6 (the ones with date "2014-08-14") are duplicated in the res object. This is passed into the last filling of firstlast and hence you get duplicated records for index position 5 and 6 and index positions 1 and 2 are dropped.

debugging:

if(usr_idx && !is.null(firstlast)) {
    # Translate from user .index to xts index
    # We get back upper bound of index as per findInterval
    tmp <- base_idx[firstlast]

    res <- .Call("fill_window_dups_rev", tmp, .index(x), PACKAGE = "xts")
    firstlast <- rev(res)
  }

result of the objects:

tmp
 [1]  1  2  3  4  6  6  7  8  9 10

res
 [1] 10  9  8  7  6  5  6  5  4  3 NA NA

firstlast 
[1] NA NA  3  4  5  6  5  6  7  8  9 10

Which when the NA's are removed from the firstlast object the outcome results in the issue specified above.

joshuaulrich commented 5 years ago

Thanks for the detailed investigation @pverspeelt! The fill_window_dups_rev() function was introduced as part of the fix for #251. I used the previous commit (1d707c57042203b7e53d7a8c5bd0ad173d9840d6) to check whether this was a regression caused by the fix, or if it existed prior to the fix as well.

The output below shows the results prior to introducing the fill_window_dups_rev() function.

require(xts)
Data <- structure(
  list(
    timestamp = c("2013-03-06 01:00:00", "2014-07-06 21:00:00",
                  "2014-07-31 23:00:00", "2014-08-09 17:00:00",
                  "2014-08-14 20:00:00", "2014-08-14 22:00:00",
                  "2014-08-16 15:00:00", "2014-08-19 02:00:00", 
                  "2014-12-28 18:00:00", "2015-01-17 17:00:00"),
    user = c(1, 2, 2, 3, 3, 3, 3, 3, 4, 4)),
  .Names = c("timestamp", "user"),
  row.names = c("220667", "331481", "422653", "629430", "378111", "646137",
                "558638", "151641", "599370", "482750"),
  class = "data.frame")   

(x <- xts(Data$user, as.Date(Data$timestamp)))
#            [,1]
# 2013-03-06    1
# 2014-07-06    2
# 2014-07-31    2
# 2014-08-09    3
# 2014-08-14    3
# 2014-08-14    3
# 2014-08-16    3
# 2014-08-19    3
# 2014-12-28    4
# 2015-01-17    4
x[index(x)]  # Different (wrong) data. Why?
#            [,1]
# 2013-03-06    1
# 2014-07-06    2
# 2014-07-31    2
# 2014-08-09    3
# 2014-08-14    3
# 2014-08-14    3
# 2014-08-14    3
# 2014-08-14    3
# 2014-08-16    3
# 2014-08-19    3
# 2014-12-28    4
# 2015-01-17    4
packageVersion("xts")
# [1] '0.11.0'

These results look correct. There are duplicate values for 2014-08-14, but that isn't obviously incorrect. The index contains duplicate values for that date, and the i in the subset also contains duplicate values for that date. So you could argue that all observations with that date should be returned for each value of that date in i.

In situations where behavior is unclear, I prefer to defer to consistent behavior with zoo. That's not possible in this case, because zoo does not attempt to handle objects with duplicate index values. For example:

z <- as.zoo(x)
z[index(z)]
# 2013-03-06 2014-07-06 2014-07-31 2014-08-09 2014-08-14 2014-08-14 2014-08-16 
#          1          2          2          3          3          3          3 
# 2014-08-19 2014-12-28 2015-01-17 
#          3          4          4 
# Warning message:
# In zoo(rval, index(x)[i]) :
#   some methods for "zoo" objects do not work if the index entries in 'order.by' are not unique

We may consider what zoo does when you subset a zoo object with and i that contains duplicate values. Although we still need to keep in mind that zoo may behave this way because it doesn't try to handle objects with duplicate index values.

z[-5,][index(z)]  # Remove one of the 2014-08-14 rows
# 2013-03-06 2014-07-06 2014-07-31 2014-08-09 2014-08-14 2014-08-16 2014-08-19 
#          1          2          2          3          3          3          3 
# 2014-12-28 2015-01-17 
#          4          4