Closed joshuaulrich closed 5 years ago
I did some more checking.
# remove rownames from data
rownames(Data) <- NULL
test_xts_date <- xts(x=Data[, -1], order.by = as.Date(Data$timestamp))
# check if identical xts and xts with index
identical(test_xts_date, test_xts_date[index(test_xts_date)])
[1] FALSE
test_date <- data.frame(timestamp = as.character(index(test_xts_date)), user = coredata(test_xts_date), stringsAsFactors = FALSE)
Data2 <- data.frame(timestamp = as.character(as.Date(as.POSIXct(Data$timestamp))), user = Data$user, stringsAsFactors = FALSE)
# Check if input equals output
identical(Data2, test_date)
[1] TRUE
So the issue is purely when indexing / selecting records within the xts object. Now if you look at the outcome of the xts with index compared to the input data, you can see that the first 2 records disappeared.
test_xts_date[index(test_xts_date)])
[,1]
2014-07-31 2
2014-08-09 3
2014-08-14 3
2014-08-14 3
2014-08-14 3
2014-08-14 3
2014-08-16 3
2014-08-19 3
2014-12-28 4
2015-01-17 4
But the issue is not as such with the index. If I change the index with the dates from the Data$timestamp you have the same issue.
dates <- as.Date(Data$timestamp)
test_xts_date[dates]
[,1]
2014-07-31 2
2014-08-09 3
2014-08-14 3
2014-08-14 3
2014-08-14 3
2014-08-14 3
2014-08-16 3
2014-08-19 3
2015-01-17 4
2015-01-17 4
I also checked what happened if the timestamp was not passed as a date, but as Posixct. Then it works correctly, but only if there are no duplicate timestamps. As soon as you introduce a duplicate record, the issue appears.
I will do some more tests later.
I have narrowed the issue down to the function fill_window_dups_rev
. When window_idx
is called to subset the xts object, all the firstlast objects and the tmp object just before calling fill_window_dups_rev
contain 10 records. The res object after calling fill_window_dups_rev
suddenly contains 12 records with the first 2 as NA and index 5 and 6 (the ones with date "2014-08-14") are duplicated in the res object. This is passed into the last filling of firstlast and hence you get duplicated records for index position 5 and 6 and index positions 1 and 2 are dropped.
debugging:
if(usr_idx && !is.null(firstlast)) {
# Translate from user .index to xts index
# We get back upper bound of index as per findInterval
tmp <- base_idx[firstlast]
res <- .Call("fill_window_dups_rev", tmp, .index(x), PACKAGE = "xts")
firstlast <- rev(res)
}
result of the objects:
tmp
[1] 1 2 3 4 6 6 7 8 9 10
res
[1] 10 9 8 7 6 5 6 5 4 3 NA NA
firstlast
[1] NA NA 3 4 5 6 5 6 7 8 9 10
Which when the NA's are removed from the firstlast object the outcome results in the issue specified above.
Thanks for the detailed investigation @pverspeelt! The fill_window_dups_rev()
function was introduced as part of the fix for #251. I used the previous commit (1d707c57042203b7e53d7a8c5bd0ad173d9840d6) to check whether this was a regression caused by the fix, or if it existed prior to the fix as well.
The output below shows the results prior to introducing the fill_window_dups_rev()
function.
require(xts)
Data <- structure(
list(
timestamp = c("2013-03-06 01:00:00", "2014-07-06 21:00:00",
"2014-07-31 23:00:00", "2014-08-09 17:00:00",
"2014-08-14 20:00:00", "2014-08-14 22:00:00",
"2014-08-16 15:00:00", "2014-08-19 02:00:00",
"2014-12-28 18:00:00", "2015-01-17 17:00:00"),
user = c(1, 2, 2, 3, 3, 3, 3, 3, 4, 4)),
.Names = c("timestamp", "user"),
row.names = c("220667", "331481", "422653", "629430", "378111", "646137",
"558638", "151641", "599370", "482750"),
class = "data.frame")
(x <- xts(Data$user, as.Date(Data$timestamp)))
# [,1]
# 2013-03-06 1
# 2014-07-06 2
# 2014-07-31 2
# 2014-08-09 3
# 2014-08-14 3
# 2014-08-14 3
# 2014-08-16 3
# 2014-08-19 3
# 2014-12-28 4
# 2015-01-17 4
x[index(x)] # Different (wrong) data. Why?
# [,1]
# 2013-03-06 1
# 2014-07-06 2
# 2014-07-31 2
# 2014-08-09 3
# 2014-08-14 3
# 2014-08-14 3
# 2014-08-14 3
# 2014-08-14 3
# 2014-08-16 3
# 2014-08-19 3
# 2014-12-28 4
# 2015-01-17 4
packageVersion("xts")
# [1] '0.11.0'
These results look correct. There are duplicate values for 2014-08-14
, but that isn't obviously incorrect. The index contains duplicate values for that date, and the i
in the subset also contains duplicate values for that date. So you could argue that all observations with that date should be returned for each value of that date in i
.
In situations where behavior is unclear, I prefer to defer to consistent behavior with zoo. That's not possible in this case, because zoo does not attempt to handle objects with duplicate index values. For example:
z <- as.zoo(x)
z[index(z)]
# 2013-03-06 2014-07-06 2014-07-31 2014-08-09 2014-08-14 2014-08-14 2014-08-16
# 1 2 2 3 3 3 3
# 2014-08-19 2014-12-28 2015-01-17
# 3 4 4
# Warning message:
# In zoo(rval, index(x)[i]) :
# some methods for "zoo" objects do not work if the index entries in 'order.by' are not unique
We may consider what zoo does when you subset a zoo object with and i
that contains duplicate values. Although we still need to keep in mind that zoo may behave this way because it doesn't try to handle objects with duplicate index values.
z[-5,][index(z)] # Remove one of the 2014-08-14 rows
# 2013-03-06 2014-07-06 2014-07-31 2014-08-09 2014-08-14 2014-08-16 2014-08-19
# 1 2 2 3 3 3 3
# 2014-12-28 2015-01-17
# 4 4
Given a xts object with a date index that contains duplicate dates, incorrect results are returned if the object is subset with its own date index.
Thanks to scs for their Stackoverflow question.
Session Info