Order contingency in hstore?

mbannert commented 6 years ago

@HomoCodens I think we need to pay close attention to this discussion I just started:

https://stackoverflow.com/questions/50287646/how-to-reproduce-contingency-of-order-in-hstore

I am afraid we do not do enough to ensure the correct order of our R objects. I am not aware of any actual ordering problems and our unit tests pass, but it still makes me feel uneasy.

Thinking about introducing an additional sort here... but I don't want to because of all the ordering and type cast costs...

From readTimeSeries:

{
        if(freq == 4){
          period <- (p -1) / 3 + 1
        } else if(freq == 2) {
          period <- ifelse(p == 1,1,2)
        } else if(freq == 12){
          period <- p
        } else if(freq == 1){
          period <- NULL  
        }
        # create the time series object but suppress the warning of creating NAs
        # when transforming text NAs to numeric NAs
        stats::ts(ts_data,
           start=c(y,period),
           frequency = freq)
      }

HomoCodens commented 6 years ago

I absolutely agree.

I don't think we have immediate cause for concern, as the data don't get touched after being written to the DB, so in practice the order is preserved. That does not mean this will remain the case forever, though.

I don't have a working copy right now, but maybe we can do something with these d_chars:

# R internals :) 
# only convert the first element to date cause this is costly for the 
# entire vector !! the character vector (d_chars) is sorted, too,
# which is all we need for zoo !!!
d <- as.Date(d_chars[1])
y <- as.numeric(format(d,"%Y"))
p <- as.numeric(format(d,"%m"))

Not sure if we can properly sort them without converting to Date though.

A heavier approach would be to change how we represent the data in the database. Since we work with ts exclusively (although there is also code concerning irregular time series in readTimeSeries?), we could store them as something like this:

{
    "start": 1988.75,
    "frequency": 12,
    "data": [1, 2, 3, null, 4, ...]
}

either as a JSON string or postgres JSON. That would preserve the order.
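
To illustrate the idea, here is a minimal base R sketch of the round trip between a ts and such a start/frequency/data record. The helpers `ts_to_record` and `record_to_ts` are hypothetical names, not part of timeseriesdb, and the actual JSON serialization (where NA would become null) is left out.

```r
# Hypothetical helpers, not part of timeseriesdb.
# Turn a ts into the proposed start/frequency/data layout.
ts_to_record <- function(x) {
  tsp_x <- stats::tsp(x)          # c(start, end, frequency)
  list(
    start = tsp_x[1],             # decimal start, e.g. 1988.75 for Oct 1988
    frequency = tsp_x[3],
    data = as.numeric(x)          # values in positional order, NA kept
  )
}

# Rebuild the ts; order is implied by position, not by key.
record_to_ts <- function(rec) {
  stats::ts(rec$data, start = rec$start, frequency = rec$frequency)
}

x <- stats::ts(c(1, 2, 3, NA, 4), start = c(1988, 10), frequency = 12)
rec <- ts_to_record(x)
y <- record_to_ts(rec)
```

Because the observations live in an array, their order no longer depends on how the key-value pairs come back from the database.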

HomoCodens commented 6 years ago

Actually, looks like order works just fine on date strings (as long as they are properly zero-padded, I assume):

dates <- c("2017-01-01", "2017-02-01", "2017-03-01", "2017-04-01")
order(dates)
dates[order(dates)]
dates2 <- c(dates[2], dates[1], dates[3:4])
dates2[order(dates2)]

> dates <- c("2017-01-01", "2017-02-01", "2017-03-01", "2017-04-01")
> order(dates)
[1] 1 2 3 4
> dates[order(dates)]
[1] "2017-01-01" "2017-02-01" "2017-03-01" "2017-04-01"
> dates2 <- c(dates[2], dates[1], dates[3:4])
> dates2[order(dates2)]
[1] "2017-01-01" "2017-02-01" "2017-03-01" "2017-04-01"
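
The zero-padding assumption can be checked with a small sketch. The example vectors are made up for illustration, and `method = "radix"` is used only so that the comparison is byte-wise and does not depend on the session locale.

```r
# Made-up example data: same three months, with and without zero-padding.
padded   <- c("2017-10-01", "2017-02-01", "2017-01-01")
unpadded <- c("2017-10-01", "2017-2-01", "2017-1-01")

# Byte-wise (C locale) ordering of the character vectors.
sorted_padded   <- padded[order(padded, method = "radix")]
sorted_unpadded <- unpadded[order(unpadded, method = "radix")]

# Ordering by the parsed dates instead of the raw strings.
sorted_by_date <- unpadded[order(as.Date(unpadded))]

print(sorted_padded)    # chronological
print(sorted_unpadded)  # October ends up before February
print(sorted_by_date)   # chronological again
```

So string ordering is only safe as long as the keys are written in the fixed-width YYYY-MM-DD form.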

mbannert commented 6 years ago

By date string you mean a standard-format date that is still a character? If we're sure that everything is covered, this is perhaps a good option. I just wonder whether order is expensive. Plus, given that there's no immediate need to react, we should maybe rather think about a more comprehensive overhaul. Nevertheless, I would like to bring this version to CRAN soon.

HomoCodens commented 6 years ago

Exactly.

Ordering 6012 such strings:

> dates <- seq(as.Date("2000-01-01"), as.Date("2500-12-31"), by = "1 month")
> dates <- as.character(dates)
> dates <- dates[sample(length(dates))]
> microbenchmark(order(dates), times = 10)
Unit: milliseconds
         expr      min      lq     mean   median       uq      max neval
 order(dates) 55.54707 56.8505 60.62766 61.12514 63.68828 67.09772    10

Ordering 300 dates 100 times, which is probably closer to our scenarios:

> microbenchmark({for(i in 1:100) { order(dates[1:300]) }}, times = 10)
Unit: milliseconds
                                                         expr     min       lq     mean   median       uq
 {     for (i in 1:100) {         order(dates[1:300])     } } 122.043 127.8955 128.4337 129.3276 130.2437
      max neval
 133.1075    10
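
If the cost of order is a concern, one option might be to sort only when the keys actually arrive out of order. The following is a minimal sketch, assuming the hstore comes back as a character vector named by zero-padded ISO date strings; `order_if_needed` is a hypothetical helper, not an existing timeseriesdb function.

```r
# Hypothetical helper: `values` is a character vector whose names are
# zero-padded ISO date strings (an assumption about the hstore result).
order_if_needed <- function(values) {
  keys <- names(values)
  # is.unsorted() is a single pass; only pay for order() when required.
  if (is.unsorted(keys)) {
    values <- values[order(keys, method = "radix")]
  }
  values
}

shuffled <- c("2017-02-01" = "2", "2017-01-01" = "1", "2017-03-01" = "3")
sorted <- order_if_needed(shuffled)
print(names(sorted))
```

In the common case where the database returns the keys in order, this adds only the check itself.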

Fun fact as an aside: Looks like Dates do not support 5 digit years yet. Sloppy future-proofing...

mbannert commented 6 years ago

closed for now.

mbannert / timeseriesdb

Order contingency in hstore? #47