pboesu / rucrdtw

R Bindings for the UCR Suite for fast time series subsequence search
Other
15 stars 4 forks source link

rucrdtw for distance matrix computation #7

Closed pfistfl closed 1 year ago

pfistfl commented 7 years ago

Hi,

I was trying to use rucrdtw in order to compute distance matrices (for use in knn etc.) You do not export any functions that could be used explicitly for this but the _vv functions should do the trick. Is there anything that would speak against doing that?

One thing I came across is the following:

library(rucrdtw)
set.seed(12444)
x = rnorm(10^2)
y = rnorm(10^2)

eucl = function(x, y) {sqrt(sum((x - y)^2))}
eucl(x, y)
# 14.54399

rucrdtw::ucred_vv(x, y)
#  14.74781

How come there is such a huge discrepancy between the results?

pboesu commented 7 years ago

Hi,

thanks for pointing out this inconsistency. My first thought, was that this is due to the fact that the reported distances are calculated only after Z-normalization (which isn't obvious from the documentation, my apologies), but that doesn't appear to account for the difference in the distances

library(rucrdtw)
set.seed(12444)
x = rnorm(10^2)
y = rnorm(10^2)

eucl = function(x, y) {sqrt(sum((x - y)^2))}
eucl(x, y)
# 14.54399

#Z-normalisation
xn <- (x - mean(x)) / sd(x)
yn <- (y - mean(y)) / sd(y)

eucl(xn, yn)
#[1] 14.67389

rucrdtw::ucred_vv(x, y)
#  14.74781

I will try to have a closer look at the Rcpp code asap and see if I can figure out what is going on.

Not to distract from a potential bug here, but generally the UCR methods aren't necessarily the best way to compute full distance matrices. The optimizations in the algorithm are directed at sequential search problems, where one is solely interested in the best match. The enormous speed gains come from terminating most pairwise comparisons prematurely, i.e. the corresponding distances are never fully computed (this is very nicely explained in the original paper by Rakthanmanon et al. ).

pfistfl commented 7 years ago

Thanks a lot for your answer!

I think I get the point on early abandoning.

The question would thus be whether rucrdtw::ucrdtw_vv()is still faster/more memory efficient than dtw::dtw().

Did you do any comparisons towards whether dtw::dtw() returns values close to ucrdtw::ucrdtw_vv()?

pboesu commented 7 years ago

Continuing from your example, benchmarking does show that ucrdtw_vv is considerably faster.

xm <- matrix(xn, nrow = 1, byrow = T)
ym <- matrix(yn, nrow = 1, byrow = T)

rbenchmark::benchmark(
    ucrdtw = rucrdtw::ucrdtw_vv(x, y, 0.1),
    dtwdtw = dtw::dtw(xn, yn, window.type = "sakoechiba", window.size= 10, distance.only = TRUE),
    dtwDist = dtw::dtwDist(xm, ym, window.type = "sakoechiba", window.size= 10),
    replications = 1000)

#    test replications elapsed relative user.self sys.self user.child sys.child
#3 dtwDist         1000   3.383   70.479     3.151    0.185          0         0
#2  dtwdtw         1000   2.994   62.375     2.780    0.191          0         0
#1  ucrdtw         1000   0.048    1.000     0.043    0.005          0         0

However, the reported distances are not similar at all, which is really worrying. The unit tests I wrote for the rucrdtw-package were all based on whether or not a known target sequence is recovered in a search. I did not even consider looking at the absolute values of the distance metrics, given my work on the package was limited to writing the R bindings for published C++ code.

I will have to carefully go through all the code to better understand how much of this is due to the normalisations and/or other tricks used to exploit early abandoning. It may take me a while until I find the time to do this.

pfistfl commented 7 years ago

Hi,

I am not sure whether I am not comparing apples and oranges, as I do not yet fully understand the different step.patterns but this does not seem to be a problem in your code.

UCR_DTW and UCR_ED from the command line tool report the same distances as your package.

Again, I think I am comparing apples and oranges because I do not really understand the output of the following comparisons:

library(rucrdtw)
library(dtw)
set.seed(12444)
x = runif(10^2)
x = (x - mean(x)) / sd(x) # z-normalized
y = runif(10^2)
y = (y - mean(y)) / sd(y) # z-normalized

# Compare for different window sizes:
sapply(seq(from = 0, to = .1, by = .01), function (window) {
  a = dtw(x, y, step.pattern = symmetric1, window.type = "sakoechiba",
    window.size = floor(window*length(x)), distance.only = FALSE, method = "Euclidean")
  c(
    "window" = window,
    "dtw_dist" = sqrt(a$distance), #dtw::dtw()$distance
    "ucrdtw" = ucrdtw_vv(x, y, skip = FALSE, dtwwindow = window)$distance, # ucrdtw distance
    "dtw_path_index" = sqrt(sum((x[a$index1] - y[a$index2])^2)), # euclidean distance of warping path
    "ucr_ed_path_index" = ucred_vv(x[a$index1], y[a$index2])$distance, # ucr_ed of warping path
    "euclidean" = sqrt(sum((x-y)^2)) # simple euclidean distance
  )
})
pfistfl commented 7 years ago

Sorry for writing again, I have another problem:

I am working on windows and therefore I can not really debug this, but when I run the following code:

gc()

library(rucrdtw)
set.seed(12444)
x = rnorm(10^4)
y = rnorm(10^4)
for (i in 1:10^4) ucrdtw_vv(x, y, dtwwindow = 0.01)

My rsession.exe starts to eat up more and more memory. (1.8GB of RAM each time I run the for loop). This slowly fills up the RAM and finally crashes the R-Session (or results in the following error:).

Error in ucrdtw_vv(x, y, dtwwindow = 0.01) : ERROR : Memory can't be allocated!!!

Even calling gc(reset = TRUE) does not free up the memory. Note, that no objects other than x and y are saved.

The only thing that helps for me is restarting the R-Session.

Is this a personal problem? Should I post this in a separate issue?

pboesu commented 7 years ago

No apologies required - thank you for thoroughly test-riding rucrdtw!

I can reproduce this behaviour on Windows and Linux, so it looks like you found a memory leak. I've opened a new issue for this (https://github.com/pboesu/rucrdtw/issues/8), and will try to fix this once I am back at work next week.

instantkaffee commented 1 year ago

Is this fixed meanwhile? The last post was 2017? Is this library still maintained actively?

pboesu commented 1 year ago

The memory leak was fixed with PR https://github.com/pboesu/rucrdtw/issues/8#issuecomment-336476813 . Otherwise I am maintaining the package so that it continues to pass CRAN checks, but am not planning on adding or modifying functionality.