strumke / hsic_python

Question on calculating dHSIC #1

Closed lizhenstat closed 3 years ago

lizhenstat commented 3 years ago

Hi, thanks for your code for calculating HSIC on multiple groups. I have one question about Definition 2.6 in the dHSIC paper.

I do not understand the meaning of the subscripts of x, such as those in

[image: dHSIC-variable]

Can you give me some insight into this? I know M_q(n) is the q-fold Cartesian product of the set {1,2,...,n}.

The related part of the code is as follows:

import numpy

def dHSIC_calc(K_list):
    """
    Calculate the HSIC estimator in the general case d > 2, as in
    [2] Definition 2.6
    """
    if not isinstance(K_list, list):
        K_list = list(K_list)

    n_k = len(K_list) # d

    length = K_list[0].shape[0] # n
    term1 = 1.0
    term2 = 1.0
    term3 = 2.0/length

    for j in range(0, n_k):
        K_j = K_list[j]
        term1 = numpy.multiply(term1, K_j)
        term2 = 1.0/length/length*term2*numpy.sum(K_j)
        term3 = 1.0/length*term3*K_j.sum(axis=0)

    term1 = numpy.sum(term1)
    term3 = numpy.sum(term3)
    dHSIC = (1.0/length)**2*term1+term2-term3
    return dHSIC

My question is that term1 is added n^2 times, term2 is added n^(2d) times, and term3 is added n^(d+1) times. However, in the code, all these terms are added 2d times, right? Can you give me some insight into this? Thanks a lot and best wishes.

lizhenstat commented 3 years ago

I got it: the i_{1}, i_{2}, ... do not have a specific meaning; they just show whether the subscripts are the same or not. Thanks.

term1 is added n^2 times, since there is a sum over the whole matrix inside the calculation. term2 and term3 are similar.

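Besides, taking the product of the sums is equivalent to summing the products: an index that appears in only one factor can be summed inside that factor, so the code never has to loop over the full Cartesian product.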
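For anyone who wants to check this numerically, here is a minimal sketch that compares the factorized computation with a literal loop over M_2(n), M_{2d}(n) and M_{d+1}(n). The function names `dhsic_factorized` and `dhsic_brute_force`, the Gaussian kernel with bandwidth 1, and the random data are assumptions made for illustration, not part of the repository. The index pattern in the brute-force loops follows my reading of Definition 2.6, and the comparison relies on the kernel matrices being symmetric.

```python
import itertools
import numpy as np

def dhsic_factorized(K_list):
    # Same arithmetic as dHSIC_calc: sums are taken per matrix, then multiplied.
    n = K_list[0].shape[0]
    term1 = np.ones((n, n))
    term2 = 1.0
    term3 = np.full(n, 2.0 / n)
    for K in K_list:
        term1 = term1 * K                  # elementwise product over j
        term2 = term2 * K.sum() / n**2     # product of full-matrix sums
        term3 = term3 * K.sum(axis=0) / n  # product of column sums
    return term1.sum() / n**2 + term2 - term3.sum()

def dhsic_brute_force(K_list):
    # Literal sums over the Cartesian products of {0, ..., n-1}.
    n, d = K_list[0].shape[0], len(K_list)
    idx = range(n)
    t1 = sum(np.prod([K[a, b] for K in K_list])
             for a, b in itertools.product(idx, repeat=2))
    t2 = sum(np.prod([K_list[j][i[2 * j], i[2 * j + 1]] for j in range(d)])
             for i in itertools.product(idx, repeat=2 * d))
    t3 = sum(np.prod([K_list[j][i[0], i[j + 1]] for j in range(d)])
             for i in itertools.product(idx, repeat=d + 1))
    return t1 / n**2 + t2 / n**(2 * d) - 2.0 * t3 / n**(d + 1)

# Small symmetric Gaussian kernel matrices (illustrative data only).
rng = np.random.default_rng(0)
n, d = 4, 3
Ks = []
for _ in range(d):
    x = rng.normal(size=(n, 1))
    Ks.append(np.exp(-0.5 * (x - x.T) ** 2))

print(dhsic_factorized(Ks), dhsic_brute_force(Ks))
```

The brute-force version visits n^2, n^(2d) and n^(d+1) index tuples, so it is only practical for very small n and d; the factorized version touches each matrix once per term.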

dHSIC_calc is similar to the dHSIC test code in R:

  ###
  # Compute dHSIC
  ###
  ptm <- proc.time()
  term1 <- 1
  term2 <- 1
  term3 <- 2/len
  for (j in 1:d){
    term1 <- term1*K[[j]]
    term2 <- 1/len^2*term2*sum(K[[j]])    
    term3 <- 1/len*term3*colSums(K[[j]])
  }
  term1 <- sum(term1)
  term3 <- sum(term3)
  dHSIC=1/len^2*term1+term2-term3
  timeHSIC <- as.numeric((proc.time() - ptm)[1])
strumke commented 3 years ago

Hi! I'm sorry I didn't get back to you in time. In case it's still useful, the first level subscripts represent the ith vector x_i, and the second level subscripts represent the coordinate of the vector. Happy correlating!
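In case it helps later readers, here is a minimal sketch of how the indices map onto arrays: each of the d variables gets its own n x n kernel matrix, and entry [a, b] of matrix j compares observations a and b of variable j. The helper `gaussian_gram`, the fixed bandwidth, and the random samples are assumptions for illustration only; the repository may build its kernel matrices differently.

```python
import numpy as np

def gaussian_gram(x, bandwidth=1.0):
    # Rows of x are observations; a 1-D input is treated as n scalar observations.
    x = np.asarray(x, dtype=float)
    if x.ndim == 1:
        x = x[:, None]
    # Squared Euclidean distance between every pair of observations.
    sq_dist = np.sum((x[:, None, :] - x[None, :, :]) ** 2, axis=-1)
    return np.exp(-sq_dist / (2.0 * bandwidth ** 2))

rng = np.random.default_rng(1)
n = 5
# Three variables observed on the same n samples; the second is two-dimensional.
samples = [rng.normal(size=n), rng.normal(size=(n, 2)), rng.normal(size=n)]

# One kernel matrix per variable, in the list form that dHSIC_calc expects.
K_list = [gaussian_gram(x) for x in samples]
```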