rusty1s / pytorch_sparse

PyTorch Extension Library of Optimized Autograd Sparse Matrix Operations
MIT License

how to implement cdist #378

Open KukumavMozolo opened 3 months ago

KukumavMozolo commented 3 months ago

Hi there, thanks a million for this library. I am trying to figure out how to compute distances between every row (or column) of two matrices, e.g. like torch.cdist.
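
For reference, this is the dense behaviour I want to reproduce (just to pin down shapes and semantics):

import torch

# dense reference: all pairwise Euclidean distances between rows of a and rows of b
a = torch.rand(4, 8)
b = torch.rand(6, 8)
print(torch.cdist(a, b).shape)  # torch.Size([4, 6])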

Here is one way:

import torch
from torch_sparse import SparseTensor, cat

def sparse_cdist(a: SparseTensor, b: SparseTensor):
    with torch.no_grad():
        a_repeated = cat([a] * a.size(0), dim=0)
        b_repeated = cat(
            [cat([b[i, :]] * b.size(0), dim=0) for i in range(b.size(0))], dim=0
        )
        distances = sparse_distance(a_repeated, b_repeated)
        distances.requires_grad = False
    return distances.view((a.size(0), b.size(0)))

and another:


def sparse_cdist2(a: SparseTensor, b: SparseTensor):
    with torch.no_grad():
        distances = torch.zeros(
            (a.size(0), b.size(0)),
            device=a.device(),
            dtype=a.dtype(),
            requires_grad=False,
        )
        for i in range(a.size(0)):
            for j in range(b.size(0)):
                distances[i, j] = sparse_distance(a[i, :], b[j, :])
    return distances

and another:


def sparse_cdist3(a: SparseTensor, b: SparseTensor):
    with torch.no_grad():
        distances = torch.zeros(
            (a.size(0), b.size(0)),
            device=a.device(),
            dtype=a.dtype(),
            requires_grad=False,
        )
        if a.size(0) <= b.size(0):
            for i in range(a.size(0)):
                # build the index on the same device as the inputs
                idx = torch.tensor([i], device=a.device()).expand(b.size(0))
                a_repeated = a.index_select(0, idx)
                distances[i, :] = sparse_distance(a_repeated, b)
        else:
            for j in range(b.size(0)):
                idx = torch.tensor([j], device=b.device()).expand(a.size(0))
                b_repeated = b.index_select(0, idx)
                distances[:, j] = sparse_distance(a, b_repeated)
    return distances

where sparse_distance is defined as follows

def sparse_distance(a: SparseTensor, b: SparseTensor):
    # row-wise Euclidean distance: sqrt(sum_j (a_ij - b_ij)^2) per row
    c = a + b.mul_nnz(torch.tensor(-1, device=b.device()), "coo")
    c = c * c
    c = reduction(c, dim=1, reduce="sum")
    return torch.sqrt(c + 1e-9)  # small epsilon keeps the gradient finite at zero
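
As a quick sanity check (a rough sketch on CPU, building the SparseTensors from small dense matrices and assuming sparse_distance works as above), the row-wise output should match a dense computation:

import torch
from torch_sparse import SparseTensor

a_dense = torch.tensor([[1.0, 0.0, 2.0], [0.0, 3.0, 0.0]])
b_dense = torch.tensor([[0.0, 1.0, 0.0], [4.0, 0.0, 5.0]])
a = SparseTensor.from_dense(a_dense)
b = SparseTensor.from_dense(b_dense)

expected = torch.sqrt(((a_dense - b_dense) ** 2).sum(dim=1) + 1e-9)
assert torch.allclose(sparse_distance(a, b), expected)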

The first one has the disadvantage that it creates huge intermediate matrices and eats a lot of memory, while the second one doesn't benefit from GPU parallelism. The third seems okay-ish, but maybe there is an even better solution? Ideally one would only load matrices a and b onto the GPU and reserve additional memory just for the result. Maybe somebody has an idea how to do that with what is currently possible in torch_sparse? Or would it be necessary to write a dedicated CUDA kernel for that? Any suggestions are very welcome.
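
One more idea I have not tried: use the expansion $\|x - y\|^2 = \|x\|^2 + \|y\|^2 - 2\langle x, y\rangle$, so that only the $N \times M$ result plus two vectors of squared row norms need to be allocated. A rough, untested sketch, assuming torch_sparse's matmul handles the sparse-sparse product and reusing the reduction helper from sparse_distance above:

import torch
from torch_sparse import SparseTensor, matmul

def sparse_cdist_gram(a: SparseTensor, b: SparseTensor, eps: float = 1e-9):
    # squared row norms, using the same elementwise square + reduction as sparse_distance
    a_sq = reduction(a * a, dim=1, reduce="sum")  # shape [N]
    b_sq = reduction(b * b, dim=1, reduce="sum")  # shape [M]
    # inner products <a_i, b_j>; the Gram matrix is typically dense, so materialize it
    ab = matmul(a, b.t()).to_dense()  # shape [N, M]
    d2 = a_sq.view(-1, 1) + b_sq.view(1, -1) - 2 * ab
    # clamp guards against tiny negative values caused by cancellation
    return torch.sqrt(torch.clamp(d2, min=0.0) + eps)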

rusty1s commented 3 months ago

Interesting problem. The best way to implement this is indeed a custom CUDA kernel, which lets you drop the for-loops of options 2/3 and avoid the memory blow-up of option 1. I think option 3 is okay if you don't want to go down the rabbit hole of custom CUDA.

KukumavMozolo commented 2 months ago

So I am currently busy implementing a CUDA kernel that computes the Euclidean distances between two matrices, similar to torch.cdist.

Unfortunately I ran into a problem while implementing the backward pass, when computing the local gradients:

Some definitions:

Let $X \in \mathbb{R}^{N \times C}$ and $Y \in \mathbb{R}^{M \times C}$. Then $\mathrm{cdist}(X,Y) \in \mathbb{R}^{N \times M}$ with $\mathrm{cdist}(X,Y)_{k,l} = \sqrt{\sum_j (X_{k,j} - Y_{l,j})^2}$.

The derivative: Here is the derivative of cdist with respect to some entry of $X$: $\frac{\partial\, \mathrm{cdist}(X,Y)_{k,l}}{\partial X_{k,n}} = \frac{X_{k,n} - Y_{l,n}}{\sqrt{\sum_j (X_{k,j} - Y_{l,j})^2}}$

Then there are 4 cases that need to be treated differently with sparse matrices (ignoring for now that the Euclidean distance is not differentiable at zero):

Case 1: $X_{k,n} \neq 0$ and $Y_{l,n} \neq 0 \;\rightarrow\; \frac{\partial\, \mathrm{cdist}(X,Y)_{k,l}}{\partial X_{k,n}} = \frac{X_{k,n} - Y_{l,n}}{\sqrt{\sum_j (X_{k,j} - Y_{l,j})^2}}$

Case 2: $X_{k,n} \neq 0$ and $Y_{l,n} = 0 \;\rightarrow\; \frac{\partial\, \mathrm{cdist}(X,Y)_{k,l}}{\partial X_{k,n}} = \frac{X_{k,n}}{\sqrt{\sum_j (X_{k,j} - Y_{l,j})^2}}$

Case 3: $X_{k,n} = 0$ and $Y_{l,n} = 0 \;\rightarrow\; \frac{\partial\, \mathrm{cdist}(X,Y)_{k,l}}{\partial X_{k,n}} = 0$

Case 4: $X_{k,n} = 0$ and $Y_{l,n} \neq 0 \;\rightarrow\; \frac{\partial\, \mathrm{cdist}(X,Y)_{k,l}}{\partial X_{k,n}} = \frac{-Y_{l,n}}{\sqrt{\sum_j (X_{k,j} - Y_{l,j})^2}}$

The problematic case is case 4! Here, when computing the derivatives, the sparsity pattern of X changes, because where X was zero before, the gradient introduces a new nonzero value. This would require also changing the size and contents of rowPtr and colIndex together with the value array of the CSR matrix. Hence my question: how could this be solved?
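
A tiny dense check with torch.cdist illustrates case 4: the gradient with respect to an entry of $X$ that is zero is still nonzero whenever the corresponding entry of $Y$ is nonzero:

import torch

# X[0, 0] = 0 while Y[0, 0] = 2, yet d cdist / d X[0, 0] != 0 (case 4)
X = torch.tensor([[0.0, 1.0]], requires_grad=True)
Y = torch.tensor([[2.0, 0.0]])
torch.cdist(X, Y).sum().backward()
print(X.grad)  # tensor([[-0.8944,  0.4472]]), i.e. (X - Y) / sqrt(5)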

Here is the autograd boilerplate I am currently using:

using torch::autograd::AutogradContext;
using torch::autograd::Variable;
using torch::autograd::variable_list;

class SparseCdist : public torch::autograd::Function<SparseCdist> {
public:
  static variable_list forward(
    AutogradContext *ctx,
    torch::Tensor a_rowptr_data,
    torch::Tensor a_col_data,
    torch::Tensor a_value_data,
    torch::Tensor b_rowptr_data,
    torch::Tensor b_col_data,
    torch::Tensor b_value_data,
    int dim_a,
    int dim_b
    ) {
    auto out = sparse_cdist(a_rowptr_data, a_col_data, a_value_data, b_rowptr_data, b_col_data, b_value_data, dim_a, dim_b);
    ctx->saved_data["dim_a"] = dim_a;
    ctx->saved_data["dim_b"] = dim_b;
    ctx->save_for_backward({a_rowptr_data, a_col_data, a_value_data, b_rowptr_data, b_col_data, b_value_data, out});
    return {out};
  }

  static variable_list backward(AutogradContext *ctx, variable_list grad_outs) {
    auto dim_a = ctx->saved_data["dim_a"].toInt();
    auto dim_b = ctx->saved_data["dim_b"].toInt();
    auto grad_out = grad_outs[0];
    auto saved = ctx->get_saved_variables();
    auto a_rowptr_data = saved[0], a_col_data = saved[1], a_value_data = saved[2], b_rowptr_data = saved[3],
         b_col_data = saved[4], b_value_data = saved[5], distance = saved[6];  // out was saved last

    auto grad_value_a = Variable();
    if (torch::autograd::any_variable_requires_grad({a_value_data})){
      std::cout << "grad_outs is: " << grad_out;
      grad_value_a = sparse_bw_cdist(a_rowptr_data, a_col_data, a_value_data, b_rowptr_data, b_col_data, b_value_data, grad_out, distance, dim_a, dim_b);
    }

    auto grad_value_b = Variable();
    if (torch::autograd::any_variable_requires_grad({b_value_data})){
      std::cout << "grad_outs is: " << grad_out;
      grad_value_b = sparse_bw_cdist(b_rowptr_data, b_col_data, b_value_data,a_rowptr_data, a_col_data, a_value_data, grad_out, distance, dim_b, dim_a);
    }    
    return {Variable(), Variable(), grad_value_a,
            Variable(), Variable(), grad_value_b, Variable(), Variable()};
  }
};

What I would want to do is overwrite a_rowptr_data and a_col_data, but it is unclear to me how this can be done using this AutogradContext concept. Help would be much appreciated. Also, if this starts to work, would there be interest in integrating cdist into this project?

KukumavMozolo commented 2 months ago

Maybe one workaround is to explicitly introduce zeros into X where X is implicitly zero and Y is nonzero?
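
Something along those lines, perhaps (a rough, untested sketch that inserts an explicit zero in every row of X for every column that is nonzero anywhere in Y, so the gradient pattern of X never has to grow; it trades memory for a fixed sparsity pattern):

import torch
from torch_sparse import SparseTensor

def pad_with_explicit_zeros(x: SparseTensor, y: SparseTensor) -> SparseTensor:
    # columns that are nonzero anywhere in y
    y_cols = torch.unique(y.storage.col())
    n_rows = x.size(0)
    # one explicit zero per (row of x, nonzero column of y) pair
    zero_row = torch.arange(n_rows, device=y_cols.device).repeat_interleave(y_cols.numel())
    zero_col = y_cols.repeat(n_rows)
    zero_val = torch.zeros(zero_row.numel(), dtype=x.dtype(), device=y_cols.device)
    padded = SparseTensor(
        row=torch.cat([x.storage.row(), zero_row]),
        col=torch.cat([x.storage.col(), zero_col]),
        value=torch.cat([x.storage.value(), zero_val]),
        sparse_sizes=x.sparse_sizes(),
    )
    # duplicates (where x already has an entry) collapse back to the original value
    return padded.coalesce(reduce="sum")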