tensor-compiler / taco

The Tensor Algebra Compiler (taco) computes sparse tensor expressions on CPUs and GPUs
http://tensor-compiler.org
Other
1.26k stars 189 forks source link

Broken Sparse Matrix Multiplication #121

Open nitish2112 opened 7 years ago

nitish2112 commented 7 years ago

Hi,

When I try to compile a matrix multiplication A = B*C where A, B and C all three are CSR, taco doesn't seem to produce the correct result.

#include <iostream>
#include "taco.h"

using namespace taco;

/*
B:  1  0  0  0
    0  0  2  3
    0  0  0  0

C:  0  4
    5  0
    0  0
    0  0

A:  0  4
    0  0
    0  0
*/

int main(int argc, char* argv[]) {
  // Create formats
  Format csr({Dense,Sparse}, {0,1});
  Format csc({Dense,Sparse}, {1,0});

  // Create tensors
  Tensor<double> A({3,2},   csr);
  Tensor<double> B({3,4},   csr);
  Tensor<double> C({4,2},   csr);

  // Insert data into B and C
  B.insert({0,0}, 1.0);
  B.insert({1,2}, 2.0);
  B.insert({1,3}, 3.0);
  C.insert({0,1}, 4.0);
  C.insert({1,0}, 5.0);

  // Pack data as described by the formats
  B.pack();
  C.pack();
  // Form a matrix multiplication expression
  IndexVar i, j, k;
  A(i,j) = B(i,k) * C(k,j);

  // Compile the expression
  A.compile();

  // Assemble A's indices and numerically compute the result
  A.assemble();
  A.compute();

  std::cout << A << std::endl;
}

When I compile it using taco and run:

% cd apps/matrix_multiplication_csr_csr_csr/build
% cmake .. && make
% ./matrix_multiplication_csr_csr_csr

A0 (3x2) (dense,sparse; 0,1):
dense (0):
  [3]
sparse (1):
  [0, 1, 1, 0]
  []
[]

which is the wrong result.

Also, when I try to set A as CSR, B as CSR and C as CSC, then I get the following result

% cd apps/matrix_multiplication_csr_csr_csc/build
% cmake .. && make
% ./matrix_multiplication_csr_csr_csc
dense (0):
  [3]
sparse (1):
  [0, 2, 4, 6]
  [0, 1, 0, 1, 0, 1]
[0, 4, 0, 0, 0, 0]

Here as we can see A is definitely not is CSR format as there are still zeros in the vals array.

Is there something that I am doing wrong or these are bugs?

Thanks,

Best Regards, Nitish

fredrikbk commented 7 years ago

Dear @nitish2112,

I'm so sorry, the current implementation does produce incorrect code for CSR matrix multiplication. We've worked out how to fix it in theory and I'll implement it first thing after thanksgiving break. (I look forward to code again!) I thought we had an issue on this, but I cannot find it so I'll update the name of this issue.

This is roughly the code we want for compute:

double* restrict w = malloc(A2_size*sizeof(double));
memset(w, 0, A2_size*sizeof(double));

for (int i = 0; i < B0_size; i++) {
  for (int pB = B_pos[i]; pB < B_pos[i+1]; pB++) {
    int k = B_idx[pB];
    for (int pC = C_pos[k]; pC < C_pos[k+1]; pC++) {
      int j = C_idx[pC];
      w[j] += B_vals[pB] * C_vals[pC];
    }
  }
  for (int pA = A_pos[i]; pA < A_pos[i+1]; pA++) {
    int j = A_idx[pA];
    A_vals[pA] = w[j];
    w[j] = 0.0;
  }
}

free(w);

For assembly:

bool* restrict w = malloc(A1_size*sizeof(bool));
memset(w, 0, A1_size*sizeof(bool));
int* restrict wlist = malloc(A1_size*sizeof(int));
int w_size = 0;

int A_idx_size = 1048576;
A_pos = malloc((m+1)*sizeof(int));
A_idx = malloc(A_idx_size*sizeof(int));

A_pos[0] = 0;
for (int i = 0; i < B0_size; i++) {
  for (int pB = B_pos[i]; pB < B_pos[i+1]; pB++) {
    int k = B_idx[pB];
    for (int pC = C_pos[k]; pC < C_pos[k+1]; pC++) {
      int j = C_idx[pC];
      if (!w[j]) {
        wlist[w_size++] = j;
        w[j] = true;
      }
    }
  }

  // Make sure A_idx is large enough
  if (A_idx_size < (A_pos[i]+w_size)) {
    A_idx_size *= 2;
    A_idx = realloc(A_idx, A_idx_size*sizeof(int));
  }

  // Sort row
  qsort(wlist, w_size, sizeof(int), cmp);

  for (int pwlist = 0; pwlist < w_size; pwlist++) {
    int j = wlist[pwlist];
    w[j] = false;
    A_idx[A_pos[i] + pwlist] = j;
  }
  A_pos[i+1] = A_pos[i] + w_size;
  w_size = 0;
}
A_vals = malloc(A_pos[m] * sizeof(double));

free(w);
free(wlist);
nitish2112 commented 6 years ago

Hi Fred, Has this bug been resolved?

fredrikbk commented 6 years ago

Hi, thanks for checking in. I'm sorry, but it hasn't been implemented yet since I diverted by small side-tasks, but I'm working on it and there's nothing blocking it so it should be in by the end of the month/early next month. I hope this doesn't negatively impact you too much.

nitish2112 commented 6 years ago

Hi Fred,

Thanks for the prompt reply. Yes, that is perfectly fine.

Best, Nitish

NIC0NIC0NI commented 6 years ago

Dear @nitish2112,

I'm so sorry, the current implementation does produce incorrect code for CSR matrix multiplication. We've worked out how to fix it in theory and I'll implement it first thing after thanksgiving break. (I look forward to code again!) I thought we had an issue on this, but I cannot find it so I'll update the name of this issue.

This is roughly the code we want for compute:

double* restrict w = malloc(A2_size*sizeof(double));
memset(w, 0, A2_size*sizeof(double));

for (int i = 0; i < B0_size; i++) {
  for (int pB = B_pos[i]; pB < B_pos[i+1]; pB++) {
    int k = B_idx[pB];
    for (int pC = C_pos[k]; pC < C_pos[k+1]; pC++) {
      int j = C_idx[pC];
      w[j] += B_vals[pB] * C_vals[pC];
    }
  }
  for (int pA = A_pos[i]; pA < A_pos[i+1]; pA++) {
    int j = A_idx[pA];
    A_vals[pA] = w[j];
    w[j] = 0.0;
  }
}

free(w);

For assembly:

bool* restrict w = malloc(A1_size*sizeof(bool));
memset(w, 0, A1_size*sizeof(bool));
int* restrict wlist = malloc(A1_size*sizeof(int));
int w_size = 0;

int A_idx_size = 1048576;
A_pos = malloc((m+1)*sizeof(int));
A_idx = malloc(A_idx_size*sizeof(int));

A_pos[0] = 0;
for (int i = 0; i < B0_size; i++) {
  for (int pB = B_pos[i]; pB < B_pos[i+1]; pB++) {
    int k = B_idx[pB];
    for (int pC = C_pos[k]; pC < C_pos[k+1]; pC++) {
      int j = C_idx[pC];
      if (!w[j]) {
        wlist[w_size++] = j;
        w[j] = true;
      }
    }
  }

  // Make sure A_idx is large enough
  if (A_idx_size < (A_pos[i]+w_size)) {
    A_idx_size *= 2;
    A_idx = realloc(A_idx, A_idx_size*sizeof(int));
  }

  // Sort row
  qsort(wlist, w_size, sizeof(int), cmp);

  for (int pwlist = 0; pwlist < w_size; pwlist++) {
    int j = wlist[pwlist];
    w[j] = false;
    A_idx[A_pos[i] + pwlist] = j;
  }
  A_pos[i+1] = A_pos[i] + w_size;
  w_size = 0;
}
A_vals = malloc(A_pos[m] * sizeof(double));

free(w);
free(wlist);

This hash-based approach may have compatibility issue with the existing merge-based approach for sparse vector add. I suggest to use the k-way merging, where k is the nnz of B(i,:). For CSR, You can use the sparse vector add code of A(i,:) = B(i,k_0) C(k_0,:) + B(i,k_1) C(k_1,:) + ... + B(i,k_nz_i) * C(k_nz_i,:)