mohamed82008 / DifferentiableFactorizations.jl

Differentiable matrix factorizations using ImplicitDifferentiation.jl.
MIT License
30 stars · 1 fork

test the running time of loop version of qr_conditions #11

Closed PhrygianGates closed 1 year ago

PhrygianGates commented 1 year ago

The original qr_conditions is

function qr_conditions(A, x)
  (; Q, R) = x
  return vcat(
    vec(UpperTriangular(Q' * Q) + LowerTriangular(R) - I - Diagonal(R)),
    vec(Q * R - A),
  )
end
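For context, these conditions vanish exactly at a true QR factorization: with orthonormal columns, `Q' * Q` is the identity, and `LowerTriangular(R) - Diagonal(R)` is the strictly lower part of an upper-triangular matrix, i.e. zero. A quick standalone check (independent of the package, using only `LinearAlgebra`):

```julia
using LinearAlgebra

# Verify that the residual used by qr_conditions is ≈ 0 at an exact QR.
A = rand(5, 3)
F = qr(A)
Q = Matrix(F.Q)   # thin Q, 5×3, orthonormal columns
R = F.R           # 3×3 upper triangular

res = vcat(
    vec(UpperTriangular(Q' * Q) + LowerTriangular(R) - I - Diagonal(R)),
    vec(Q * R - A),
)
@assert norm(res) < 1e-10   # both blocks vanish at the true factorization
```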

However, the test shows that it's much slower than the following for-loop version

function qr_conditions_dev(A, x)
  (; Q, R) = x
  m, n = size(A, 1), size(A, 2)
  res_A = copy(A)
  res_R = Matrix{eltype(A)}(I, n, n)
  for i in 1:m
    for j in 1:n
      for k in 1:n
        res_A[i, j] -= Q[i, k] * R[k, j]
        if i == 1 && j < k
          res_R[k, j] += R[k, j]
        end
        if j <= k
          res_R[j, k] -= Q[i, k] * Q[i, j]
        end
      end
    end
  end
  return vcat(vec(res_A), vec(res_R))
end

For matrices of size (5, 3), the former takes

2.311390 seconds (5.85 M allocations: 266.937 MiB, 6.47% gc time, 99.81% compilation time)

while the latter takes

0.023781 seconds (2.46 k allocations: 131.417 KiB, 98.84% compilation time)

The reason is that the former accesses each element of the matrices many times, which is redundant. So we are considering rewriting these conditions. However, we currently use Zygote as our AD system, which does not support in-place operations, and that is the problem to be solved.
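For reference, Zygote's documented escape hatch for in-place updates is `Zygote.Buffer`: you may write into a `Buffer` and freeze it with `copy` at the end. Below is a minimal sketch, not the package's implementation; `neg_residual` is a made-up name and it only covers the `A - Q*R` block:

```julia
using Zygote, LinearAlgebra

# Sketch: an in-place-style loop made differentiable via Zygote.Buffer.
# Zygote cannot differentiate through mutation of ordinary arrays, but
# writing into a Buffer is allowed; `copy` turns it back into an array.
function neg_residual(A, Q, R)
    m, n = size(A)
    buf = Zygote.Buffer(A)          # same shape and eltype as A
    for i in 1:m, j in 1:n
        acc = A[i, j]
        for k in 1:n
            acc -= Q[i, k] * R[k, j]
        end
        buf[i, j] = acc             # mutation of the Buffer, not an Array
    end
    return vec(copy(buf))
end

A = rand(5, 3)
F = qr(A); Q = Matrix(F.Q); R = F.R
# d/dA of sum(A - Q*R) is a matrix of ones, so the gradient is easy to check:
g = Zygote.gradient(a -> sum(neg_residual(a, Q, R)), A)[1]
```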

mohamed82008 commented 1 year ago

A few comments:

codecov-commenter commented 1 year ago

Codecov Report

Merging #11 (c7ee211) into main (69313bd) will decrease coverage by 20.12%. The diff coverage is 0.00%.


@@             Coverage Diff             @@
##             main      #11       +/-   ##
===========================================
- Coverage   98.97%   78.86%   -20.12%     
===========================================
  Files           1        1               
  Lines          98      123       +25     
===========================================
  Hits           97       97               
- Misses          1       26       +25     
Impacted Files                        Coverage Δ
src/DifferentiableFactorizations.jl   78.86% <0.00%> (-20.12%) :arrow_down:


mohamed82008 commented 1 year ago

Also please turn the notebook into a test file in the tests folder.

PhrygianGates commented 1 year ago

I used `@time` from Julia's standard library. Now I've switched to BenchmarkTools, and the new version makes only one allocation, but the results show that the for-loop version is much slower than the matrix-multiplication version. I guess that's because matrix multiplication is much more optimized than our simple for loop?

using DifferentiableFactorizations, LinearAlgebra, BenchmarkTools
A = rand(5, 3)
x = diff_qr(A)
(; Q, R) = x

@benchmark begin
    vcat(
        vec(UpperTriangular(Q' * Q) + LowerTriangular(R) - I - Diagonal(R)),
        vec(Q * R - A),
    )
end

@benchmark begin
  m, n = size(A, 1), size(A, 2)
  res = similar(A, m + n, n)
  res_A = view(res, 1:m, :)
  res_R = view(res, m+1:m+n, :)
  copyto!(res_A, A)
  fill!(res_R, 0)
  for i in 1:min(m, n)
    res_R[i, i] = 1
  end
  for i in 1:m
    for j in 1:n
      for k in 1:n
        res_A[i, j] -= Q[i, k] * R[k, j]
        if i == 1 && j < k
          res_R[k, j] += R[k, j]
        end
        if j <= k
          res_R[j, k] -= Q[i, k] * Q[i, j]
        end
      end
    end
  end
  vec(res)
end

I'll move this into the tests later. Thanks!

mohamed82008 commented 1 year ago

The code you benchmark should always be inside a function. Also try multiple matrix sizes.
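The blocks above benchmark global variables at top level, which measures dynamic-dispatch overhead rather than the function itself. A sketch of the usual BenchmarkTools hygiene (wrap the code in a function and interpolate arguments with `$`; the name `qr_conds` is made up for the example):

```julia
using BenchmarkTools, LinearAlgebra

# Benchmark a function, not a global-scope block, and interpolate the
# arguments with `$` so BenchmarkTools times the call, not global lookups.
qr_conds(A, Q, R) = vcat(
    vec(UpperTriangular(Q' * Q) + LowerTriangular(R) - I - Diagonal(R)),
    vec(Q * R - A),
)

A = rand(5, 3)
F = qr(A); Q = Matrix(F.Q); R = F.R

@benchmark qr_conds($A, $Q, $R)
```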

PhrygianGates commented 1 year ago

Yeah, this time I benchmarked only the function. I also tried different sizes and found that our for-loop implementation is faster than the matrix-multiplication version when the matrices are small [benchmark screenshot], and slower for large matrices [benchmark screenshot]. These are two typical tests, and there are more in the notebook script. Do you think that's reasonable?
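That crossover is expected: BLAS-backed `mul!` has per-call overhead that dominates for tiny matrices, while its blocking and SIMD win out at larger sizes. A self-contained illustration of the effect (sizes and time budget chosen arbitrarily, not taken from the tests above):

```julia
using LinearAlgebra, BenchmarkTools

# A naive triple loop vs BLAS mul! at a small and a larger size.
# The loop order (j, k, i) keeps the innermost access column-major.
function naive_mul!(C, A, B)
    m, n = size(C); K = size(A, 2)
    fill!(C, zero(eltype(C)))
    @inbounds for j in 1:n, k in 1:K, i in 1:m
        C[i, j] += A[i, k] * B[k, j]
    end
    return C
end

for s in (4, 100)
    A, B = rand(s, s), rand(s, s)
    C = similar(A)
    t_naive = @belapsed naive_mul!($C, $A, $B) seconds = 0.2
    t_blas  = @belapsed mul!($C, $A, $B) seconds = 0.2
    println("s = $s: naive = $(t_naive)s, BLAS = $(t_blas)s")
end
```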