uncomplicate / neanderthal

Fast Clojure Matrix Library
http://neanderthal.uncomplicate.org
Eclipse Public License 1.0

Sum operation in CUDA versus MKL #65

Closed jonesn closed 5 years ago

jonesn commented 5 years ago

Hi Dragan, it looks like the sum operation only covers the first 32 rows or columns of the matrix on CUDA.

Native seems fine. Examples below.

Native

(ns nz.co.arachnid.scorpion.matrixnative
  (:use [uncomplicate.neanderthal core native]))

;; Intel MKL Version

(defn large-square-matrix-mult-native
  "Demo function to do large square Matrix multiplications and time them."
  [n]
  (let [cnt        n
        matrix-a   (fge cnt cnt (repeat 1))
        matrix-b   (copy matrix-a)]
    (time
      (let [result (mm 1.0 matrix-a matrix-b)]
          ;; Return the Matrix and the sum of its elements
          {:sum-of-elements   (rationalize (sum result))
           :matrix            result}))))

(comment
  (large-square-matrix-mult-native 8) ;; 512
  (large-square-matrix-mult-native 16) ;; 4096
  (large-square-matrix-mult-native 32) ;; 32768
  (large-square-matrix-mult-native 64) ;; 262144 = 64 * 64 * 64
  (large-square-matrix-mult-native 128)) ;; 2097152 = 128 * 128 * 128

CUDA

(ns nz.co.arachnid.scorpion.matrixcuda
  (:require [uncomplicate.clojurecuda.core :refer :all])
  (:use [uncomplicate.neanderthal core cuda]))

(init)

;; =======================
;; Large Matrix Operations
;; =======================

(defn large-square-matrix-mult-cuda
  "Demo function to do large square Matrix multiplications and time them."
  [n]
  (with-default
    (with-default-engine
      (let [cnt        n
            matrix-a   (cuge cnt cnt (repeat 1))
            matrix-b   (copy matrix-a)]
        (time
          (let [result (mm 1.0 matrix-a matrix-b)]
            ;; Return the Matrix and the sum of its elements
            {:sum-of-elements       (rationalize (sum result))
             :matrix                result}))))))

(comment
  (large-square-matrix-mult-cuda 8) ;; 512
  (large-square-matrix-mult-cuda 16) ;; 4096
  (large-square-matrix-mult-cuda 32) ;; 32768
  (large-square-matrix-mult-cuda 64) ;; 131072 = 32 * 64 * 64
  (large-square-matrix-mult-cuda 128)) ;; 524288 = 32 * 128 * 128

blueberry commented 5 years ago

Hi Nick,

Thank you for reporting this. This bug was fixed two weeks ago. Are you using the latest version (0.25.3)?

Here's the output I've just got from running the code in my REPL:

 (large-square-matrix-mult-cuda 128)
"Elapsed time: 0.596927 msecs"
{:sum-of-elements 2097152N,
 :matrix #CUGEMatrix[float, mxn:128x128, layout:column, offset:0]}

BTW, unrelated to this: your function leaks memory. Always use with-release instead of let when you bind matrix and vector objects, especially when they hold GPU memory.
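A minimal sketch of how the CUDA function from the report might look with with-release (from uncomplicate.commons.core), assuming the same ClojureCUDA and Neanderthal namespaces used above. The namespace and function names here are hypothetical. Because the bound matrices are released when the block exits, this version returns only the sum and not the matrix itself:

```clojure
(ns example.release-sketch ;; hypothetical namespace, for illustration only
  (:require [uncomplicate.commons.core :refer [with-release]]
            [uncomplicate.clojurecuda.core :refer [init with-default]]
            [uncomplicate.neanderthal.core :refer [copy mm sum]]
            [uncomplicate.neanderthal.cuda :refer [cuge with-default-engine]]))

(init)

(defn square-mult-sum-cuda
  "Multiplies an n x n matrix of ones by a copy of itself on the GPU and
  returns the sum of the entries of the product. All GPU matrices are
  released before the function returns."
  [n]
  (with-default
    (with-default-engine
      ;; with-release binds like let, then releases every binding
      ;; (in reverse order) once the body has been evaluated.
      (with-release [matrix-a (cuge n n (repeat 1))
                     matrix-b (copy matrix-a)
                     result   (mm 1.0 matrix-a matrix-b)]
        ;; Only a plain number escapes the block, so nothing that has
        ;; been released is handed back to the caller.
        (sum result)))))
```

If the caller needs the product matrix as well, one option is to create it outside the function and release it at the call site once it is no longer needed.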

BBTW, sum on the CPU is not provided by MKL; it is Clojure code provided for convenience. If you know that all the elements are positive, asum is much faster.
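A small sketch of why the sign condition matters, assuming asum accepts a general matrix in the same way sum does. asum adds up absolute values, so it agrees with sum only when no entry is negative. The namespace and function names are hypothetical:

```clojure
(ns example.asum-sketch ;; hypothetical namespace, for illustration only
  (:require [uncomplicate.commons.core :refer [with-release]]
            [uncomplicate.neanderthal.core :refer [asum sum]]
            [uncomplicate.neanderthal.native :refer [fge]]))

(defn sum-vs-asum
  "Returns the plain sum and the sum of absolute values for a 2x2
  single-precision matrix built from the four given entries."
  [entries]
  (with-release [a (fge 2 2 entries)]
    {:sum  (sum a)
     :asum (asum a)}))

(comment
  (sum-vs-asum [1 2 3 4])    ;; all entries positive: :sum and :asum agree
  (sum-vs-asum [1 -2 3 -4])) ;; mixed signs: :asum exceeds :sum
```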

jonesn commented 5 years ago

Hi,

Yes, a bump from 24 -> 25.3 corrects this. Thanks for the tips on usage.