nnstreamer / nntrainer

NNtrainer is a software framework for training neural network models on devices.
Apache License 2.0

[GPU/OpenCL] Added fp16 support for FC layer on GPU #2609

Closed: s-debadri closed this 3 months ago

s-debadri commented 4 months ago

FC Layer GPU kernels added for fp16 operation:
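As an illustration of what such a kernel looks like, here is a minimal fp16 GEMV sketch of the kind an FC layer forward pass uses, assuming a row-major weight matrix; the kernel and argument names are placeholders, not the exact code added in this PR.

```c
/* Illustrative sketch only: one work-item computes one output element of
 * Y = A * X, accumulating in fp32 to limit rounding error. Names are
 * placeholders, not the PR's actual kernel. */
#pragma OPENCL EXTENSION cl_khr_fp16 : enable

__kernel void fc_gemv_fp16(const __global half *A, const __global half *X,
                           __global half *Y, unsigned int rows,
                           unsigned int cols) {
  unsigned int row = get_global_id(0);
  if (row >= rows)
    return;

  float acc = 0.0f;
  for (unsigned int c = 0; c < cols; ++c)
    acc += (float)A[row * cols + c] * (float)X[c];

  Y[row] = (half)acc;
}
```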

Self evaluation:

  1. Build test: [X]Passed [ ]Failed [ ]Skipped
  2. Run test: [X]Passed [ ]Failed [ ]Skipped

Signed-off-by: Debadri Samaddar s.debadri@samsung.com

taos-ci commented 4 months ago

:memo: TAOS-CI Version: 1.5.20200925. Thank you for submitting PR #2609. Please follow the 1 commit/1 PR (one commit per PR) policy to get comments from reviewers quickly. Your PR must pass all verification processes of cibot before the reviewers start their review. If you are a new member joining this project, please read the manuals in the documentation folder and the wiki page. To monitor the progress of your PR in more detail, visit http://ci.nnstreamer.ai/.

s-debadri commented 4 months ago

> It's good to read your contributions in GPU enablement. One quick question: do you have a plan to improve the kernels further? For example, sgemv_cl_kernel's level of parallelism is one thread per component of the output vector, which can be parallelized further. It would be great to know the current speed-up compared to the CPU.

Yes, the kernels will be improved further depending on the extent of optimization we can achieve. Currently we are focusing on implementing the initial skeleton for running an LLM on the GPU.
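One possible direction, sketched here only as an assumption about how the further parallelization could look (not this PR's code): assign a work-group to each output element and let its work-items split the dot product, reducing the partial sums in local memory.

```c
/* Hypothetical sketch: one work-group per output row, WG_SIZE work-items
 * split the dot product and reduce in local memory. Assumes the kernel is
 * launched with global size rows * WG_SIZE and local size WG_SIZE. */
#pragma OPENCL EXTENSION cl_khr_fp16 : enable
#define WG_SIZE 64

__kernel void gemv_fp16_wg(const __global half *A, const __global half *X,
                           __global half *Y, unsigned int cols) {
  unsigned int row = get_group_id(0);
  unsigned int lid = get_local_id(0);
  __local float partial[WG_SIZE];

  float acc = 0.0f;
  for (unsigned int c = lid; c < cols; c += WG_SIZE)
    acc += (float)A[row * cols + c] * (float)X[c];
  partial[lid] = acc;
  barrier(CLK_LOCAL_MEM_FENCE);

  /* tree reduction over the work-group's partial sums */
  for (unsigned int stride = WG_SIZE / 2; stride > 0; stride >>= 1) {
    if (lid < stride)
      partial[lid] += partial[lid + stride];
    barrier(CLK_LOCAL_MEM_FENCE);
  }

  if (lid == 0)
    Y[row] = (half)partial[0];
}
```

Whether this actually beats the one-thread-per-element version depends on the matrix shape and the GPU, so it would need profiling against the CPU path.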

skykongkong8 commented 3 months ago

> Please identify the changes in blas_kernels.cpp before merging. They appear unrelated to the other changes.
>
> PTAL: @skykongkong8 @lhs8928

It was one of my suggestions from previous reviews to use terms like lda, ldb, or ldc, although it might have been better to separate the feature-implementation commit from the bugfix commit. I can confirm that the current implementation is more desirable than before.
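For context on the lda/ldb/ldc convention, the sketch below (placeholder names, not the code in blas_kernels.cpp) shows how leading-dimension arguments are typically passed so a kernel can address row-major sub-matrices without repacking them.

```c
/* Illustrative sketch: a naive GEMM that indexes through explicit leading
 * dimensions (lda/ldb/ldc) instead of assuming tightly packed matrices. */
__kernel void gemm_ld(const __global float *A, const __global float *B,
                      __global float *C, unsigned int M, unsigned int N,
                      unsigned int K, unsigned int lda, unsigned int ldb,
                      unsigned int ldc) {
  unsigned int m = get_global_id(0);
  unsigned int n = get_global_id(1);
  if (m >= M || n >= N)
    return;

  float acc = 0.0f;
  for (unsigned int k = 0; k < K; ++k)
    acc += A[m * lda + k] * B[k * ldb + n];

  C[m * ldc + n] = acc;
}
```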

jijoongmoon commented 3 months ago

Not in this PR, but the blas_kernel code needs to be moved under the tensor directory for better maintainability.