nnstreamer / nntrainer

NNtrainer is a Software Framework for Training Neural Network Models on Devices.
Apache License 2.0

[ Model ] Enable Mixed Precision Training #2628

Closed jijoongmoon closed 3 weeks ago

jijoongmoon commented 3 weeks ago

In this PR

This PR modifies code related to Mixed Precision Training.

Commits to be reviewed in this PR

[ Model ] Fix the gradient clipping for the FP16 or Low bit Gradient
When computing the L2 norm of a gradient tensor for gradient clipping, this PR first converts the gradient to full precision and then computes the norm.
Resolves:
**Self evaluation:**
1. Build test: [X]Passed [ ]Failed [ ]Skipped
2. Run test: [X]Passed [ ]Failed [ ]Skipped
Signed-off-by: jijoong.moon
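To make the motivation concrete: the largest finite FP16 value is 65504, so squaring any element whose magnitude exceeds roughly 255 overflows to Inf if the norm is accumulated in half precision. The sketch below shows the general idea of widening each element to `float` before squaring. The helper names `global_l2norm_fp32` and `clip_by_global_norm` are hypothetical and are not the nntrainer API. Clipping by a single norm taken over all gradient tensors is also an assumption about how the norm is used.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Hypothetical helper (not the nntrainer API): L2 norm over all gradient
// tensors, with every element widened to float before it is squared.
template <typename T>
float global_l2norm_fp32(const std::vector<std::vector<T>> &grads) {
  float sum_sq = 0.0f;
  for (const auto &g : grads) {
    for (const T &v : g) {
      const float f = static_cast<float>(v); // widen first, then square
      sum_sq += f * f;
    }
  }
  return std::sqrt(sum_sq);
}

// Scales all gradients in place so that the global norm does not exceed
// max_norm. Returns the norm measured before clipping.
template <typename T>
float clip_by_global_norm(std::vector<std::vector<T>> &grads, float max_norm) {
  assert(max_norm > 0.0f);
  const float norm = global_l2norm_fp32(grads);
  if (norm > max_norm) {
    const float scale = max_norm / norm;
    for (auto &g : grads)
      for (T &v : g)
        v = static_cast<T>(static_cast<float>(v) * scale);
  }
  return norm;
}
```

The template parameter `T` stands for whatever low-precision element type the gradient uses. Only the accumulation and the scale factor are kept in full precision.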

[ Layer ] Add mu and var backup tensor
This PR adds the mu and var backup tensors (mu_b, var_b) so that the previous moving mean and moving variance can be restored during mixed precision training.
Resolves:
**Self evaluation:**
1. Build test: [X]Passed [ ]Failed [ ]Skipped
2. Run test: [X]Passed [ ]Failed [ ]Skipped
Signed-off-by: jijoong.moon
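A minimal sketch of the backup idea, assuming the moving statistics are updated with the usual exponential average during the forward pass. The struct `MovingStats` and its methods are hypothetical and only illustrate the roles of `mu_b` and `var_b`. They are not the layer code changed in this PR.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for a batch normalization layer's running statistics.
struct MovingStats {
  std::vector<float> mu, var;     // moving mean and moving variance
  std::vector<float> mu_b, var_b; // backups taken before each update

  // Saves the current statistics, then applies the exponential update.
  void update(const std::vector<float> &batch_mu,
              const std::vector<float> &batch_var, float momentum) {
    assert(batch_mu.size() == mu.size() && batch_var.size() == var.size());
    mu_b = mu;
    var_b = var;
    for (std::size_t i = 0; i < mu.size(); ++i) {
      mu[i] = momentum * mu[i] + (1.0f - momentum) * batch_mu[i];
      var[i] = momentum * var[i] + (1.0f - momentum) * batch_var[i];
    }
  }

  // Undoes the last update, for example before an iteration is rerun.
  void restore() {
    mu = mu_b;
    var = var_b;
  }
};
```

Without the backup, rerunning the forward pass for the same batch would fold that batch's statistics into the moving averages twice.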

[ Layer ] prevent randomize when it restore the data
To restore the previous iteration's data, this PR disables randomization of the mask when the previous data needs to be restored.
Resolves:
**Self evaluation:**
1. Build test: [X]Passed [ ]Failed [ ]Skipped
2. Run test: [X]Passed [ ]Failed [ ]Skipped
Signed-off-by: jijoong.moon
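The behavior described here can be sketched as a dropout-style mask that is regenerated only when no restore is requested. The struct `DropoutMask`, the `restore` parameter, and the fixed seed are all hypothetical and chosen for illustration. The source does not say which layer owns the mask, so treating it as dropout is an assumption.

```cpp
#include <cassert>
#include <cstddef>
#include <random>
#include <vector>

// Hypothetical mask holder: a fresh random mask is drawn on a normal forward
// pass, while a restoring pass reuses the mask of the previous iteration.
struct DropoutMask {
  std::vector<float> mask;
  std::mt19937 rng{42}; // fixed seed, for a reproducible illustration only

  void forward(std::vector<float> &x, float rate, bool restore) {
    assert(rate >= 0.0f && rate < 1.0f);
    if (!restore || mask.size() != x.size()) {
      std::bernoulli_distribution keep(1.0 - rate);
      mask.resize(x.size());
      for (float &m : mask)
        m = keep(rng) ? 1.0f / (1.0f - rate) : 0.0f; // inverted dropout scale
    }
    for (std::size_t i = 0; i < x.size(); ++i)
      x[i] *= mask[i];
  }
};
```

Reusing the mask means that a rerun of the same iteration sees the same dropped units as the first attempt.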

[ Context ] add check if it needs restore previous data
This PR enables a check for whether the previous data needs to be restored. With this check, we can remove NaN or Inf data from a Tensor during mixed precision training.
**Self evaluation:**
1. Build test: [X]Passed [ ]Failed [ ]Skipped
2. Run test: [X]Passed [ ]Failed [ ]Skipped
Signed-off-by: jijoong.moon
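One way such a check could look, as a sketch only: scan a tensor for non-finite values and record the result in a context flag that layers can consult. `RunContext`, `all_finite`, and `update_restore_flag` are hypothetical names. The actual context class and the condition used in this PR may differ.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <vector>

// Hypothetical context object carrying the "restore previous data" flag.
struct RunContext {
  bool restore = false;
};

// Returns true when every element is finite (neither NaN nor +/-Inf).
inline bool all_finite(const std::vector<float> &t) {
  return std::all_of(t.begin(), t.end(),
                     [](float v) { return std::isfinite(v); });
}

// Sets the flag when the tensor holds NaN or Inf and returns the new value.
inline bool update_restore_flag(RunContext &ctx, const std::vector<float> &t) {
  ctx.restore = !all_finite(t);
  return ctx.restore;
}

// Small usage example.
inline void example_check() {
  RunContext ctx;
  update_restore_flag(ctx, {0.5f, -1.25f});
  assert(!ctx.restore);
  update_restore_flag(ctx, {0.5f, INFINITY});
  assert(ctx.restore);
}
```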

[ Tensor ] remove sscal to set zero.
Calling setZero() must remove NaN or Inf values from a Tensor. However, if setZero() uses sscal, the NaN or Inf values remain, because multiplying NaN or Inf by zero yields NaN. This PR replaces sscal with memset.
Resolves:
**Self evaluation:**
1. Build test: [X]Passed [ ]Failed [ ]Skipped
2. Run test: [X]Passed [ ]Failed [ ]Skipped
Signed-off-by: jijoong.moon
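The difference is easy to reproduce with plain IEEE 754 arithmetic. In the sketch below, `scale_in_place` is a plain loop standing in for a BLAS-style `x := alpha * x` scale. It is not a call into a real BLAS library, and `set_zero` is a hypothetical helper rather than the Tensor method itself.

```cpp
#include <cassert>
#include <cmath>
#include <cstring>
#include <vector>

// Stand-in for a BLAS-style scale, x := alpha * x (not a real BLAS call).
inline void scale_in_place(std::vector<float> &x, float alpha) {
  for (float &v : x)
    v *= alpha;
}

// Overwrites the buffer with zero bytes. On IEEE 754 platforms an all-zero
// bit pattern is +0.0f, so previous NaN or Inf contents cannot survive.
inline void set_zero(std::vector<float> &x) {
  if (!x.empty())
    std::memset(x.data(), 0, x.size() * sizeof(float));
}

// Small usage example contrasting the two approaches.
inline void example_zeroing() {
  std::vector<float> t = {1.0f, NAN, INFINITY};
  scale_in_place(t, 0.0f);
  assert(std::isnan(t[1]) && std::isnan(t[2])); // NaN*0 and Inf*0 are NaN
  set_zero(t);
  assert(t[1] == 0.0f && t[2] == 0.0f);
}
```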

taos-ci commented 3 weeks ago

:memo: TAOS-CI Version: 1.5.20200925. Thank you for submitting PR #2628. Please follow the 1commit/1PR (one commit per PR) policy to get comments quickly from reviewers. Your PR must pass all verification processes of cibot before reviewers start the review process. If you are a new member of this project, please read the manuals in the documentation folder and on the wiki page. To monitor the progress of your PR in more detail, visit http://ci.nnstreamer.ai/.

taos-ci commented 3 weeks ago

:octocat: cibot: @jijoongmoon, A builder checker could not be completed because one of the checkers is not completed. In order to find out a reason, please go to http://ci.nnstreamer.ai/nntrainer/ci/repo-workers/pr-checker/2628-202406092007560.038749933242798-88d84e4429ccd6956521c9e33de600525ccc8aff/.

taos-ci commented 3 weeks ago

:octocat: cibot: @jijoongmoon, A builder checker could not be completed because one of the checkers is not completed. In order to find out a reason, please go to http://ci.nnstreamer.ai/nntrainer/ci/repo-workers/pr-checker/2628-202406100820130.94758605957031-8e06368387284d1d3ec5cdb8e272946fe06d2ff8/.

taos-ci commented 3 weeks ago

:octocat: cibot: @jijoongmoon, A builder checker could not be completed because one of the checkers is not completed. In order to find out a reason, please go to http://ci.nnstreamer.ai/nntrainer/ci/repo-workers/pr-checker/2628-202406100915430.15493392944336-54bd73dbced2c88ca8789840d9151aa7245e3746/.