Add benchmarks to CI - Githubissues

moaradwan commented 2 years ago

Types of changes

[ ] Bug fix (non-breaking change which fixes an issue)
[X] New feature (non-breaking change which adds functionality)
[ ] Breaking change (fix or feature that would cause existing functionality to change)
[ ] Docs change / refactoring / dependency upgrade

Issue: https://github.com/pytorch/opacus/issues/368

Motivation and Context / Related issue

Issue #368 tracks committing the benchmark code. This change adds those benchmarks to the CI integration tests. To choose thresholds, I ran the benchmarks locally on all the layers (batch size: 16, num_runs: 100, num_repeats: 20, forward_only: False); see the comment below for more details.
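
For readers unfamiliar with these settings, here is a minimal sketch of how such a timing loop could look. It assumes that `num_runs` is the number of independent timed runs and `num_repeats` is the number of passes inside each run; that reading is an assumption, not something confirmed in this thread. The names `time_layer` and `dummy_step` are hypothetical, and the workload is a stand-in for a real forward + backward pass.

```python
import statistics
import time


def time_layer(step, num_runs=100, num_repeats=20):
    """Return the mean seconds per call of `step`, averaged over all runs.

    Assumption: each run times `num_repeats` consecutive calls, and the
    per-call time is then averaged across `num_runs` runs.
    """
    per_call = []
    for _ in range(num_runs):
        start = time.perf_counter()
        for _ in range(num_repeats):
            step()
        per_call.append((time.perf_counter() - start) / num_repeats)
    return statistics.mean(per_call)


def dummy_step():
    # Stand-in for one forward + backward pass (forward_only=False).
    sum(i * i for i in range(1000))


# Small values keep the example fast; the thread used 100 runs of 20 repeats.
mean_seconds = time_layer(dummy_step, num_runs=5, num_repeats=3)
```

Dividing the mean time of a DP or GSM variant by the mean time of the plain layer would then give the kind of runtime ratio that the thresholds are applied to.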

Using the report and section 3 in the paper, I parameterised the runtime and memory thresholds for different layers.

How Has This Been Tested (if it applies)

I ran the jobs locally and generated reports.
Local CircleCI config validation: circleci config process .circleci/config.yml
Local CircleCI job run: circleci local execute --job JOB_NAME

Checklist

[X] The documentation is up-to-date with the changes I made.
[X] I have read the CONTRIBUTING document and completed the CLA (see CONTRIBUTING).
[x] All tests passed, and additional code has been covered with new tests.

facebook-github-bot commented 2 years ago

@moaradwan has updated the pull request. You must reimport the pull request before landing.

facebook-github-bot commented 2 years ago

@facebook-github-bot has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

moaradwan commented 2 years ago

Results after running on GPU

The following tables show the memory and runtime metrics after running the benchmarks on CircleCI using gpu.nvidia.small.multi

Group 1: groupnorm, instancenorm, layernorm, dpmha

Threshold based on paper:

runtime_ratio_threshold: "2.6"
memory_ratio_threshold: "1.6"
:white_check_mark: Succeeded
Pipeline runtime: 22s

base_layer/value	memory	memory	memory	memory	runtime	runtime	runtime	runtime
	control	dp/control	gsm	gsm/control	control	dp/control	gsm	gsm/control
groupnorm	107520.0		140288.0	1.3047619047619048	0.00040995383000020535		0.000998464277499892	2.4355529926367363
instancenorm	6345728.0		7394304.0	1.1652412457640795	0.0005690672350000909		0.0012307228030001341	2.162701922207729
layernorm	28672.0		37888.0	1.3214285714285714	0.0003554899874999648		0.000726604483499955	2.043952035358034
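
The gsm/control columns above are ratios of the GSM measurement to the control measurement, and the thresholds are upper bounds on those ratios. A minimal sketch of that check, using the groupnorm row and the Group 1 thresholds; the helper names are hypothetical, and whether the real job uses `<=` or `<` is an assumption:

```python
def overhead_ratio(variant_value, control_value):
    """Ratio of the wrapped layer's cost to the plain (control) layer's cost."""
    return variant_value / control_value


def within_threshold(ratio, threshold):
    # Assumption: a ratio equal to the threshold still passes.
    return ratio <= threshold


# groupnorm row from the table above.
groupnorm_memory = overhead_ratio(140288.0, 107520.0)
groupnorm_runtime = overhead_ratio(0.000998464277499892, 0.00040995383000020535)

# Group 1 thresholds from this comment.
memory_ok = within_threshold(groupnorm_memory, 1.6)
runtime_ok = within_threshold(groupnorm_runtime, 2.6)
```

Both checks pass for groupnorm, which matches the reported status of the group.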

Group 2: Linear layer

Threshold based on paper:

runtime_ratio_threshold: "3.6"
memory_ratio_threshold: "13"
:white_check_mark: Succeeded
Pipeline runtime: 14s

base_layer/value	memory	memory	memory	memory	runtime	runtime	runtime	runtime
	control	dp/control	gsm	gsm/control	control	dp/control	gsm	gsm/control
linear	3283968.0		36903936.0	11.237605238540691	0.00041353099599993477		0.0010285112289999743	2.487144226064584

Group 3: GSM-DPMHA

Threshold based on paper:

runtime_ratio_threshold: "3.5"
memory_ratio_threshold: "2.0"
:white_check_mark: Succeeded
Pipeline runtime: 23s

base_layer/value	memory	memory	memory	memory	runtime	runtime	runtime	runtime
	control	dp/control	gsm	gsm/control	control	dp/control	gsm	gsm/control
mha	13630464.0		24162304.0	1.772669220945083	0.0012178095074999362		0.0037577736325000903	3.0856826206050023

Group 4: GRU

Threshold based on paper:

runtime_ratio_threshold: "18.5"
memory_ratio_threshold: "1.5"
:no_entry_sign: FAILED
Pipeline runtime: 7m47s

base_layer/value	memory	memory	memory	memory	memory	runtime	runtime	runtime	runtime	runtime
	control	dp	dp/control	gsm	gsm/control	control	dp	dp/control	gsm	gsm/control
gru	11186176.0	12154368.0	1.0865525448553643	16603136.0	1.484254851702673	0.004054944954499915	0.05725822975399994	14.120593595347938	0.14559441694749992	35.905399106818614

Group 5: LSTM

Threshold based on paper:

runtime_ratio_threshold: "16.5"
memory_ratio_threshold: "1.2"
:no_entry_sign: FAILED
Pipeline runtime: 7m17s

base_layer/value	memory	memory	memory	memory	memory	runtime	runtime	runtime	runtime	runtime
	control	dp	dp/control	gsm	gsm/control	control	dp	dp/control	gsm	gsm/control
lstm	10801152.0	11527680.0	1.06726393629124	18021376.0	1.6684679560106181	0.004167011488000128	0.051514086188000296	12.362357612007312	0.13722742892250026	32.93185759570818

Group 6: RNN

Threshold based on paper:

runtime_ratio_threshold: "16.5"
memory_ratio_threshold: "1.5"
:no_entry_sign: FAILED
Pipeline runtime: 4m41s

base_layer/value	memory	memory	memory	memory	memory	runtime	runtime	runtime	runtime	runtime
	control	dp	dp/control	gsm	gsm/control	control	dp	dp/control	gsm	gsm/control
rnn	6287360.0	5936640.0	0.9442182410423453	6346240.0	1.0093648208469055	0.003437245790000362	0.020753186856500516	6.03773722463366	0.09805997271550064	28.52864726775629

Group 7: Embedding

Threshold based on paper:

runtime_ratio_threshold: "6.0"
memory_ratio_threshold: "15.0"
:white_check_mark: Succeeded
Pipeline runtime: 20s

base_layer/value	memory	memory	memory	memory	runtime	runtime	runtime	runtime
	control	dp/control	gsm	gsm/control	control	dp/control	gsm	gsm/control
embedding	24021504.0		280028160.0	11.657394974103203	0.0004076045779994501		0.0023867599680010014	5.855576941052466

Open points

1. Reducing runtime of the jobs

So far I have excluded only the conv layer, since it takes up to an hour when run locally.

In total, the new tasks increase the pipeline execution time by ~20 minutes, with the recurrent layers taking most of that time. This brings the total execution time of the integration pipeline to 32 minutes.

Some improvements:

Run the benchmark jobs in parallel.
Remove some of the layers.
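
As a quick sanity check on the numbers above, the per-group pipeline runtimes reported in this comment can be summed. This is only arithmetic on the values already listed; the dictionary keys and helper name are hypothetical labels.

```python
import re

# Pipeline runtimes as reported per group in this comment.
PIPELINE_RUNTIMES = {
    "norm layers": "22s",
    "linear": "14s",
    "gsm-dpmha": "23s",
    "gru": "7m47s",
    "lstm": "7m17s",
    "rnn": "4m41s",
    "embedding": "20s",
}
RECURRENT = ("gru", "lstm", "rnn")


def to_seconds(text):
    """Convert strings such as '7m47s' or '22s' to whole seconds."""
    minutes, seconds = re.fullmatch(r"(?:(\d+)m)?(\d+)s", text).groups()
    return int(minutes or 0) * 60 + int(seconds)


total = sum(to_seconds(t) for t in PIPELINE_RUNTIMES.values())
recurrent = sum(to_seconds(PIPELINE_RUNTIMES[name]) for name in RECURRENT)
recurrent_share = recurrent / total
longest_job = max(to_seconds(t) for t in PIPELINE_RUNTIMES.values())
```

The total comes to 1264 seconds (about 21 minutes), of which the three recurrent groups account for 1185 seconds. If the jobs ran fully in parallel, the added wall-clock time would be bounded below by the longest single job, 467 seconds (7m47s), ignoring scheduling and setup overhead.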

2. Changing thresholds

Currently I used the paper to infer most of the thresholds. The validation of the recurrent layers has failed, though.

Group	Memory: threshold, highest ratio	Runtime: threshold, highest ratio
4: GRU	:white_check_mark: 1.5, 1.48	:no_entry_sign: 18.5, 35.9
5: LSTM	:no_entry_sign: 1.2, 1.668	:no_entry_sign: 16.5, 32.9
6: RNN	:white_check_mark: 1.5, 1.009	:no_entry_sign: 16.5, 28.528
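
The failures in this table can be reproduced from the ratios in the recurrent-layer tables above. The sketch below assumes that each job compares the larger of the dp/control and gsm/control ratios against a single shared threshold; that is inferred from the table, not confirmed by the code. `MEASURED`, `THRESHOLDS` and `failing_metrics` are hypothetical names.

```python
# (dp/control, gsm/control) ratios from the GRU, LSTM and RNN tables.
MEASURED = {
    "gru": {
        "memory": (1.0865525448553643, 1.484254851702673),
        "runtime": (14.120593595347938, 35.905399106818614),
    },
    "lstm": {
        "memory": (1.06726393629124, 1.6684679560106181),
        "runtime": (12.362357612007312, 32.93185759570818),
    },
    "rnn": {
        "memory": (0.9442182410423453, 1.0093648208469055),
        "runtime": (6.03773722463366, 28.52864726775629),
    },
}

# Paper-based thresholds listed for groups 4, 5 and 6.
THRESHOLDS = {
    "gru": {"memory": 1.5, "runtime": 18.5},
    "lstm": {"memory": 1.2, "runtime": 16.5},
    "rnn": {"memory": 1.5, "runtime": 16.5},
}


def failing_metrics(layer):
    """Return the metrics whose worst ratio exceeds the shared threshold."""
    return [
        metric
        for metric in ("memory", "runtime")
        if max(MEASURED[layer][metric]) > THRESHOLDS[layer][metric]
    ]


failures = {layer: failing_metrics(layer) for layer in MEASURED}
```

In every failing case it is the gsm/control ratio that exceeds the threshold, while the dp/control runtime ratios (14.1, 12.4 and 6.0) stay below it.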

moaradwan commented 2 years ago

@ffuuugor @ashkan-software regarding point number 2 mentioned in https://github.com/pytorch/opacus/pull/481#issuecomment-1228317342: should I just update the thresholds of the failing tests so that they pass?

moaradwan commented 2 years ago

Final results after updating thresholds

The jobs run separately under the name micro_benchmarks_py37_torch_release_cuda, and only under nightly. The whole run takes ~27 minutes. There are 10 tasks, as follows.

An example run is here.

Group 1: GSM of: (groupnorm, instancenorm, layernorm), and DPMHA

Threshold based on paper:

runtime_ratio_threshold: "2.6"
memory_ratio_threshold: "1.6"
:white_check_mark: Succeeded
Pipeline runtime: 22s

base_layer/value	memory	memory	memory	memory	memory	runtime	runtime	runtime	runtime	runtime
	control	dp	dp/control	gsm	gsm/control	control	dp	dp/control	gsm	gsm/control
groupnorm	107520.0			140288.0	1.3047619047619048	0.00040995383000020535			0.000998464277499892	2.4355529926367363
instancenorm	6345728.0			7394304.0	1.1652412457640795	0.0005690672350000909			0.0012307228030001341	2.162701922207729
layernorm	28672.0			37888.0	1.3214285714285714	0.0003554899874999648			0.000726604483499955	2.043952035358034
mha	13630464.0	13632512.0	1.00015025167155			0.0012336450990000003	0.001333286202000039	1.0807696663171673		

Group 2: GSM-Linear layer

Threshold based on paper:

runtime_ratio_threshold: "3.6"
memory_ratio_threshold: "13"
:white_check_mark: Succeeded
Pipeline runtime: 14s

base_layer/value	memory	memory	memory	memory	runtime	runtime	runtime	runtime
	control	dp/control	gsm	gsm/control	control	dp/control	gsm	gsm/control
linear	3283968.0		36903936.0	11.237605238540691	0.00041353099599993477		0.0010285112289999743	2.487144226064584

Group 3: GSM-DPMHA

Threshold based on paper:

runtime_ratio_threshold: "3.5"
memory_ratio_threshold: "2.0"
:white_check_mark: Succeeded
Pipeline runtime: 23s

base_layer/value	memory	memory	memory	memory	runtime	runtime	runtime	runtime
	control	dp/control	gsm	gsm/control	control	dp/control	gsm	gsm/control
mha	13630464.0		24162304.0	1.772669220945083	0.0012178095074999362		0.0037577736325000903	3.0856826206050023

Group 4&5: DPGRU and GSM-DPGRU

DPGRU:

runtime_ratio_threshold: "18.5"
memory_ratio_threshold: "1.2"
:white_check_mark: Succeeded
Pipeline runtime: 2m28s

GSM-DPGRU:
runtime_ratio_threshold: "40"
memory_ratio_threshold: "1.6"
:white_check_mark: Succeeded
Pipeline runtime: 5m42s

base_layer/value	memory	memory	memory	memory	memory	runtime	runtime	runtime	runtime	runtime
	control	dp	dp/control	gsm	gsm/control	control	dp	dp/control	gsm	gsm/control
gru	11186176.0	12154368.0	1.0865525448553643	16603136.0	1.484254851702673	0.004054944954499915	0.05725822975399994	14.120593595347938	0.14559441694749992	35.905399106818614

Group 6&7: DPLSTM and GSM-DPLSTM

DPLSTM:

runtime_ratio_threshold: "16.5"
memory_ratio_threshold: "1.2"
:white_check_mark: Succeeded
Pipeline runtime: 2m12s

GSM-DPLSTM:

runtime_ratio_threshold: "38"
memory_ratio_threshold: "1.8"
:white_check_mark: Succeeded
Pipeline runtime: 5m8s

base_layer/value	memory	memory	memory	memory	memory	runtime	runtime	runtime	runtime	runtime
	control	dp	dp/control	gsm	gsm/control	control	dp	dp/control	gsm	gsm/control
lstm	10801152.0	11527680.0	1.06726393629124	18021376.0	1.6684679560106181	0.004167011488000128	0.051514086188000296	12.362357612007312	0.13722742892250026	32.93185759570818

Group 8&9: DPRNN and GSM-DPRNN

DPRNN:

runtime_ratio_threshold: "10"
memory_ratio_threshold: "1.2"
:white_check_mark: Succeeded
Pipeline runtime: 1m4s

GSM-DPRNN:

runtime_ratio_threshold: "33"
memory_ratio_threshold: "1.2"
:white_check_mark: Succeeded
Pipeline runtime: 3m44s

base_layer/value	memory	memory	memory	memory	memory	runtime	runtime	runtime	runtime	runtime
	control	dp	dp/control	gsm	gsm/control	control	dp	dp/control	gsm	gsm/control
rnn	6287360.0	5936640.0	0.9442182410423453	6346240.0	1.0093648208469055	0.003437245790000362	0.020753186856500516	6.03773722463366	0.09805997271550064	28.52864726775629

Group 10: Embedding

Threshold based on paper:

runtime_ratio_threshold: "6.0"
memory_ratio_threshold: "15.0"
:white_check_mark: Succeeded
Pipeline runtime: 20s

base_layer/value	memory	memory	memory	memory	runtime	runtime	runtime	runtime
	control	dp/control	gsm	gsm/control	control	dp/control	gsm	gsm/control
embedding	24021504.0		280028160.0	11.657394974103203	0.0004076045779994501		0.0023867599680010014	5.855576941052466

facebook-github-bot commented 2 years ago

@moaradwan has updated the pull request. You must reimport the pull request before landing.

facebook-github-bot commented 2 years ago

@facebook-github-bot has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

moaradwan commented 2 years ago

@ffuuugor I updated the code as follows:

Split the tasks into different jobs.
Updated the thresholds. Note that these are ratios, not absolute values.
Added separate tasks for DP and GSM on the recurrent layers, so that each can have a tighter threshold.
The benchmarks now run only on nightly.

See the comment above for more details.
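
To illustrate why separate DP and GSM tasks allow tighter thresholds, the sketch below pairs each recurrent job's measured ratios with the thresholds listed in the final results. The `Job` class and its field names are hypothetical, and treating a ratio equal to the threshold as passing is an assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Job:
    name: str
    memory_ratio: float
    runtime_ratio: float
    memory_threshold: float
    runtime_threshold: float

    def passes(self):
        # Assumption: a ratio equal to the threshold still passes.
        return (
            self.memory_ratio <= self.memory_threshold
            and self.runtime_ratio <= self.runtime_threshold
        )


# Ratios and thresholds copied from the final results in this thread.
JOBS = [
    Job("DPGRU", 1.0865525448553643, 14.120593595347938, 1.2, 18.5),
    Job("GSM-DPGRU", 1.484254851702673, 35.905399106818614, 1.6, 40.0),
    Job("DPLSTM", 1.06726393629124, 12.362357612007312, 1.2, 16.5),
    Job("GSM-DPLSTM", 1.6684679560106181, 32.93185759570818, 1.8, 38.0),
    Job("DPRNN", 0.9442182410423453, 6.03773722463366, 1.2, 10.0),
    Job("GSM-DPRNN", 1.0093648208469055, 28.52864726775629, 1.2, 33.0),
]

all_pass = all(job.passes() for job in JOBS)

# Headroom: how far each runtime threshold sits above the measured ratio.
runtime_headroom = {job.name: job.runtime_threshold / job.runtime_ratio for job in JOBS}
```

With one shared threshold per layer, the DP variant would have to be checked against a bound loose enough for GSM (for example 40 for GRU); with split jobs, DPGRU is held to 18.5 while GSM-DPGRU gets its own bound.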

pytorch / opacus

Add benchmarks to CI #481
