[Performance, Refactor, BugFix] Faster loading of uninitialized storages

vmoens commented 1 month ago

cc @teopir

cc @shagunsodhani this is a good example of preallocation with tensordict. We were using a lot of lazy stacks and only stacking at the last minute. Using a preallocated TD instead (create an empty td -> get a bunch of views of that td -> write to the first view, and all the views get instantiated at once) made the whole thing 20-1000x faster!
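For readers unfamiliar with the pattern, here is a minimal sketch of "preallocate once, write through views" versus "collect chunks and stack at the end". It is only an analogy: it uses the standard-library `array` and `memoryview` rather than the tensordict API, and the function names `collect_late_stack` and `collect_preallocated` are hypothetical, not code from this PR.

```python
from array import array


def collect_late_stack(num_workers: int, frame_len: int) -> array:
    """Baseline: every worker builds its own chunk; chunks are copied together at the end."""
    chunks = [array("d", [float(w)] * frame_len) for w in range(num_workers)]
    stacked = array("d")
    for chunk in chunks:
        stacked.extend(chunk)  # one extra copy per worker, paid at the last minute
    return stacked


def collect_preallocated(num_workers: int, frame_len: int) -> array:
    """Preallocate one buffer, hand out views, and let each worker write in place."""
    buffer = array("d", [0.0]) * (num_workers * frame_len)
    whole = memoryview(buffer)
    # Each slice is a view: it shares memory with `buffer` instead of copying it.
    views = [whole[w * frame_len:(w + 1) * frame_len] for w in range(num_workers)]
    for w, view in enumerate(views):
        # The write lands directly in `buffer`; no final stacking step is needed.
        view[:] = array("d", [float(w)] * frame_len)
    return buffer


if __name__ == "__main__":
    print(list(collect_preallocated(2, 3)))
```

Both functions produce the same data; the second one simply never materializes per-worker chunks that have to be copied again. The speedup quoted above comes from the PR's own measurements, not from this toy example.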

pytorch-bot[bot] commented 1 month ago

:link: Helpful Links

:test_tube: See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/rl/2221

:page_facing_up: Preview Python docs built from this PR

Note: Links to docs will display an error until the docs builds have been completed.

:x: 3 New Failures, 2 Unrelated Failures

As of commit 934f48cd20c1053c86bf5ac0847f970a46741a12 with merge base 166467a6cf06f0e161f2f86c92549499db9c7899:

NEW FAILURES - The following jobs have failed:

* [Habitat Tests on Linux / tests (3.9, 12.1) / linux-job](https://hud.pytorch.org/pr/pytorch/rl/2221#26062936978) ([gh](https://github.com/pytorch/rl/actions/runs/9461676962/job/26062936978)) `RuntimeError: Command docker exec -t c65ffe47437af873d5786083febb9daddae0f54f0760269ebe69c4d737442ecb /exec failed with exit code 139`
* [Unit-tests on Linux / tests-optdeps (3.10, 12.1) / linux-job](https://hud.pytorch.org/pr/pytorch/rl/2221#26062961741) ([gh](https://github.com/pytorch/rl/actions/runs/9461676979/job/26062961741)) `RuntimeError: Command docker exec -t 46defc967ca30c7360abc295275e87cd63083265009dafefbfa887a113c031a5 /exec failed with exit code 1`
* [Unit-tests on Windows / unittests-cpu / windows-job](https://hud.pytorch.org/pr/pytorch/rl/2221#26062938991) ([gh](https://github.com/pytorch/rl/actions/runs/9461676966/job/26062938991)) `The process 'C:\Program Files\Git\cmd\git.exe' failed with exit code 128`

BROKEN TRUNK - The following jobs failed but were present on the merge base:

👉 Rebase onto the `viable/strict` branch to avoid these failures

* [Libs Tests on Linux / unittests-gym (3.9, 12.1) / linux-job](https://hud.pytorch.org/pr/pytorch/rl/2221#26062957110) ([gh](https://github.com/pytorch/rl/actions/runs/9461676960/job/26062957110)) ([trunk failure](https://hud.pytorch.org/pytorch/rl/commit/166467a6cf06f0e161f2f86c92549499db9c7899#26022941493)) `##[error]The operation was canceled.`
* [Unit-tests on Linux / tests-olddeps (3.8, 11.6) / linux-job](https://hud.pytorch.org/pr/pytorch/rl/2221#26062961408) ([gh](https://github.com/pytorch/rl/actions/runs/9461676979/job/26062961408)) ([trunk failure](https://hud.pytorch.org/pytorch/rl/commit/166467a6cf06f0e161f2f86c92549499db9c7899#26022940333)) `##[error]The operation was canceled.`

This comment was automatically generated by Dr. CI and updates every 15 minutes.

github-actions[bot] commented 1 month ago

$\color{#D29922}\textsf{\Large\⚠\kern{0.2cm}\normalsize Warning}$ Result of CPU Benchmark Tests

Total Benchmarks: 91. Improved: $\large\color{#35bf28}11$. Worsened: $\large\color{#d91a1a}5$.


| Name | Max | Mean | Ops | Ops on Repo `HEAD` | Change |
| --- | --- | --- | --- | --- | --- |
| test_single | 0.1092s | 57.8620ms | 17.2825 Ops/s | 17.8609 Ops/s | $\color{#d91a1a}-3.24\\%$ |
| test_sync | 40.5347ms | 34.6504ms | 28.8597 Ops/s | 32.1993 Ops/s | $\textbf{\color{#d91a1a}-10.37\\%}$ |
| test_async | 58.3820ms | 29.3221ms | 34.1040 Ops/s | 35.4492 Ops/s | $\color{#d91a1a}-3.79\\%$ |
| test_simple | 0.4386s | 0.3810s | 2.6249 Ops/s | 2.6591 Ops/s | $\color{#d91a1a}-1.29\\%$ |
| test_transformed | 0.5818s | 0.5352s | 1.8685 Ops/s | 1.8462 Ops/s | $\color{#35bf28}+1.21\\%$ |
| test_serial | 1.2906s | 1.2341s | 0.8103 Ops/s | 0.7900 Ops/s | $\color{#35bf28}+2.57\\%$ |
| test_parallel | 1.1262s | 1.0652s | 0.9388 Ops/s | 0.9392 Ops/s | $\color{#d91a1a}-0.04\\%$ |
| test_step_mdp_speed[True-True-True-True-True] | 74.4860μs | 21.3683μs | 46.7982 KOps/s | 45.0420 KOps/s | $\color{#35bf28}+3.90\\%$ |
| test_step_mdp_speed[True-True-True-True-False] | 46.2970μs | 13.0328μs | 76.7294 KOps/s | 74.6741 KOps/s | $\color{#35bf28}+2.75\\%$ |
| test_step_mdp_speed[True-True-True-False-True] | 33.5430μs | 12.7415μs | 78.4836 KOps/s | 78.0207 KOps/s | $\color{#35bf28}+0.59\\%$ |
| test_step_mdp_speed[True-True-True-False-False] | 46.2170μs | 7.6698μs | 130.3807 KOps/s | 127.8736 KOps/s | $\color{#35bf28}+1.96\\%$ |
| test_step_mdp_speed[True-True-False-True-True] | 51.3670μs | 22.9853μs | 43.5061 KOps/s | 42.9170 KOps/s | $\color{#35bf28}+1.37\\%$ |
| test_step_mdp_speed[True-True-False-True-False] | 50.3350μs | 14.2544μs | 70.1539 KOps/s | 67.9458 KOps/s | $\color{#35bf28}+3.25\\%$ |
| test_step_mdp_speed[True-True-False-False-True] | 42.0690μs | 13.9407μs | 71.7323 KOps/s | 70.8762 KOps/s | $\color{#35bf28}+1.21\\%$ |
| test_step_mdp_speed[True-True-False-False-False] | 44.2640μs | 8.9339μs | 111.9331 KOps/s | 109.6311 KOps/s | $\color{#35bf28}+2.10\\%$ |
| test_step_mdp_speed[True-False-True-True-True] | 54.7230μs | 24.3320μs | 41.0981 KOps/s | 40.4271 KOps/s | $\color{#35bf28}+1.66\\%$ |
| test_step_mdp_speed[True-False-True-True-False] | 53.4900μs | 15.6971μs | 63.7061 KOps/s | 61.6495 KOps/s | $\color{#35bf28}+3.34\\%$ |
| test_step_mdp_speed[True-False-True-False-True] | 50.9560μs | 14.1277μs | 70.7828 KOps/s | 70.0807 KOps/s | $\color{#35bf28}+1.00\\%$ |
| test_step_mdp_speed[True-False-True-False-False] | 33.7940μs | 8.9734μs | 111.4408 KOps/s | 109.5249 KOps/s | $\color{#35bf28}+1.75\\%$ |
| test_step_mdp_speed[True-False-False-True-True] | 60.0630μs | 25.4968μs | 39.2205 KOps/s | 38.6252 KOps/s | $\color{#35bf28}+1.54\\%$ |
| test_step_mdp_speed[True-False-False-True-False] | 44.7740μs | 16.9516μs | 58.9914 KOps/s | 57.1577 KOps/s | $\color{#35bf28}+3.21\\%$ |
| test_step_mdp_speed[True-False-False-False-True] | 50.3650μs | 15.1156μs | 66.1567 KOps/s | 65.5584 KOps/s | $\color{#35bf28}+0.91\\%$ |
| test_step_mdp_speed[True-False-False-False-False] | 33.5730μs | 10.0547μs | 99.4557 KOps/s | 95.7110 KOps/s | $\color{#35bf28}+3.91\\%$ |
| test_step_mdp_speed[False-True-True-True-True] | 60.4630μs | 24.2469μs | 41.2423 KOps/s | 40.6127 KOps/s | $\color{#35bf28}+1.55\\%$ |
| test_step_mdp_speed[False-True-True-True-False] | 58.5790μs | 15.5458μs | 64.3260 KOps/s | 61.9441 KOps/s | $\color{#35bf28}+3.85\\%$ |
| test_step_mdp_speed[False-True-True-False-True] | 42.6800μs | 16.2754μs | 61.4425 KOps/s | 60.8697 KOps/s | $\color{#35bf28}+0.94\\%$ |
| test_step_mdp_speed[False-True-True-False-False] | 45.6050μs | 10.0666μs | 99.3385 KOps/s | 95.9511 KOps/s | $\color{#35bf28}+3.53\\%$ |
| test_step_mdp_speed[False-True-False-True-True] | 57.1410μs | 25.2579μs | 39.5915 KOps/s | 38.5319 KOps/s | $\color{#35bf28}+2.75\\%$ |
| test_step_mdp_speed[False-True-False-True-False] | 41.8790μs | 16.8572μs | 59.3219 KOps/s | 57.5940 KOps/s | $\color{#35bf28}+3.00\\%$ |
| test_step_mdp_speed[False-True-False-False-True] | 51.5070μs | 17.2917μs | 57.8312 KOps/s | 56.4563 KOps/s | $\color{#35bf28}+2.44\\%$ |
| test_step_mdp_speed[False-True-False-False-False] | 68.8250μs | 11.2917μs | 88.5604 KOps/s | 85.3813 KOps/s | $\color{#35bf28}+3.72\\%$ |
| test_step_mdp_speed[False-False-True-True-True] | 65.3530μs | 26.9352μs | 37.1261 KOps/s | 36.9008 KOps/s | $\color{#35bf28}+0.61\\%$ |
| test_step_mdp_speed[False-False-True-True-False] | 41.8580μs | 18.1241μs | 55.1753 KOps/s | 53.4923 KOps/s | $\color{#35bf28}+3.15\\%$ |
| test_step_mdp_speed[False-False-True-False-True] | 56.7600μs | 17.4686μs | 57.2455 KOps/s | 56.5296 KOps/s | $\color{#35bf28}+1.27\\%$ |
| test_step_mdp_speed[False-False-True-False-False] | 34.5250μs | 11.3576μs | 88.0468 KOps/s | 85.9283 KOps/s | $\color{#35bf28}+2.47\\%$ |
| test_step_mdp_speed[False-False-False-True-True] | 42.0190μs | 28.3084μs | 35.3251 KOps/s | 26.8601 KOps/s | $\textbf{\color{#35bf28}+31.52\\%}$ |
| test_step_mdp_speed[False-False-False-True-False] | 58.3600μs | 19.3066μs | 51.7958 KOps/s | 50.7974 KOps/s | $\color{#35bf28}+1.97\\%$ |
| test_step_mdp_speed[False-False-False-False-True] | 53.2900μs | 18.2481μs | 54.8002 KOps/s | 53.2239 KOps/s | $\color{#35bf28}+2.96\\%$ |
| test_step_mdp_speed[False-False-False-False-False] | 50.8250μs | 12.5588μs | 79.6253 KOps/s | 77.8421 KOps/s | $\color{#35bf28}+2.29\\%$ |
| test_values[generalized_advantage_estimate-True-True] | 12.0578ms | 9.6984ms | 103.1097 Ops/s | 106.5470 Ops/s | $\color{#d91a1a}-3.23\\%$ |
| test_values[vec_generalized_advantage_estimate-True-True] | 37.2958ms | 33.5827ms | 29.7772 Ops/s | 28.2439 Ops/s | $\textbf{\color{#35bf28}+5.43\\%}$ |
| test_values[td0_return_estimate-False-False] | 0.2204ms | 0.1691ms | 5.9127 KOps/s | 5.5624 KOps/s | $\textbf{\color{#35bf28}+6.30\\%}$ |
| test_values[td1_return_estimate-False-False] | 24.4367ms | 23.8635ms | 41.9050 Ops/s | 42.1385 Ops/s | $\color{#d91a1a}-0.55\\%$ |
| test_values[vec_td1_return_estimate-False-False] | 34.2890ms | 33.5054ms | 29.8460 Ops/s | 28.1775 Ops/s | $\textbf{\color{#35bf28}+5.92\\%}$ |
| test_values[td_lambda_return_estimate-True-False] | 37.1753ms | 34.0718ms | 29.3497 Ops/s | 29.1320 Ops/s | $\color{#35bf28}+0.75\\%$ |
| test_values[vec_td_lambda_return_estimate-True-False] | 34.3717ms | 33.5269ms | 29.8268 Ops/s | 28.1350 Ops/s | $\textbf{\color{#35bf28}+6.01\\%}$ |
| test_gae_speed[generalized_advantage_estimate-False-1-512] | 10.8436ms | 8.5277ms | 117.2655 Ops/s | 120.3497 Ops/s | $\color{#d91a1a}-2.56\\%$ |
| test_gae_speed[vec_generalized_advantage_estimate-True-1-512] | 2.1265ms | 1.8685ms | 535.1877 Ops/s | 515.9436 Ops/s | $\color{#35bf28}+3.73\\%$ |
| test_gae_speed[vec_generalized_advantage_estimate-False-1-512] | 0.4341ms | 0.3516ms | 2.8445 KOps/s | 2.8863 KOps/s | $\color{#d91a1a}-1.45\\%$ |
| test_gae_speed[vec_generalized_advantage_estimate-True-32-512] | 45.2625ms | 44.2174ms | 22.6155 Ops/s | 21.6929 Ops/s | $\color{#35bf28}+4.25\\%$ |
| test_gae_speed[vec_generalized_advantage_estimate-False-32-512] | 3.5937ms | 3.0399ms | 328.9621 Ops/s | 330.8133 Ops/s | $\color{#d91a1a}-0.56\\%$ |
| test_dqn_speed | 1.8057ms | 1.3137ms | 761.2101 Ops/s | 739.7779 Ops/s | $\color{#35bf28}+2.90\\%$ |
| test_ddpg_speed | 3.0697ms | 2.8032ms | 356.7342 Ops/s | 348.4421 Ops/s | $\color{#35bf28}+2.38\\%$ |
| test_sac_speed | 9.5437ms | 8.3002ms | 120.4794 Ops/s | 115.6984 Ops/s | $\color{#35bf28}+4.13\\%$ |
| test_redq_speed | 13.8142ms | 13.2296ms | 75.5881 Ops/s | 76.1562 Ops/s | $\color{#d91a1a}-0.75\\%$ |
| test_redq_deprec_speed | 15.4172ms | 13.3708ms | 74.7896 Ops/s | 74.7745 Ops/s | $\color{#35bf28}+0.02\\%$ |
| test_td3_speed | 8.4256ms | 8.2086ms | 121.8233 Ops/s | 117.4037 Ops/s | $\color{#35bf28}+3.76\\%$ |
| test_cql_speed | 37.7098ms | 36.4579ms | 27.4289 Ops/s | 27.3792 Ops/s | $\color{#35bf28}+0.18\\%$ |
| test_a2c_speed | 8.1285ms | 7.4553ms | 134.1320 Ops/s | 134.3177 Ops/s | $\color{#d91a1a}-0.14\\%$ |
| test_ppo_speed | 9.1172ms | 7.7171ms | 129.5831 Ops/s | 129.9359 Ops/s | $\color{#d91a1a}-0.27\\%$ |
| test_reinforce_speed | 7.3752ms | 6.6290ms | 150.8530 Ops/s | 150.5917 Ops/s | $\color{#35bf28}+0.17\\%$ |
| test_iql_speed | 33.7329ms | 32.6757ms | 30.6038 Ops/s | 30.5080 Ops/s | $\color{#35bf28}+0.31\\%$ |
| test_rb_sample[TensorDictReplayBuffer-ListStorage-RandomSampler-4000] | 3.8562ms | 3.5203ms | 284.0637 Ops/s | 291.5949 Ops/s | $\color{#d91a1a}-2.58\\%$ |
| test_rb_sample[TensorDictReplayBuffer-LazyMemmapStorage-RandomSampler-10000] | 0.9525ms | 0.4949ms | 2.0207 KOps/s | 1.9187 KOps/s | $\textbf{\color{#35bf28}+5.32\\%}$ |
| test_rb_sample[TensorDictReplayBuffer-LazyTensorStorage-RandomSampler-10000] | 0.7620ms | 0.4719ms | 2.1190 KOps/s | 2.1129 KOps/s | $\color{#35bf28}+0.29\\%$ |
| test_rb_sample[TensorDictReplayBuffer-ListStorage-SamplerWithoutReplacement-4000] | 3.8689ms | 3.4414ms | 290.5762 Ops/s | 296.3669 Ops/s | $\color{#d91a1a}-1.95\\%$ |
| test_rb_sample[TensorDictReplayBuffer-LazyMemmapStorage-SamplerWithoutReplacement-10000] | 1.1808ms | 0.4894ms | 2.0435 KOps/s | 2.0260 KOps/s | $\color{#35bf28}+0.86\\%$ |
| test_rb_sample[TensorDictReplayBuffer-LazyTensorStorage-SamplerWithoutReplacement-10000] | 0.8313ms | 0.4656ms | 2.1479 KOps/s | 2.1340 KOps/s | $\color{#35bf28}+0.65\\%$ |
| test_rb_sample[TensorDictReplayBuffer-LazyMemmapStorage-sampler6-10000] | 1.8562ms | 1.6835ms | 594.0028 Ops/s | 588.5154 Ops/s | $\color{#35bf28}+0.93\\%$ |
| test_rb_sample[TensorDictReplayBuffer-LazyTensorStorage-sampler7-10000] | 5.8767ms | 1.6752ms | 596.9592 Ops/s | 624.7120 Ops/s | $\color{#d91a1a}-4.44\\%$ |
| test_rb_sample[TensorDictPrioritizedReplayBuffer-ListStorage-None-4000] | 4.0163ms | 3.6317ms | 275.3516 Ops/s | 283.5997 Ops/s | $\color{#d91a1a}-2.91\\%$ |
| test_rb_sample[TensorDictPrioritizedReplayBuffer-LazyMemmapStorage-None-10000] | 0.8957ms | 0.6077ms | 1.6457 KOps/s | 1.4537 KOps/s | $\textbf{\color{#35bf28}+13.21\\%}$ |
| test_rb_sample[TensorDictPrioritizedReplayBuffer-LazyTensorStorage-None-10000] | 0.9469ms | 0.6121ms | 1.6337 KOps/s | 1.7038 KOps/s | $\color{#d91a1a}-4.12\\%$ |
| test_rb_iterate[TensorDictReplayBuffer-ListStorage-RandomSampler-4000] | 3.6729ms | 3.5188ms | 284.1882 Ops/s | 295.7819 Ops/s | $\color{#d91a1a}-3.92\\%$ |
| test_rb_iterate[TensorDictReplayBuffer-LazyMemmapStorage-RandomSampler-10000] | 0.6241ms | 0.4963ms | 2.0150 KOps/s | 1.9907 KOps/s | $\color{#35bf28}+1.22\\%$ |
| test_rb_iterate[TensorDictReplayBuffer-LazyTensorStorage-RandomSampler-10000] | 1.2025ms | 0.4803ms | 2.0822 KOps/s | 2.0988 KOps/s | $\color{#d91a1a}-0.79\\%$ |
| test_rb_iterate[TensorDictReplayBuffer-ListStorage-SamplerWithoutReplacement-4000] | 3.9169ms | 3.5734ms | 279.8423 Ops/s | 298.6102 Ops/s | $\textbf{\color{#d91a1a}-6.29\\%}$ |
| test_rb_iterate[TensorDictReplayBuffer-LazyMemmapStorage-SamplerWithoutReplacement-10000] | 1.0338ms | 0.4867ms | 2.0547 KOps/s | 2.0391 KOps/s | $\color{#35bf28}+0.77\\%$ |
| test_rb_iterate[TensorDictReplayBuffer-LazyTensorStorage-SamplerWithoutReplacement-10000] | 0.7904ms | 0.4689ms | 2.1325 KOps/s | 2.1014 KOps/s | $\color{#35bf28}+1.48\\%$ |
| test_rb_iterate[TensorDictPrioritizedReplayBuffer-ListStorage-None-4000] | 3.9345ms | 3.6907ms | 270.9542 Ops/s | 286.2378 Ops/s | $\textbf{\color{#d91a1a}-5.34\\%}$ |
| test_rb_iterate[TensorDictPrioritizedReplayBuffer-LazyMemmapStorage-None-10000] | 0.9495ms | 0.6151ms | 1.6257 KOps/s | 1.6228 KOps/s | $\color{#35bf28}+0.18\\%$ |
| test_rb_iterate[TensorDictPrioritizedReplayBuffer-LazyTensorStorage-None-10000] | 3.9971ms | 0.5894ms | 1.6966 KOps/s | 1.6907 KOps/s | $\color{#35bf28}+0.35\\%$ |
| test_rb_populate[TensorDictReplayBuffer-ListStorage-RandomSampler-400] | 0.1001s | 7.8698ms | 127.0684 Ops/s | 133.7878 Ops/s | $\textbf{\color{#d91a1a}-5.02\\%}$ |
| test_rb_populate[TensorDictReplayBuffer-LazyMemmapStorage-RandomSampler-400] | 14.0244ms | 12.0853ms | 82.7454 Ops/s | 78.5992 Ops/s | $\textbf{\color{#35bf28}+5.28\\%}$ |
| test_rb_populate[TensorDictReplayBuffer-LazyTensorStorage-RandomSampler-400] | 1.5295ms | 1.0381ms | 963.3048 Ops/s | 952.2411 Ops/s | $\color{#35bf28}+1.16\\%$ |
| test_rb_populate[TensorDictReplayBuffer-ListStorage-SamplerWithoutReplacement-400] | 92.6896ms | 7.2269ms | 138.3710 Ops/s | 182.9572 Ops/s | $\textbf{\color{#d91a1a}-24.37\\%}$ |
| test_rb_populate[TensorDictReplayBuffer-LazyMemmapStorage-SamplerWithoutReplacement-400] | 14.7396ms | 12.1825ms | 82.0852 Ops/s | 79.5937 Ops/s | $\color{#35bf28}+3.13\\%$ |
| test_rb_populate[TensorDictReplayBuffer-LazyTensorStorage-SamplerWithoutReplacement-400] | 1.5912ms | 1.0422ms | 959.5006 Ops/s | 901.9687 Ops/s | $\textbf{\color{#35bf28}+6.38\\%}$ |
| test_rb_populate[TensorDictPrioritizedReplayBuffer-ListStorage-None-400] | 91.4655ms | 5.5483ms | 180.2363 Ops/s | 139.5870 Ops/s | $\textbf{\color{#35bf28}+29.12\\%}$ |
| test_rb_populate[TensorDictPrioritizedReplayBuffer-LazyMemmapStorage-None-400] | 14.5767ms | 12.3381ms | 81.0495 Ops/s | 78.7748 Ops/s | $\color{#35bf28}+2.89\\%$ |
| test_rb_populate[TensorDictPrioritizedReplayBuffer-LazyTensorStorage-None-400] | 1.7043ms | 1.1932ms | 838.0811 Ops/s | 777.5394 Ops/s | $\textbf{\color{#35bf28}+7.79\\%}$ |
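
A note on reading the Change column: the values are consistent with the relative difference between "Ops" and "Ops on Repo `HEAD`". The sketch below reproduces that arithmetic. The 5% cut-off for counting a benchmark as improved or worsened is an assumption inferred from which rows are bolded and from the summary counts; it is not stated anywhere in this thread, and the helper names are hypothetical.

```python
def relative_change(ops: float, ops_head: float) -> float:
    """Percentage change in throughput of the PR relative to the repo HEAD."""
    return (ops - ops_head) / ops_head * 100.0


def classify(change_pct: float, threshold_pct: float = 5.0) -> str:
    """Label a change; the 5% threshold is an inferred assumption, not a documented value."""
    if change_pct > threshold_pct:
        return "improved"
    if change_pct < -threshold_pct:
        return "worsened"
    return "within noise"


if __name__ == "__main__":
    # (name, Ops/s on this PR, Ops/s on HEAD), copied from the CPU results
    rows = [
        ("test_single", 17.2825, 17.8609),
        ("test_sync", 28.8597, 32.1993),
    ]
    for name, ops, ops_head in rows:
        change = relative_change(ops, ops_head)
        print(f"{name}: {change:+.2f}% ({classify(change)})")
```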

github-actions[bot] commented 1 month ago

$\color{#D29922}\textsf{\Large\⚠\kern{0.2cm}\normalsize Warning}$ Result of GPU Benchmark Tests

Total Benchmarks: 94. Improved: $\large\color{#35bf28}0$. Worsened: $\large\color{#d91a1a}3$.


| Name | Max | Mean | Ops | Ops on Repo `HEAD` | Change |
| --- | --- | --- | --- | --- | --- |
| test_single | 0.1247s | 0.1219s | 8.2066 Ops/s | 7.9619 Ops/s | $\color{#35bf28}+3.07\\%$ |
| test_sync | 99.4821ms | 97.5342ms | 10.2528 Ops/s | 9.7877 Ops/s | $\color{#35bf28}+4.75\\%$ |
| test_async | 0.2012s | 0.1015s | 9.8526 Ops/s | 12.2151 Ops/s | $\textbf{\color{#d91a1a}-19.34\\%}$ |
| test_single_pixels | 0.1322s | 0.1301s | 7.6857 Ops/s | 7.6848 Ops/s | $\color{#35bf28}+0.01\\%$ |
| test_sync_pixels | 84.5871ms | 81.4647ms | 12.2753 Ops/s | 12.2654 Ops/s | $\color{#35bf28}+0.08\\%$ |
| test_async_pixels | 0.1534s | 69.6501ms | 14.3575 Ops/s | 14.4305 Ops/s | $\color{#d91a1a}-0.51\\%$ |
| test_simple | 0.8987s | 0.8377s | 1.1938 Ops/s | 1.2080 Ops/s | $\color{#d91a1a}-1.18\\%$ |
| test_transformed | 1.1691s | 1.1078s | 0.9027 Ops/s | 0.9246 Ops/s | $\color{#d91a1a}-2.37\\%$ |
| test_serial | 2.6078s | 2.5460s | 0.3928 Ops/s | 0.3906 Ops/s | $\color{#35bf28}+0.55\\%$ |
| test_parallel | 2.4279s | 2.3704s | 0.4219 Ops/s | 0.4216 Ops/s | $\color{#35bf28}+0.07\\%$ |
| test_step_mdp_speed[True-True-True-True-True] | 0.1042ms | 34.3438μs | 29.1173 KOps/s | 30.3251 KOps/s | $\color{#d91a1a}-3.98\\%$ |
| test_step_mdp_speed[True-True-True-True-False] | 47.0310μs | 20.0846μs | 49.7895 KOps/s | 50.6616 KOps/s | $\color{#d91a1a}-1.72\\%$ |
| test_step_mdp_speed[True-True-True-False-True] | 46.8110μs | 19.7952μs | 50.5174 KOps/s | 53.2348 KOps/s | $\textbf{\color{#d91a1a}-5.10\\%}$ |
| test_step_mdp_speed[True-True-True-False-False] | 33.1100μs | 11.3293μs | 88.2670 KOps/s | 88.5778 KOps/s | $\color{#d91a1a}-0.35\\%$ |
| test_step_mdp_speed[True-True-False-True-True] | 53.3810μs | 35.8582μs | 27.8877 KOps/s | 28.6471 KOps/s | $\color{#d91a1a}-2.65\\%$ |
| test_step_mdp_speed[True-True-False-True-False] | 92.6510μs | 21.9436μs | 45.5715 KOps/s | 46.4418 KOps/s | $\color{#d91a1a}-1.87\\%$ |
| test_step_mdp_speed[True-True-False-False-True] | 47.3600μs | 21.5297μs | 46.4474 KOps/s | 47.6835 KOps/s | $\color{#d91a1a}-2.59\\%$ |
| test_step_mdp_speed[True-True-False-False-False] | 31.6910μs | 13.4313μs | 74.4530 KOps/s | 76.5390 KOps/s | $\color{#d91a1a}-2.73\\%$ |
| test_step_mdp_speed[True-False-True-True-True] | 62.5310μs | 37.8340μs | 26.4312 KOps/s | 27.3187 KOps/s | $\color{#d91a1a}-3.25\\%$ |
| test_step_mdp_speed[True-False-True-True-False] | 45.6120μs | 23.8456μs | 41.9364 KOps/s | 42.8202 KOps/s | $\color{#d91a1a}-2.06\\%$ |
| test_step_mdp_speed[True-False-True-False-True] | 47.6710μs | 21.4381μs | 46.6459 KOps/s | 47.7618 KOps/s | $\color{#d91a1a}-2.34\\%$ |
| test_step_mdp_speed[True-False-True-False-False] | 32.7000μs | 13.3613μs | 74.8430 KOps/s | 76.1968 KOps/s | $\color{#d91a1a}-1.78\\%$ |
| test_step_mdp_speed[True-False-False-True-True] | 76.3720μs | 39.0205μs | 25.6275 KOps/s | 25.9224 KOps/s | $\color{#d91a1a}-1.14\\%$ |
| test_step_mdp_speed[True-False-False-True-False] | 52.6110μs | 25.6643μs | 38.9646 KOps/s | 39.5361 KOps/s | $\color{#d91a1a}-1.45\\%$ |
| test_step_mdp_speed[True-False-False-False-True] | 97.6830μs | 22.9926μs | 43.4922 KOps/s | 43.9993 KOps/s | $\color{#d91a1a}-1.15\\%$ |
| test_step_mdp_speed[True-False-False-False-False] | 38.2310μs | 15.2051μs | 65.7674 KOps/s | 66.6305 KOps/s | $\color{#d91a1a}-1.30\\%$ |
| test_step_mdp_speed[False-True-True-True-True] | 57.0110μs | 37.0565μs | 26.9858 KOps/s | 26.9680 KOps/s | $\color{#35bf28}+0.07\\%$ |
| test_step_mdp_speed[False-True-True-True-False] | 47.0620μs | 23.5977μs | 42.3771 KOps/s | 42.3610 KOps/s | $\color{#35bf28}+0.04\\%$ |
| test_step_mdp_speed[False-True-True-False-True] | 39.8000μs | 25.5388μs | 39.1561 KOps/s | 39.9572 KOps/s | $\color{#d91a1a}-2.01\\%$ |
| test_step_mdp_speed[False-True-True-False-False] | 31.5210μs | 15.0657μs | 66.3758 KOps/s | 66.9541 KOps/s | $\color{#d91a1a}-0.86\\%$ |
| test_step_mdp_speed[False-True-False-True-True] | 78.0610μs | 39.2871μs | 25.4536 KOps/s | 25.9605 KOps/s | $\color{#d91a1a}-1.95\\%$ |
| test_step_mdp_speed[False-True-False-True-False] | 48.7110μs | 25.4869μs | 39.2359 KOps/s | 39.6686 KOps/s | $\color{#d91a1a}-1.09\\%$ |
| test_step_mdp_speed[False-True-False-False-True] | 51.7100μs | 27.2187μs | 36.7395 KOps/s | 37.2501 KOps/s | $\color{#d91a1a}-1.37\\%$ |
| test_step_mdp_speed[False-True-False-False-False] | 40.5410μs | 16.8886μs | 59.2114 KOps/s | 59.0581 KOps/s | $\color{#35bf28}+0.26\\%$ |
| test_step_mdp_speed[False-False-True-True-True] | 59.4200μs | 41.1592μs | 24.2959 KOps/s | 24.8247 KOps/s | $\color{#d91a1a}-2.13\\%$ |
| test_step_mdp_speed[False-False-True-True-False] | 51.2710μs | 27.5354μs | 36.3169 KOps/s | 36.5849 KOps/s | $\color{#d91a1a}-0.73\\%$ |
| test_step_mdp_speed[False-False-True-False-True] | 55.7510μs | 28.0172μs | 35.6924 KOps/s | 36.6941 KOps/s | $\color{#d91a1a}-2.73\\%$ |
| test_step_mdp_speed[False-False-True-False-False] | 37.3200μs | 17.1487μs | 58.3136 KOps/s | 58.8771 KOps/s | $\color{#d91a1a}-0.96\\%$ |
| test_step_mdp_speed[False-False-False-True-True] | 58.1020μs | 43.7889μs | 22.8368 KOps/s | 23.0226 KOps/s | $\color{#d91a1a}-0.81\\%$ |
| test_step_mdp_speed[False-False-False-True-False] | 54.7920μs | 29.4428μs | 33.9641 KOps/s | 33.6025 KOps/s | $\color{#35bf28}+1.08\\%$ |
| test_step_mdp_speed[False-False-False-False-True] | 46.8800μs | 28.9329μs | 34.5628 KOps/s | 34.6461 KOps/s | $\color{#d91a1a}-0.24\\%$ |
| test_step_mdp_speed[False-False-False-False-False] | 40.2610μs | 18.7618μs | 53.2999 KOps/s | 53.1594 KOps/s | $\color{#35bf28}+0.26\\%$ |
| test_values[generalized_advantage_estimate-True-True] | 27.2343ms | 26.0084ms | 38.4492 Ops/s | 37.9881 Ops/s | $\color{#35bf28}+1.21\\%$ |
| test_values[vec_generalized_advantage_estimate-True-True] | 89.5652ms | 2.7031ms | 369.9388 Ops/s | 375.2375 Ops/s | $\color{#d91a1a}-1.41\\%$ |
| test_values[td0_return_estimate-False-False] | 90.4710μs | 68.4001μs | 14.6199 KOps/s | 14.7702 KOps/s | $\color{#d91a1a}-1.02\\%$ |
| test_values[td1_return_estimate-False-False] | 60.8827ms | 57.5631ms | 17.3722 Ops/s | 16.7169 Ops/s | $\color{#35bf28}+3.92\\%$ |
| test_values[vec_td1_return_estimate-False-False] | 1.3054ms | 1.0998ms | 909.2529 Ops/s | 903.7523 Ops/s | $\color{#35bf28}+0.61\\%$ |
| test_values[td_lambda_return_estimate-True-False] | 97.1568ms | 91.9891ms | 10.8709 Ops/s | 11.0462 Ops/s | $\color{#d91a1a}-1.59\\%$ |
| test_values[vec_td_lambda_return_estimate-True-False] | 1.2597ms | 1.0977ms | 910.9856 Ops/s | 909.0890 Ops/s | $\color{#35bf28}+0.21\\%$ |
| test_gae_speed[generalized_advantage_estimate-False-1-512] | 25.8896ms | 25.6867ms | 38.9307 Ops/s | 38.4613 Ops/s | $\color{#35bf28}+1.22\\%$ |
| test_gae_speed[vec_generalized_advantage_estimate-True-1-512] | 0.9914ms | 0.7398ms | 1.3517 KOps/s | 1.3459 KOps/s | $\color{#35bf28}+0.43\\%$ |
| test_gae_speed[vec_generalized_advantage_estimate-False-1-512] | 0.7577ms | 0.6830ms | 1.4642 KOps/s | 1.4594 KOps/s | $\color{#35bf28}+0.33\\%$ |
| test_gae_speed[vec_generalized_advantage_estimate-True-32-512] | 1.5657ms | 1.4914ms | 670.4977 Ops/s | 672.5191 Ops/s | $\color{#d91a1a}-0.30\\%$ |
| test_gae_speed[vec_generalized_advantage_estimate-False-32-512] | 0.7710ms | 0.7337ms | 1.3629 KOps/s | 1.4300 KOps/s | $\color{#d91a1a}-4.69\\%$ |
| test_dqn_speed | 1.8492ms | 1.5069ms | 663.6203 Ops/s | 675.0348 Ops/s | $\color{#d91a1a}-1.69\\%$ |
| test_ddpg_speed | 3.1779ms | 3.0525ms | 327.6003 Ops/s | 328.4542 Ops/s | $\color{#d91a1a}-0.26\\%$ |
| test_sac_speed | 9.0255ms | 8.7509ms | 114.2737 Ops/s | 115.8512 Ops/s | $\color{#d91a1a}-1.36\\%$ |
| test_redq_speed | 12.4803ms | 10.8554ms | 92.1204 Ops/s | 92.1711 Ops/s | $\color{#d91a1a}-0.06\\%$ |
| test_redq_deprec_speed | 12.4117ms | 11.6500ms | 85.8367 Ops/s | 81.8555 Ops/s | $\color{#35bf28}+4.86\\%$ |
| test_td3_speed | 8.8584ms | 8.6303ms | 115.8710 Ops/s | 116.3523 Ops/s | $\color{#d91a1a}-0.41\\%$ |
| test_cql_speed | 27.8610ms | 26.3357ms | 37.9713 Ops/s | 38.0951 Ops/s | $\color{#d91a1a}-0.32\\%$ |
| test_a2c_speed | 6.1794ms | 5.6258ms | 177.7529 Ops/s | 172.2162 Ops/s | $\color{#35bf28}+3.21\\%$ |
| test_ppo_speed | 6.5330ms | 5.9797ms | 167.2322 Ops/s | 162.9063 Ops/s | $\color{#35bf28}+2.66\\%$ |
| test_reinforce_speed | 5.3603ms | 4.6012ms | 217.3324 Ops/s | 208.2109 Ops/s | $\color{#35bf28}+4.38\\%$ |
| test_iql_speed | 20.5949ms | 19.9497ms | 50.1260 Ops/s | 49.4915 Ops/s | $\color{#35bf28}+1.28\\%$ |
| test_rb_sample[TensorDictReplayBuffer-ListStorage-RandomSampler-4000] | 5.1486ms | 4.8638ms | 205.5987 Ops/s | 204.6035 Ops/s | $\color{#35bf28}+0.49\\%$ |
| test_rb_sample[TensorDictReplayBuffer-LazyMemmapStorage-RandomSampler-10000] | 0.7258ms | 0.5987ms | 1.6704 KOps/s | 1.6703 KOps/s | $+0.00\\%$ |
| test_rb_sample[TensorDictReplayBuffer-LazyTensorStorage-RandomSampler-10000] | 4.4106ms | 0.5812ms | 1.7206 KOps/s | 1.7465 KOps/s | $\color{#d91a1a}-1.48\\%$ |
| test_rb_sample[TensorDictReplayBuffer-ListStorage-SamplerWithoutReplacement-4000] | 5.0072ms | 4.8062ms | 208.0634 Ops/s | 205.5567 Ops/s | $\color{#35bf28}+1.22\\%$ |
| test_rb_sample[TensorDictReplayBuffer-LazyMemmapStorage-SamplerWithoutReplacement-10000] | 0.7070ms | 0.5909ms | 1.6923 KOps/s | 1.6770 KOps/s | $\color{#35bf28}+0.91\\%$ |
| test_rb_sample[TensorDictReplayBuffer-LazyTensorStorage-SamplerWithoutReplacement-10000] | 4.4329ms | 0.5703ms | 1.7535 KOps/s | 1.7575 KOps/s | $\color{#d91a1a}-0.23\\%$ |
| test_rb_sample[TensorDictReplayBuffer-LazyMemmapStorage-sampler6-10000] | 2.3754ms | 2.1413ms | 467.0046 Ops/s | 466.2946 Ops/s | $\color{#35bf28}+0.15\\%$ |
| test_rb_sample[TensorDictReplayBuffer-LazyTensorStorage-sampler7-10000] | 5.8159ms | 2.0457ms | 488.8358 Ops/s | 491.7983 Ops/s | $\color{#d91a1a}-0.60\\%$ |
| test_rb_sample[TensorDictPrioritizedReplayBuffer-ListStorage-None-4000] | 5.1108ms | 4.9754ms | 200.9893 Ops/s | 199.6342 Ops/s | $\color{#35bf28}+0.68\\%$ |
| test_rb_sample[TensorDictPrioritizedReplayBuffer-LazyMemmapStorage-None-10000] | 1.4664ms | 0.7283ms | 1.3731 KOps/s | 1.3762 KOps/s | $\color{#d91a1a}-0.22\\%$ |
| test_rb_sample[TensorDictPrioritizedReplayBuffer-LazyTensorStorage-None-10000] | 0.8874ms | 0.7048ms | 1.4188 KOps/s | 1.4154 KOps/s | $\color{#35bf28}+0.24\\%$ |
| test_rb_iterate[TensorDictReplayBuffer-ListStorage-RandomSampler-4000] | 5.0340ms | 4.8781ms | 204.9971 Ops/s | 204.7488 Ops/s | $\color{#35bf28}+0.12\\%$ |
| test_rb_iterate[TensorDictReplayBuffer-LazyMemmapStorage-RandomSampler-10000] | 1.3017ms | 0.6022ms | 1.6605 KOps/s | 1.6660 KOps/s | $\color{#d91a1a}-0.34\\%$ |
| test_rb_iterate[TensorDictReplayBuffer-LazyTensorStorage-RandomSampler-10000] | 0.6878ms | 0.5762ms | 1.7354 KOps/s | 1.7238 KOps/s | $\color{#35bf28}+0.67\\%$ |
| test_rb_iterate[TensorDictReplayBuffer-ListStorage-SamplerWithoutReplacement-4000] | 5.0378ms | 4.8468ms | 206.3227 Ops/s | 205.3881 Ops/s | $\color{#35bf28}+0.46\\%$ |
| test_rb_iterate[TensorDictReplayBuffer-LazyMemmapStorage-SamplerWithoutReplacement-10000] | 0.7085ms | 0.5926ms | 1.6874 KOps/s | 1.6753 KOps/s | $\color{#35bf28}+0.72\\%$ |
| test_rb_iterate[TensorDictReplayBuffer-LazyTensorStorage-SamplerWithoutReplacement-10000] | 0.7397ms | 0.5691ms | 1.7571 KOps/s | 1.7436 KOps/s | $\color{#35bf28}+0.78\\%$ |
| test_rb_iterate[TensorDictPrioritizedReplayBuffer-ListStorage-None-4000] | 5.0843ms | 4.9830ms | 200.6806 Ops/s | 199.8634 Ops/s | $\color{#35bf28}+0.41\\%$ |
| test_rb_iterate[TensorDictPrioritizedReplayBuffer-LazyMemmapStorage-None-10000] | 1.4396ms | 0.7280ms | 1.3737 KOps/s | 1.3687 KOps/s | $\color{#35bf28}+0.36\\%$ |
| test_rb_iterate[TensorDictPrioritizedReplayBuffer-LazyTensorStorage-None-10000] | 0.8708ms | 0.7048ms | 1.4188 KOps/s | 1.4139 KOps/s | $\color{#35bf28}+0.35\\%$ |
| test_rb_populate[TensorDictReplayBuffer-ListStorage-RandomSampler-400] | 0.1132s | 9.2914ms | 107.6263 Ops/s | 106.2519 Ops/s | $\color{#35bf28}+1.29\\%$ |
| test_rb_populate[TensorDictReplayBuffer-LazyMemmapStorage-RandomSampler-400] | 20.9692ms | 17.0166ms | 58.7663 Ops/s | 59.0157 Ops/s | $\color{#d91a1a}-0.42\\%$ |
| test_rb_populate[TensorDictReplayBuffer-LazyTensorStorage-RandomSampler-400] | 2.3498ms | 1.3488ms | 741.4260 Ops/s | 736.6960 Ops/s | $\color{#35bf28}+0.64\\%$ |
| test_rb_populate[TensorDictReplayBuffer-ListStorage-SamplerWithoutReplacement-400] | 0.1051s | 7.1471ms | 139.9167 Ops/s | 139.4236 Ops/s | $\color{#35bf28}+0.35\\%$ |
| test_rb_populate[TensorDictReplayBuffer-LazyMemmapStorage-SamplerWithoutReplacement-400] | 19.4005ms | 16.8990ms | 59.1751 Ops/s | 59.7428 Ops/s | $\color{#d91a1a}-0.95\\%$ |
| test_rb_populate[TensorDictReplayBuffer-LazyTensorStorage-SamplerWithoutReplacement-400] | 6.8648ms | 1.4606ms | 684.6658 Ops/s | 734.3014 Ops/s | $\textbf{\color{#d91a1a}-6.76\\%}$ |
| test_rb_populate[TensorDictPrioritizedReplayBuffer-ListStorage-None-400] | 0.1055s | 9.3444ms | 107.0160 Ops/s | 106.4787 Ops/s | $\color{#35bf28}+0.50\\%$ |
| test_rb_populate[TensorDictPrioritizedReplayBuffer-LazyMemmapStorage-None-400] | 19.3551ms | 16.7573ms | 59.6756 Ops/s | 57.9838 Ops/s | $\color{#35bf28}+2.92\\%$ |
| test_rb_populate[TensorDictPrioritizedReplayBuffer-LazyTensorStorage-None-400] | 2.5454ms | 1.5052ms | 664.3746 Ops/s | 661.6923 Ops/s | $\color{#35bf28}+0.41\\%$ |

shagunsodhani commented 4 weeks ago

> cc @teopir
>
> cc @shagunsodhani this is a good example of preallocation with tensordict. We were using a lot of lazy stacks and only stacking at the last minute. Using a preallocated TD instead (create an empty td -> get a bunch of views of that td -> write to the first view, and all the views get instantiated at once) made the whole thing 20-1000x faster!

Awesome <3
