We could try disabling horizontal fusion (which may be what is increasing memory usage). Replace this line with return False:
https://github.com/pytorch/torchdynamo/blob/0c96ddbad26216671861762c161ebcfb0fdbeb9f/torchinductor/scheduler.py#L913
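For reference, a rough sketch of what that experiment looks like. The function name, signature, and dependency test below are assumptions for illustration; the real code at the linked line may differ at that commit:

```python
# torchinductor/scheduler.py -- illustrative sketch only, not the real source.
# "can_fuse", "can_fuse_vertical", and the ancestor test are assumed names/logic.

def can_fuse(self, node1, node2):
    if node1.get_names() & node2.ancestors:
        # node2 consumes node1's outputs: a vertical (producer/consumer)
        # fusion, which removes an intermediate buffer. Keep these.
        return self.can_fuse_vertical(node1, node2)
    # Otherwise the candidates are independent and would be fused
    # horizontally (shared reads, no data dependency). The experiment is to
    # replace the heuristic on this path with:
    return False
```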
Yes, it's because of horizontal fusion. The footprint dropped drastically, and is now even better than eager:
Start of file: 0.00 GB
Inputs setup: 8.43 GB
make_fx done: 10.50 GB
------
Before running eager: 8.43 GB
After running eager: 10.50 GB
Increase in peak memory in eager: 2.07 GB
------
Before running inductor: 8.43 GB
inductor compiler wrapped: 9.46 GB
after running inductor: 9.46 GB
Increase in peak memory in inductor: 1.03 GB
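For context, numbers like the ones above can be gathered with the standard CUDA memory stats. A minimal sketch (the actual instrumentation in the benchmark script may differ):

```python
import torch

GB = 1024 ** 3

def measure_peak_increase(tag, fn):
    # Reset the peak counter down to the currently allocated amount, run the
    # workload, then report how much higher the peak went.
    torch.cuda.synchronize()
    torch.cuda.reset_peak_memory_stats()
    before = torch.cuda.max_memory_allocated() / GB
    print(f"Before running {tag}: {before:.2f} GB")
    fn()
    torch.cuda.synchronize()
    after = torch.cuda.max_memory_allocated() / GB
    print(f"After running {tag}: {after:.2f} GB")
    print(f"Increase in peak memory in {tag}: {after - before:.2f} GB")
```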
What should we do here? Heuristic? I guess this is a standard tradeoff problem.
Yeah, it is a performance versus memory issue. Horizontal fusion reduces total memory reads, so it will make things go faster. Another knob to try is torchinductor.config.max_fusion_size (currently 64); perhaps there are some mega-fusions that are bloating memory usage in some models.

@jansel I think we also just need to investigate some more memory optimization passes :) Most trivially, we should probably do a pass at the end that reorders scheduler nodes to minimize peak memory (I think we have all the needed info). A rough sketch of such a pass is included after the list below.
- We could have a "low memory mode" where we disable horizontal fusions. This is immediately actionable to fix some OOMs when doing accuracy testing in our CI and nightly. I can work on it.
- We could implement some sort of fancy analysis to reject fusions given some sort of memory score.
@stumpOS, this might be interesting to you.
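A rough sketch of that reordering pass. This is not Inductor code: the node representation is a simplified stand-in, the graph is assumed acyclic, and real buffers would also involve aliasing, mutation, and graph outputs. The greedy rule is to pick, among the nodes whose inputs are ready, the one whose execution grows live memory the least:

```python
from collections import defaultdict

def reorder_for_peak_memory(nodes):
    """Greedy sketch: choose a topologically valid order that tries to keep
    live intermediate memory small at every step.

    nodes: list of dicts like
        {"name": "buf2", "size": 4 << 20, "reads": ["buf0", "buf1"]}
    where "size" is the byte size of the buffer this node writes and
    "reads" lists buffers written by other nodes (graph inputs are ignored).
    Returns (order, simulated_peak_bytes).
    """
    by_name = {n["name"]: n for n in nodes}
    consumers = defaultdict(set)
    for n in nodes:
        for r in n["reads"]:
            if r in by_name:
                consumers[r].add(n["name"])

    scheduled, order = set(), []
    live = 0   # bytes of intermediate buffers currently resident
    peak = 0

    def freed_by(n):
        # Inputs whose every consumer will have run once n runs can be freed.
        return sum(
            by_name[r]["size"]
            for r in set(n["reads"])
            if r in by_name and consumers[r] <= scheduled | {n["name"]}
        )

    while len(order) < len(nodes):
        ready = [
            n for n in nodes
            if n["name"] not in scheduled
            and all(r in scheduled or r not in by_name for r in n["reads"])
        ]
        # Pick the ready node with the smallest net growth of live memory.
        best = min(ready, key=lambda n: n["size"] - freed_by(n))
        peak = max(peak, live + best["size"])  # output plus still-live inputs
        live += best["size"] - freed_by(best)
        scheduled.add(best["name"])
        order.append(best["name"])

    return order, peak
```

The point is only that the dependency edges and buffer sizes the scheduler already tracks are enough to drive such a pass; a real implementation would also have to respect fusion groups and kernel launch order constraints.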
Unable to repro:
------
Before running eager: 0.15 GB
After running eager: 11.09 GB
Increase in peak memory in eager: 10.94 GB
------
Before running inductor: 0.15 GB
inductor compiler wrapped: 8.88 GB
after running inductor: 8.88 GB
Increase in peak memory in inductor: 8.73 GB
Script: https://pastebin.com/KnAM4di3
This is no longer observed. Closing the task.
For volo_d1_224, I was trying to figure out where the memory footprint goes. One known issue is that Python holds references to the inputs of the backward pass due to how AOT Autograd works.
So, I used the aot_inductor_debug backend, which uses the same decomps and partitioner as Inductor. I was hoping to see the same increase in memory footprint as with Inductor if the extra Python references were the culprit, but 25% (out of the 40% increase) was not accounted for.

So, I dumped the backward graph and manually checked the peak memory of eager vs Inductor. I observed that eager raises the peak memory by 2 GB, while Inductor raises it by 7 GB.
The repro script is here - https://gist.github.com/anijain2305/d1b3dfbcae25a854a5e5640bd63d1171
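The full repro is in the gist above; the shape of that comparison is roughly the sketch below. The tiny model is a stand-in for volo_d1_224, and the old standalone torchdynamo.optimize API is assumed, since this issue predates torch.compile:

```python
import torch
import torchdynamo  # standalone pytorch/torchdynamo, as used at the time

GB = 1024 ** 3

# Stand-in model/inputs; the real measurement used volo_d1_224 from the
# timm benchmark suite.
model = torch.nn.Sequential(
    torch.nn.Linear(1024, 4096), torch.nn.GELU(), torch.nn.Linear(4096, 1024)
).cuda()
inputs = (torch.randn(256, 1024, device="cuda"),)

def train_step():
    model.zero_grad(set_to_none=True)
    model(*inputs).sum().backward()

def peak_increase(fn):
    torch.cuda.synchronize()
    torch.cuda.reset_peak_memory_stats()
    before = torch.cuda.max_memory_allocated()
    fn()
    torch.cuda.synchronize()
    return (torch.cuda.max_memory_allocated() - before) / GB

print("eager:", peak_increase(train_step), "GB")

# Same decomps/partitioner as Inductor but eager kernels: isolates the cost
# of the extra Python references held across the forward/backward split.
with torchdynamo.optimize("aot_inductor_debug"):
    print("aot_inductor_debug:", peak_increase(train_step), "GB")

with torchdynamo.optimize("inductor"):
    print("inductor:", peak_increase(train_step), "GB")
```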
The graph is really large. The minifier is not able to minify it meaningfully.
cc @jansel @ngimel