We could try disabling horizontal fusion (which may be what is increasing memory usage). Replace this line with return False:
https://github.com/pytorch/torchdynamo/blob/0c96ddbad26216671861762c161ebcfb0fdbeb9f/torchinductor/scheduler.py#L913
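For reference, a rough sketch of what that experiment looks like. The function name, signature, and dependency test below are assumptions for illustration; the real code at the linked line may differ at that commit:

```python
# torchinductor/scheduler.py -- illustrative sketch only, not the real source.
# "can_fuse", "can_fuse_vertical", and the ancestor test are assumed names/logic.

def can_fuse(self, node1, node2):
    if node1.get_names() & node2.ancestors:
        # node2 consumes node1's outputs: a vertical (producer/consumer)
        # fusion, which removes an intermediate buffer. Keep these.
        return self.can_fuse_vertical(node1, node2)
    # Otherwise the candidates are independent and would be fused
    # horizontally (shared reads, no data dependency). The experiment is to
    # replace the heuristic on this path with:
    return False
```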
Yes, it's because of horizontal fusion. The footprint dropped drastically, and is now even better than eager:
Start of file: 0.00 GB
Inputs setup: 8.43 GB
make_fx done: 10.50 GB
------
Before running eager: 8.43 GB
After running eager: 10.50 GB
Increase in peak memory in eager: 2.07 GB
------
Before running inductor: 8.43 GB
inductor compiler wrapped: 9.46 GB
after running inductor: 9.46 GB
Increase in peak memory in inductor: 1.03 GB
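For context, numbers like the ones above can be gathered with the standard CUDA memory stats. A minimal sketch (the actual instrumentation in the benchmark script may differ):

```python
import torch

GB = 1024 ** 3

def measure_peak_increase(tag, fn):
    # Reset the peak counter down to the currently allocated amount, run the
    # workload, then report how much higher the peak went.
    torch.cuda.synchronize()
    torch.cuda.reset_peak_memory_stats()
    before = torch.cuda.max_memory_allocated() / GB
    print(f"Before running {tag}: {before:.2f} GB")
    fn()
    torch.cuda.synchronize()
    after = torch.cuda.max_memory_allocated() / GB
    print(f"After running {tag}: {after:.2f} GB")
    print(f"Increase in peak memory in {tag}: {after - before:.2f} GB")
```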
What should we do here? Heuristic? I guess this is a standard tradeoff problem.
Yeah, it is a performance versus memory issue. Horizontal fusion reduces total memory reads, so it will make things go faster. Another knob to try is torchinductor.config.max_fusion_size (currently 64); perhaps there are some mega-fusions that are bloating memory usage in some models.

@jansel I think we also just need to investigate some more memory optimization passes :) Most trivially, we should probably do a pass at the end that reorders scheduler nodes to minimize peak memory (I think we have all the needed info). A rough sketch of such a pass is included after the list below.
- We could have a "low memory mode" where we disable horizontal fusions. This is immediately actionable to fix some OOMs when doing accuracy testing in our CI and nightly. I can work on it.
- We could implement some sort of fancy analysis to reject fusions given some sort of memory score.
@stumpOS, this might be interesting to you.
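A rough sketch of that reordering pass. This is not Inductor code: the node representation is a simplified stand-in, the graph is assumed acyclic, and real buffers would also involve aliasing, mutation, and graph outputs. The greedy rule is to pick, among the nodes whose inputs are ready, the one whose execution grows live memory the least:

```python
from collections import defaultdict

def reorder_for_peak_memory(nodes):
    """Greedy sketch: choose a topologically valid order that tries to keep
    live intermediate memory small at every step.

    nodes: list of dicts like
        {"name": "buf2", "size": 4 << 20, "reads": ["buf0", "buf1"]}
    where "size" is the byte size of the buffer this node writes and
    "reads" lists buffers written by other nodes (graph inputs are ignored).
    Returns (order, simulated_peak_bytes).
    """
    by_name = {n["name"]: n for n in nodes}
    consumers = defaultdict(set)
    for n in nodes:
        for r in n["reads"]:
            if r in by_name:
                consumers[r].add(n["name"])

    scheduled, order = set(), []
    live = 0   # bytes of intermediate buffers currently resident
    peak = 0

    def freed_by(n):
        # Inputs whose every consumer will have run once n runs can be freed.
        return sum(
            by_name[r]["size"]
            for r in set(n["reads"])
            if r in by_name and consumers[r] <= scheduled | {n["name"]}
        )

    while len(order) < len(nodes):
        ready = [
            n for n in nodes
            if n["name"] not in scheduled
            and all(r in scheduled or r not in by_name for r in n["reads"])
        ]
        # Pick the ready node with the smallest net growth of live memory.
        best = min(ready, key=lambda n: n["size"] - freed_by(n))
        peak = max(peak, live + best["size"])  # output plus still-live inputs
        live += best["size"] - freed_by(best)
        scheduled.add(best["name"])
        order.append(best["name"])

    return order, peak
```

The point is only that the dependency edges and buffer sizes the scheduler already tracks are enough to drive such a pass; a real implementation would also have to respect fusion groups and kernel launch order constraints.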
Unable to repro:
------
Before running eager: 0.15 GB
After running eager: 11.09 GB
Increase in peak memory in eager: 10.94 GB
------
Before running inductor: 0.15 GB
inductor compiler wrapped: 8.88 GB
after running inductor: 8.88 GB
Increase in peak memory in inductor: 8.73 GB
Script: https://pastebin.com/KnAM4di3
This is no longer observed. Closing the task.
For volo_d1_224, I was trying to figure out where the memory footprint goes. One known issue is that Python holds references to the inputs of the backward pass due to how AOT Autograd works.
So, I used the aot_inductor_debug backend, which uses the same decomps and partitioner as Inductor. I was hoping to see the same increase in memory footprint as with Inductor if the extra Python references were the culprit, but 25% (out of the 40% increase) was not accounted for.

So, I dumped the backward graph and manually checked the peak memory of eager vs Inductor. I observed that eager raises the peak memory by 2 GB, while Inductor raises it by 7 GB.
The repro script is here - https://gist.github.com/anijain2305/d1b3dfbcae25a854a5e5640bd63d1171
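The full repro is in the gist above; the shape of that comparison is roughly the sketch below. The tiny model is a stand-in for volo_d1_224, and the old standalone torchdynamo.optimize API is assumed, since this issue predates torch.compile:

```python
import torch
import torchdynamo  # standalone pytorch/torchdynamo, as used at the time

GB = 1024 ** 3

# Stand-in model/inputs; the real measurement used volo_d1_224 from the
# timm benchmark suite.
model = torch.nn.Sequential(
    torch.nn.Linear(1024, 4096), torch.nn.GELU(), torch.nn.Linear(4096, 1024)
).cuda()
inputs = (torch.randn(256, 1024, device="cuda"),)

def train_step():
    model.zero_grad(set_to_none=True)
    model(*inputs).sum().backward()

def peak_increase(fn):
    torch.cuda.synchronize()
    torch.cuda.reset_peak_memory_stats()
    before = torch.cuda.max_memory_allocated()
    fn()
    torch.cuda.synchronize()
    return (torch.cuda.max_memory_allocated() - before) / GB

print("eager:", peak_increase(train_step), "GB")

# Same decomps/partitioner as Inductor but eager kernels: isolates the cost
# of the extra Python references held across the forward/backward split.
with torchdynamo.optimize("aot_inductor_debug"):
    print("aot_inductor_debug:", peak_increase(train_step), "GB")

with torchdynamo.optimize("inductor"):
    print("inductor:", peak_increase(train_step), "GB")
```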
The graph is really large. The minifier is not able to minify it meaningfully.
cc @jansel @ngimel