pytorch / torchdynamo

A Python-level JIT compiler designed to make unmodified PyTorch programs faster.
BSD 3-Clause "New" or "Revised" License

Peak memory increase with just Inductor compiled graph #1479

Closed by anijain2305 1 year ago

anijain2305 commented 2 years ago

For volo_d1_224, I was trying to figure out where the memory footprint goes.

------
Before running eager: 8.43 GB
After running eager: 10.50 GB
Increase in peak memory in eager: 2.07

------
Before running inductor: 8.43 GB
inductor compiler wrapped: 9.46 GB
after running inductor: 15.64 GB
Increase in peak memory in inductor: 7.21

The repro script is here - https://gist.github.com/anijain2305/d1b3dfbcae25a854a5e5640bd63d1171

The graph is really large. The minifier is not able to minify it meaningfully.

cc @jansel @ngimel
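
For reference, each "Increase in peak memory" figure in the report above is the peak reading minus the baseline taken just before the run. A minimal sketch of that bookkeeping follows. The helper names `peak_increase_gb` and `measure_peak_increase` are hypothetical and are not taken from the linked gist; the reader callables are placeholders for whatever memory counters the platform exposes.

```python
from typing import Callable, Tuple


def peak_increase_gb(before_gb: float, peak_gb: float) -> float:
    """Return the rise in peak memory over the baseline, rounded like the logs."""
    return round(peak_gb - before_gb, 2)


def measure_peak_increase(
    run: Callable[[], None],
    read_current_gb: Callable[[], float],
    read_peak_gb: Callable[[], float],
    reset_peak: Callable[[], None],
) -> Tuple[float, float, float]:
    """Run `run` once and return (before, peak, increase), all in GB.

    The three callables are placeholders. On CUDA they could wrap
    torch.cuda.memory_allocated, torch.cuda.max_memory_allocated and
    torch.cuda.reset_peak_memory_stats, but that mapping is an assumption,
    not something confirmed by the repro script.
    """
    reset_peak()                 # forget peaks from earlier phases
    before = read_current_gb()   # baseline: inputs and weights already resident
    run()
    peak = read_peak_gb()        # high-water mark reached during run()
    return before, peak, peak_increase_gb(before, peak)


# Readings reported in this issue.
print("eager:   ", peak_increase_gb(8.43, 10.50))
print("inductor:", peak_increase_gb(8.43, 15.64))
```

Resetting the peak counter before each phase matters here: without it, the inductor measurement could inherit a high-water mark left behind by the eager run.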

jansel commented 2 years ago

We could try disabling horizontal fusion, since horizontal fusion can increase memory usage.

Replace this line with return False: https://github.com/pytorch/torchdynamo/blob/0c96ddbad26216671861762c161ebcfb0fdbeb9f/torchinductor/scheduler.py#L913

anijain2305 commented 2 years ago

Yes, it's because of horizontal fusion. The footprint dropped drastically and is now even better than eager.

Start of file: 0.00 GB
Inputs setup: 8.43 GB
make_fx done: 10.50 GB

------
Before running eager: 8.43 GB
After running eager: 10.50 GB
Increase in peak memory in eager: 2.07

------
Before running inductor: 8.43 GB
inductor compiler wrapped: 9.46 GB
after running inductor: 9.46 GB
Increase in peak memory in inductor: 1.03

What should we do here? Use a heuristic? I guess this is a standard tradeoff problem.

jansel commented 2 years ago

Yeah, it is a performance versus memory issue. Horizontal fusion reduces total memory reads, so it makes things go faster.
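
The tradeoff can be made concrete with a toy cost model. This is only a sketch under simplifying assumptions, not a description of what the inductor scheduler computes: `k` kernels all read one shared input of `n_bytes`, each writes an intermediate of the same size, and in the unfused case each intermediate is consumed and freed before the next kernel starts. The names `Cost`, `unfused`, and `horizontally_fused` are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Cost:
    input_bytes_read: int   # traffic spent reading the shared input
    peak_live_bytes: int    # shared input plus intermediates alive at once


def unfused(n_bytes: int, k: int) -> Cost:
    # Each of the k kernels re-reads the input. Its intermediate is consumed
    # and freed before the next kernel runs, so only one is live at a time.
    return Cost(input_bytes_read=k * n_bytes, peak_live_bytes=n_bytes + n_bytes)


def horizontally_fused(n_bytes: int, k: int) -> Cost:
    # One fused kernel reads the input once but writes all k intermediates
    # together, so all of them are live at the same moment.
    return Cost(input_bytes_read=n_bytes, peak_live_bytes=n_bytes + k * n_bytes)


if __name__ == "__main__":
    for k in (2, 4, 8):
        print(k, unfused(1 << 20, k), horizontally_fused(1 << 20, k))
```

In this model the read traffic shrinks by a factor of `k` while the peak grows roughly linearly in `k`, which is the shape of the regression reported above.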

Chillee commented 2 years ago

@jansel I think we also just need to investigate some more memory optimization passes :) Most trivially, we should probably do a pass at the end that reorders scheduler nodes to minimize peak memory (I think we have all the needed info).
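
A toy version of such a reordering pass is sketched below, assuming each node produces one buffer that can be freed right after its last reader runs. The graph encoding and the function names are hypothetical, and the exhaustive search is only meant to illustrate the objective; a real pass would need a heuristic, since the number of orderings grows factorially.

```python
from itertools import permutations
from typing import Dict, Sequence, Tuple

# node name -> (nodes whose outputs it reads, size of its own output buffer)
Graph = Dict[str, Tuple[Tuple[str, ...], int]]


def is_topological(order: Sequence[str], graph: Graph) -> bool:
    """True if every node appears after all of the nodes it reads from."""
    seen = set()
    for name in order:
        if any(dep not in seen for dep in graph[name][0]):
            return False
        seen.add(name)
    return True


def peak_memory(order: Sequence[str], graph: Graph) -> int:
    """Peak live size when buffers are freed right after their last reader."""
    # Buffers nobody reads are graph outputs and stay live until the end.
    last_use = {name: len(order) for name in order}
    for step, name in enumerate(order):
        for dep in graph[name][0]:
            last_use[dep] = step
    live = peak = 0
    for step, name in enumerate(order):
        live += graph[name][1]          # output is allocated while inputs are live
        peak = max(peak, live)
        for dep in set(graph[name][0]):
            if last_use[dep] == step:
                live -= graph[dep][1]   # last reader finished, buffer can go
    return peak


def min_peak_order(graph: Graph) -> Tuple[str, ...]:
    """Exhaustive search over valid orders; only viable for toy graphs."""
    valid = (p for p in permutations(graph) if is_topological(p, graph))
    return min(valid, key=lambda p: peak_memory(p, graph))


# Two branches share input x; each produces a large temporary that is
# immediately reduced to a small result.
GRAPH: Graph = {
    "x": ((), 4),
    "a1": (("x",), 4), "r1": (("a1",), 1),
    "a2": (("x",), 4), "r2": (("a2",), 1),
    "out": (("r1", "r2"), 1),
}
fused_like = ("x", "a1", "a2", "r1", "r2", "out")   # a1 and a2 adjacent
interleaved = ("x", "a1", "r1", "a2", "r2", "out")  # finish one branch first

if __name__ == "__main__":
    print(peak_memory(fused_like, GRAPH), peak_memory(interleaved, GRAPH))
    print(min_peak_order(GRAPH))
```

Placing `a1` and `a2` next to each other, as horizontal fusion would, keeps both large temporaries alive at once; finishing one branch before starting the other lets the first temporary be freed earlier.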

desertfire commented 2 years ago

  • We could have a "low memory mode" where we disable horizontal fusions.

This is immediately actionable to fix some OOMs when doing accuracy testing in our CI and nightly. I can work on it.

  • We could implement some sort of fancy analysis to reject fusions given some sort of memory score.

@stumpOS, this might be interesting to you.
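
The two options above could be combined in a single fusion predicate. The sketch below is purely illustrative: `FusionCandidate`, `should_fuse_horizontally`, `low_memory_mode`, and `max_extra_live_bytes` are hypothetical names, not existing torchinductor APIs or config flags, and the "memory score" is reduced to a simple byte budget.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class FusionCandidate:
    shared_read_bytes: int   # input traffic saved if the two nodes are fused
    extra_live_bytes: int    # additional bytes kept alive by the fused kernel


def should_fuse_horizontally(
    cand: FusionCandidate,
    *,
    low_memory_mode: bool = False,
    max_extra_live_bytes: Optional[int] = None,
) -> bool:
    """Decide whether to accept a horizontal fusion (hypothetical policy)."""
    if low_memory_mode:
        return False  # option 1: never fuse horizontally
    if cand.shared_read_bytes <= 0:
        return False  # nothing is shared, so fusing saves no reads
    if max_extra_live_bytes is not None:
        # option 2: reject fusions whose memory cost exceeds the budget
        return cand.extra_live_bytes <= max_extra_live_bytes
    return True


if __name__ == "__main__":
    cand = FusionCandidate(shared_read_bytes=100, extra_live_bytes=400)
    print(should_fuse_horizontally(cand))
    print(should_fuse_horizontally(cand, low_memory_mode=True))
    print(should_fuse_horizontally(cand, max_extra_live_bytes=300))
```

The blanket switch is the cheaper of the two to implement, which fits its use as a quick fix for OOMs in CI; the budgeted variant keeps the read savings wherever the extra live memory is small.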

stumpOS commented 2 years ago

Unable to repro:

------
Before running eager: 0.15 GB
After running eager: 11.09 GB
Increase in peak memory in eager: 10.94

------
Before running inductor: 0.15 GB
inductor compiler wrapped: 8.88 GB
after running inductor: 8.88 GB
Increase in peak memory in inductor: 8.73

Script: https://pastebin.com/KnAM4di3

anijain2305 commented 1 year ago

This no longer reproduces. Closing the task.