google / ml-compiler-opt

Infrastructure for Machine Learning Guided Optimization (MLGO) in LLVM.

[Queries] Regarding usage of LLVM built with Pretrained Models and Development Mode #348

Closed quic-garvgupt closed 1 month ago

quic-garvgupt commented 2 months ago

Hi,

I have successfully built a toolchain using the model inlining-Oz-v1.1 released [here](https://github.com/google/ml-compiler-opt/releases). However, I have some queries regarding its usage while building an application in release mode, as well as some questions pertaining to development mode.

Release Mode

  1. Should the -mllvm -enable-ml-inliner=release flag be added only to the clang driver, or to both the clang driver and the linker? The application I am building invokes the compiler and the linker through separate command-line invocations.
  2. While building llvm for release mode, is it necessary to disable LTO?
  3. After building llvm for release mode, are there any restrictions on flags such as -flto when building the application? Can LTO (both thin and full) be enabled when building the application?

Development Mode

  1. While building llvm to generate the corpus for training mode, is it necessary to disable LTO?
  2. When building the application to generate the corpus, can LTO (both thin and full) be enabled, or should LTO be disabled while building the application?
  3. The paper published two different strategies for training (PG and ES). Is there a way for a user to specify which training method to use through command line flags or other means while in development mode?
boomanaiden154 commented 2 months ago

Should the -mllvm -enable-ml-inliner=release flag be added only to the clang driver, or to both the clang driver and the linker? The application I am building invokes the compiler and the linker through separate command-line invocations.

It depends upon if you're doing (Thin)LTO. If you're not using (Thin)LTO, then you should be fine omitting it from the linker. If you are using some form of LTO, then you need to pass it to the linker too so that it will use the policy for inlining there.
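For illustration, a minimal sketch of the two cases (this assumes lld and the -Oz model; paths are placeholders, and other linkers spell the option forwarding differently):

```
# Without LTO: inlining happens during compilation, so the flag is only
# needed on the compile lines.
clang -Oz -mllvm -enable-ml-inliner=release -c foo.c -o foo.o
clang foo.o -o foo

# With (Thin)LTO: inlining also runs in the link-time backend, so forward
# the option to the linker as well (here via lld's -mllvm option).
clang -Oz -flto=thin -mllvm -enable-ml-inliner=release -c foo.c -o foo.o
clang -flto=thin -fuse-ld=lld -Wl,-mllvm,-enable-ml-inliner=release foo.o -o foo
```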

While building llvm for release mode, is it necessary to disable LTO?

No. The build options for LLVM should not matter.

After building llvm for release mode, are there any restrictions on flags such as -flto when building the application? Can LTO (both thin and full) be enabled when building the application?

There shouldn't be anything major. You can use (Thin)LTO to build the application. You just need to make sure to pass the flag to the linker too so that it will use the correct inlining policy. The policy might also change in effectiveness when going to LTO, depending upon the corpus that it was trained on.

While building llvm to generate the corpus for training mode, is it necessary to disable LTO?

No. You should be able to use pretty much whatever build options you like for LLVM.

When building the application to generate the corpus, can LTO (both thin and full) be enabled, or should LTO be disabled while building the application?

Ideally it should be representative of how you build your application in production. If you don't use (Thin)LTO there, then training on a (Thin)LTO corpus does not make sense. If you do, then training on a non-(Thin)LTO corpus does not make a lot of sense.

The paper published two different strategies for training (PG and ES). Is there a way for a user to specify which training method to use through command line flags or other means while in development mode?

It's not an LLVM flag. It would be flags/different scripts within this repository that drive the training pipeline. The demos are currently written to use PPO. There is no end-to-end script that uses ES for training currently, although getting one written isn't too big of a deal now that most of the ES stuff is upstreamed.
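For reference, the PPO training in the inlining demo is launched roughly like this (paths are placeholders; see the demo docs in this repo for the exact, up-to-date invocation):

```
# Roughly how the inlining demo drives PPO training; $OUTPUT_DIR, $CORPUS_DIR
# and $LLVM_INSTALL are placeholders for your own paths.
cd ml-compiler-opt
PYTHONPATH=$PYTHONPATH:. python3 compiler_opt/rl/train_locally.py \
  --root_dir=$OUTPUT_DIR \
  --data_path=$CORPUS_DIR \
  --gin_bindings=clang_path="'$LLVM_INSTALL/bin/clang'" \
  --gin_bindings=llvm_size_path="'$LLVM_INSTALL/bin/llvm-size'" \
  --gin_files=compiler_opt/rl/inlining/gin_configs/ppo_nn_agent.gin
```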

mtrofin commented 2 months ago

Should the -mllvm -enable-ml-inliner=release flag be added only to the clang driver, or to both the clang driver and the linker? The application I am building invokes the compiler and the linker through separate command-line invocations.

I assume your build performs some kind of LTO. There's no hard and fast rule, I'd experiment with/without enabling in the backend optimization. FWIW, the model we have here was trained assuming no LTO step.

While building llvm for release mode, is it necessary to disable LTO?

(IIUC this is about building e.g. clang itself) Shouldn't be necessary to disable anything; all that building llvm with a release model does (insofar as the linker is concerned) is add a precompiled library and a .o. We build clang with ThinLTO, and embed (release mode) both inliner and regalloc models. However, we don't have a full LTO scenario. Is something broken when doing a full LTO of clang and trying to embed a model?
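For reference, a sketch of what the release-mode embedding looks like at LLVM configure time (the option names follow the inlining demo; the model path is a placeholder for wherever the unpacked inlining-Oz-v1.1 model lives, so double-check against your LLVM version):

```
# Release-mode build that AOT-compiles and embeds the inliner model.
cmake -G Ninja ../llvm-project/llvm \
  -DCMAKE_BUILD_TYPE=Release \
  -DLLVM_ENABLE_PROJECTS="clang;lld" \
  -DTENSORFLOW_AOT_PATH=$(python3 -c "import tensorflow; print(tensorflow.__path__[0])") \
  -DLLVM_INLINER_MODEL_PATH=/path/to/unpacked/inlining-Oz-v1.1
ninja clang lld
```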

After building llvm for release mode, are there any restrictions on flags such as -flto when building the application? Can LTO (both thin and full) be enabled when building the application?

No restrictions, but "mileage may vary": for example, if you trained on a corpus of post-thinLTO IR modules, you'll get the best results when applying that model to similar modules. One reason is that features get quantized (bucketized), and if the distribution of feature values is too far off, the benefits would degrade.

While building llvm to generate the corpus for training mode, is it necessary to disable LTO?

You don't need LLVM built in any different way, btw, to collect a corpus. The functionality for corpus collection is in any build of clang. The main thing is to use the same compiler version (i.e. from the same llvm repo githash) when collecting the corpus as when later compiling it, just to avoid things like IR breaking changes. So you could use the clang you build for training, for example.
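For example, a quick sanity check is to compare the version strings of the two binaries (for builds from a git checkout the llvm-project hash is printed there; paths are placeholders):

```
# The clang used to collect the corpus and the clang used later during
# training should come from the same llvm-project commit.
/path/to/corpus-clang --version | head -n1
/path/to/training-clang --version | head -n1
```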

When building the application to generate the corpus, can LTO (both thin and full) be enabled, or should LTO be disabled while building the application?

You can collect the IR corpus either from the pre-thinlink compilation or from post-thinlink. We never do anything with full LTO, only ThinLTO, so we never added that support for full LTO. To answer your question, it's less about how you build the application and more about which IR you want to train on.
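For the post-thinlink case, a rough sketch (the flag spellings follow the regalloc demo and the extract_ir.py docstring; treat them as a starting point rather than gospel, and the paths are placeholders):

```
# 1. Link with lld, keeping the post-import bitcode and the index files:
#      -Wl,--save-temps=import -Wl,--thinlto-emit-index-files
# 2. Then point extract_ir.py at the build output directory:
python3 llvm/utils/mlgo-utils/mlgo/corpus/extract_ir.py \
  --input=/path/to/build/obj \
  --input_type=directory \
  --thinlto_build=local \
  --output_dir=/path/to/corpus
```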

If your scenario involves ThinLTO, I'd recommend starting by training on the post-link IR first - i.e. have the normal inliner in the frontend, and ML in the backend. Then it gets tricky and you need to experiment - you could stop there (i.e. if you get reasonable savings, just use ML in the post-thinlink); or try the model in both front and back; or you could build a second corpus from the frontend IR and continue training there; or (probably best) collect the 2 corpora first, do quantization on them, then train on one and then finetune on the other. We did the "train mostly on the back, finetune in front" without quantization for Chrome on Android 32 bit (@Northbadge did that and he can correct me if I misremember), and only "back" for 64 bit, for example (that bit is fresher in memory, @alekh @tvmarino's work).

The paper published two different strategies for training (PG and ES). Is there a way for a user to specify which training method to use through command line flags or other means while in development mode?

Not yet, but I have a cludge that demonstrates using ES in my fork: https://github.com/mtrofin/ml-compiler-opt/tree/es. Focus on "cludge". @boomanaiden154 has, I think, a plan to bring ES into the fold cleanly.

mtrofin commented 2 months ago

Oh, just saw @boomanaiden154 also replied. Sorry for some duplicate info!

quic-garvgupt commented 2 months ago

Thank you for the detailed response, @mtrofin and @boomanaiden154.

However, we don't have a full LTO scenario. Is something broken when doing a full LTO of clang and trying to embed a model?

I was going through the demo and noticed LTO being disabled in development mode, hence the question. I am building clang without any LTO as well, but wanted to clarify whether this is a necessity or whether clang can be built with any options.

You can collect the IR corpus from either before the pre-thinlink compilation or from post-thinlink. We never do anything with LTO, only ThinLTO, so we never added that support to full LTO.

I missed the point that corpus collection is only supported for thin LTO at the moment and not full LTO.

Not yet, but I have a cludge that demonstrates using ES in my fork: https://github.com/mtrofin/ml-compiler-opt/tree/es. Focus on "cludge". @boomanaiden154 has, I think, a plan to bring ES into the fold cleanly.

Thanks for sharing this. I'll definitely try this out. Just to be clear, this follows the same instructions as mentioned in the demo, and it will use the ES strategy to train the model?

I will keep this issue open for some time while I work on this project in case I have any further queries or comments. Thank you once again!

mtrofin commented 2 months ago

I missed the point that corpus collection is only supported for thin LTO at the moment and not full LTO.

...and no-lto (i.e. just frontend - like, IIUC, your scenario)

Just to be clear, this follows the same instructions as mentioned the demo and it will use ES strategy to train the model?

In broad strokes, yes, i.e. if you treat the training script as a black box, then everything else should be the same; but I'd recommend checking (like debugging or print-ing in the python code) to make sure you're doing ES - like I said, that branch is a cludge (note the absence of tests, for instance :) )

quic-garvgupt commented 2 months ago

...and no-lto (i.e. just frontend - like, IIUC, your scenario)

To clarify, is there currently support for corpus collection for both thin LTO and no LTO? Apologies for asking this again, but I interpreted your response as "corpus collection is only supported for thin LTO and no LTO".

As you mentioned, I am building the application with no LTO, and after I run extract_ir, my corpus description contains no modules. So I wanted to know whether I am missing something or whether corpus extraction for an application built with no LTO is not supported. If it is not supported, could you provide some suggestions on how to enable corpus extraction for an application built with no LTO?

mtrofin commented 2 months ago

Yup, see here: https://github.com/llvm/llvm-project/blob/main/llvm/utils/mlgo-utils/mlgo/corpus/extract_ir.py#L12

There are some more nuances with local ThinLTO; if you chase the --thinlto_build flag in the script, those should become clear.
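For the no-LTO case specifically, a rough sketch based on the script's docstring and the demos (double-check the details; paths are placeholders): the objects need the embedded .llvmbc/.llvmcmd sections, which usually means compiling with -Xclang=-fembed-bitcode=all, and then extracting from the compilation database:

```
# Compile the application with the bitcode and command line embedded in the
# objects (no LTO), e.g. by adding to CFLAGS:
#      -Xclang=-fembed-bitcode=all
# Then extract from the compilation database:
python3 llvm/utils/mlgo-utils/mlgo/corpus/extract_ir.py \
  --input=/path/to/build/compile_commands.json \
  --input_type=json \
  --llvm_objcopy_path=/path/to/llvm-install/bin/llvm-objcopy \
  --output_dir=/path/to/corpus
```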

quic-garvgupt commented 2 months ago

Thanks for your response!

I successfully extracted IR, generated a corpus, and trained a warmstart model. Currently, the training of the RL model is still in progress. I want to get a rough idea of the training time because the data I’m using is fairly small—only 88 modules, as mentioned in the info after trace collection.

I0821 09:36:28.822783 140070467577600 generate_default_trace.py:202] 88 success, 0 failed out of 88
88 of 88 modules succeeded, and 39 training examples written

It took about 45 minutes to train the warmstart model, and it has been more than 8 hours since the RL model training began. Is there any rough estimate of how long the training might take for the above number of modules on a 32-core machine with 64 GB of RAM? I am using the default set of parameters for the model, as found in the /local/mnt/workspace/garvgupt/ml-compiler-opt/compiler_opt/rl/inlining/gin_configs/ppo_nn_agent.gin file.

Additionally, since I am still a novice in model engineering, any advice on what values to set or how to decide the values for the parameters mentioned in the above gin file for the small training dataset would be appreciated. TIA

mtrofin commented 2 months ago

Is there any rough estimate of how long the training might take for the above number of modules on a 32-core machine with 64 GB of RAM?

If you look at the tensorboard progression of the reward (especially since, IIUC, you are processing the entire corpus at each pass), it should give you an indication: e.g. if it's not making much progress in improving the reward anymore, it has probably learned enough.
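e.g. something along the lines of:

```
# Point tensorboard at the training output directory you passed as --root_dir
# ($OUTPUT_DIR is a placeholder).
tensorboard --logdir=$OUTPUT_DIR
```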

You could also try the current saved model (it's under the output directory - make sure you don't pick the one called collect_ something)
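One way to try it without rebuilding clang in release mode (this assumes a clang built with TFLite support for development mode; the flag names below follow LLVM's development-mode inline advisor, so double-check them against your LLVM version):

```
# Compile with the in-progress policy. The policy path is a placeholder:
# pick the saved policy under the output directory, not the collect_ one.
clang -Oz -c foo.c -o foo.o \
  -mllvm -enable-ml-inliner=development \
  -mllvm -ml-inliner-model-under-training=/path/to/output_dir/policy_dir
```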

Additionally, since I am still a novice in model engineering, any advice on what values to set or how to decide the values for the parameters mentioned in the above gin file for the small training dataset would be appreciated. TIA

IIRC we did a hyperparameter sweep using xmanager. The infra should be easily adaptable to that (we did adapt it internally, but haven't yet pushed that upstream). But all that amounts to is "trial and error", really.
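For that trial and error, the values in ppo_nn_agent.gin can also be overridden per run with --gin_bindings instead of editing the file; the binding below is illustrative only (check the .gin file in your checkout for the actual parameter names and defaults):

```
# Same launch as before, with one gin parameter overridden on the command
# line; train_eval.num_policy_iterations is used here purely for illustration.
PYTHONPATH=$PYTHONPATH:. python3 compiler_opt/rl/train_locally.py \
  --root_dir=$OUTPUT_DIR \
  --data_path=$CORPUS_DIR \
  --gin_files=compiler_opt/rl/inlining/gin_configs/ppo_nn_agent.gin \
  --gin_bindings=train_eval.num_policy_iterations=100
```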

quic-garvgupt commented 2 months ago

[attached: tensorboard plot of rewards and the mean of rewards]

Questions: In the graph above, rewards and the mean of rewards are plotted. Towards the end, it flattens out. Does this mean there is not much left to learn? I am also unsure what a reward of 0 indicates.

mtrofin commented 2 months ago

You want to look at reward_distribution. It should look like it's asymptotically reaching some positive value.

(this is mentioned in passing in the inlining demo, if you search for "tensorboard")