google / ml-compiler-opt

Infrastructure for Machine Learning Guided Optimization (MLGO) in LLVM.
Apache License 2.0

[non-issue] MLGO Questions #342

Closed TIHan closed 1 month ago

TIHan commented 6 months ago

This isn't an issue. I wasn't sure where to ask these questions, so I posted them as an issue; I hope that's ok.

I apologize in advance if these seem very obvious (I'm new to ML). I have also read through the MLGO paper.

MLGO Questions:

  1. What is the difference between Clang and InteractiveClang usage?
  2. MLGO keeps referring to "modules" for evaluating and training. Are "modules" a collection of functions?
  3. When training a model, does Clang stop and wait for MLGO's input?
  4. Does training use a batch of functions for a single step, or does it step per-function?

boomanaiden154 commented 6 months ago

  1. I believe InteractiveClang is just a wrapper around the interactive MLGO tooling that has been upstreamed into LLVM, which allows Unix pipes to be used to get data into and out of the compiler. @jacob-hegna would have more context on that, but he isn't actively working on the project anymore. I don't believe those environments are currently used by the training scripts within this repository; they were designed to support some other use cases.
  2. Modules essentially correspond to a translation unit/source file. This project deals with LLVM bitcode, where a source translation unit is usually turned into an IR module, which is a collection of functions/globals. This part of the LangRef has additional information.
  3. This depends on the mode. When running in interactive mode with pipes, the compiler does wait for the output. For all the default training pipelines available within this repository, however, the compiler does not wait for any input. The models are exported to TFLite, then loaded by the compiler and executed (a rough sketch of that export step is included below). The compiler spits out a bunch of data that we then use for training, without us needing to provide any information to the compiler while it is running. In interactive mode, as mentioned before, that is different.
  4. This depends on the specific problem you're looking at. For inlining, the reward is per module, potentially with code imported from other modules if ThinLTO is used; inlining doesn't make sense with only a single function. Regalloc computes the reward per function. When training, though, a large number of modules are compiled together (the default is 512 from what I remember) and that batch is used to perform an iteration of training. The details of the training are in https://arxiv.org/abs/2101.04808; some of the specific hyperparameters have changed since then, but those are all available in the gin config files.

That's at least how I recall things. Hopefully I'm not misremembering anything, as there can be some subtle but impactful differences here. @mtrofin will see this when he's back from vacation and can correct me if I'm wrong anywhere.
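
As a side note on the TFLite export mentioned in point 3: at its core it is a standard TensorFlow SavedModel-to-TFLite conversion. A minimal sketch of that step (the paths are hypothetical; the real pipeline wraps this in its own policy-saving utilities):

```python
# Minimal sketch of exporting a trained policy (a TF SavedModel) to TFLite so
# that clang can load and execute it. Paths are hypothetical placeholders; the
# repo's own policy-saving utilities handle this step in the real pipeline.
import tensorflow as tf

saved_policy_dir = "output/saved_policy"  # hypothetical SavedModel directory
converter = tf.lite.TFLiteConverter.from_saved_model(saved_policy_dir)
tflite_model = converter.convert()

with open("output/model.tflite", "wb") as f:
  f.write(tflite_model)
```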

TIHan commented 6 months ago

@boomanaiden154 Thank you for your informative answers! This is very helpful.

TIHan commented 6 months ago

@boomanaiden154 just a follow-up question: So my understanding is that, during training, Clang loads a policy from disk and uses it to make an inline decision. Do the results from that get immediately fed back into the model and then the policy gets updated on disk?

boomanaiden154 commented 6 months ago

During training, the decisions that clang makes using the policy loaded from disk are logged as a trajectory. The logs are then ingested by the tooling in this repository and used to train the model. We do a data collection phase where we compile a bunch of modules (512 is the default for regalloc), log all the trajectories, ingest them into the tooling here, do some training, and then iterate again.
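
Schematically, that iteration looks something like the following (a Python-flavored sketch; every name below is a hypothetical placeholder rather than an actual API from this repository):

```python
# Rough sketch of the collect-then-train loop described above.
# All functions/variables are hypothetical placeholders, not repo APIs.
import random

NUM_MODULES_PER_ITERATION = 512  # the default mentioned above for regalloc
NUM_POLICY_ITERATIONS = 100      # arbitrary, for illustration only

corpus = load_corpus()           # hypothetical: the set of IR modules to compile
policy = initial_policy()        # hypothetical: the starting policy

for _ in range(NUM_POLICY_ITERATIONS):
    # Phase 1: data collection. clang loads the current policy and uses it
    # as-is (it is never mutated here); each compilation emits a trajectory log.
    modules = random.sample(corpus, NUM_MODULES_PER_ITERATION)
    trajectories = [compile_and_log(m, policy) for m in modules]  # hypothetical

    # Phase 2: training. The logged trajectories are fed to the RL algorithm
    # (e.g. PPO) to produce the next policy, and the loop repeats.
    policy = train_on_trajectories(policy, trajectories)          # hypothetical
```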

TIHan commented 6 months ago

That makes sense to me. While the model is being trained, does it get written back to disk to be used for subsequent decisions made in clang?

mtrofin commented 6 months ago

In the non-InteractiveClang case, the training process is basically two-phased: (1) the current model is given to clang; clang uses it without mutating it, and we make whatever observations we need; this happens for N modules in parallel; (2) the resulting N traces are fed to e.g. PPO or REINFORCE (or, for Evolution Strategies, just the reward is fed to the ES algorithm), we get a new model, and we repeat.

If you use InteractiveClang you can do whatever you want every time clang asks for advice.
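
Conceptually, the interactive case is just a read-decide-write loop over the pipes. A very rough sketch (the observation/decision encoding here is purely hypothetical; the real wire format is defined by the interactive MLGO support in LLVM):

```python
# Conceptual sketch of driving clang interactively: each time clang asks for
# advice, read its observation, decide with whatever logic you want, and write
# the decision back. All helpers are hypothetical placeholders; the actual
# protocol lives in LLVM's interactive MLGO support.
def advise_loop(read_pipe, write_pipe, decide):
    while True:
        observation = read_observation(read_pipe)        # hypothetical helper
        if observation is None:                          # compilation finished
            return
        decision = decide(observation)                   # e.g. query any policy
        write_decision(write_pipe, decision)             # hypothetical helper
```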

TIHan commented 6 months ago

Thank you @mtrofin and @boomanaiden154 for your helpful answers. A lot of this is starting to make sense.

I have another question. I notice the output (or "inlining_decision") of the "saved_policy" is a single scalar value (shape (1)). I'm not sure how it got transformed into that, since tf.keras.layers.Concatenate() was used as the preprocessing_combiner. The only way I know of to combine inputs into a single scalar output is with tf.keras.layers.Add(), which I don't think you are using. Do you know what makes the policy produce a single scalar output even when tf.keras.layers.Concatenate() is used?

boomanaiden154 commented 6 months ago

tf.keras.layers.Add does point-wise tensor addition and doesn't accumulate anything, so it doesn't actually combine anything into a single scalar. For the inlining policy specifically, the output spec is set here: https://github.com/google/ml-compiler-opt/blob/6ff14fd3092eeaec6d614169bc96a7b55c9d383a/compiler_opt/rl/inlining/config.py#L73 The network is just a tf_agents.networks.actor_distribution_network.ActorDistributionNetwork with some feed-forward layers. If everything is working how I assume, the output should just be a vector-vector multiplication of the last layer's outputs with a weight vector (and maybe some bias addition). I'd have to look into the details of how exactly the post-processing gets done to confirm everything, though.
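
To make that concrete, here is a small sketch (not the repo's actual config; the feature names, preprocessing, and layer sizes below are made up for illustration) showing that the output size comes from the action spec handed to ActorDistributionNetwork, while Concatenate only merges the preprocessed features into one feature vector:

```python
# Sketch: the scalar "inlining_decision" output is dictated by the action
# (output) tensor spec, not by the preprocessing combiner. Feature names and
# layer sizes here are hypothetical.
import tensorflow as tf
from tf_agents.networks import actor_distribution_network
from tf_agents.specs import tensor_spec

# Hypothetical observation spec: two scalar int64 features.
observation_spec = {
    'callee_basic_block_count': tf.TensorSpec(shape=(), dtype=tf.int64),
    'callsite_height': tf.TensorSpec(shape=(), dtype=tf.int64),
}

# A 0/1 decision spec, similar in spirit to the inlining_decision spec in
# compiler_opt/rl/inlining/config.py.
action_spec = tensor_spec.BoundedTensorSpec(
    shape=(), dtype=tf.int64, minimum=0, maximum=1, name='inlining_decision')

# Each feature gets a per-feature preprocessing layer; Concatenate then merges
# the preprocessed (batch, 1) tensors into a (batch, num_features) vector. The
# projection down to the parameters of a distribution over {0, 1} is done by
# the network's final layers, which are sized from action_spec.
net = actor_distribution_network.ActorDistributionNetwork(
    observation_spec,
    action_spec,
    preprocessing_layers={
        k: tf.keras.layers.Lambda(
            lambda x: tf.cast(tf.expand_dims(x, -1), tf.float32))
        for k in observation_spec
    },
    preprocessing_combiner=tf.keras.layers.Concatenate(),
    fc_layer_params=(64, 64))
```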

TIHan commented 6 months ago

Does expanding the dims in a Lambda layer, such as in https://github.com/google/ml-compiler-opt/blob/main/compiler_opt/rl/feature_ops.py#L70 , have any impact? It's also done for identity and discard.

boomanaiden154 commented 6 months ago

It shouldn't impact the output shape. The line you linked also calls tf.concat afterwards, so the input should end up being the same shape, just normalized.

You should be able to hook up a debugger (or even just do printf-style debugging) to figure out how the tensors flow through the tooling. You'll probably hit the tracing stage rather than something at runtime, due to how TensorFlow constructs the graph and then executes that rather than the Python code, but that will give you a decent idea of the overall shapes.

TIHan commented 5 months ago

The tf.expand_dims(obs, -1) in preprocessing is what allows the NN to use Concatenate as the preprocessing combiner; I verified this. I'm actually not sure why, though :).
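
For what it's worth, a small shape check (my own illustration, assuming each observation arrives as a batch of scalars) shows what the expand_dims buys:

```python
# Why expand_dims matters for Concatenate: each observation comes in as a
# batch of scalars, i.e. shape (batch,).
import tensorflow as tf

batch = 4
obs_a = tf.zeros([batch])   # shape (4,)
obs_b = tf.ones([batch])    # shape (4,)

# Without expand_dims, concatenating along the last axis just glues the two
# batches together: shape (8,), which is not a per-example feature vector.
print(tf.concat([obs_a, obs_b], axis=-1).shape)      # (8,)

# With expand_dims, each observation becomes (batch, 1), so Concatenate stacks
# the features side by side into a (batch, num_features) matrix.
a = tf.expand_dims(obs_a, -1)   # shape (4, 1)
b = tf.expand_dims(obs_b, -1)   # shape (4, 1)
print(tf.keras.layers.Concatenate()([a, b]).shape)   # (4, 2)
```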