karpathy / llm.c

LLM training in simple, raw C/CUDA
MIT License

write LLVM optimization passes for train_gpt2 #18

Open ent0n29 opened 7 months ago

ent0n29 commented 7 months ago

Here is a little example:

Multiplications where one operand is a constant integer power of 2 are optimized into a shift operation: the shift amount is computed as the logBase2 of the constant, so for example x * 8 becomes x << 3.

// includes needed by this snippet
#include <set>
#include <utility>

#include "llvm/IR/Constants.h"
#include "llvm/IR/InstrTypes.h"
#include "llvm/IR/Instruction.h"
#include "llvm/Support/raw_ostream.h"

using namespace llvm;

bool optBasicStrengthReduction(Instruction &I) {
  auto OpCode = I.getOpcode();

  if (OpCode != Instruction::Mul) return false;

  Value *Op1 = I.getOperand(0);
  Value *Op2 = I.getOperand(1);
  ConstantInt *CI = nullptr;

  // Check if op is a constant integer and is a power of 2
  auto isConstPowOf2 = [&CI](Value *op) {
    return (CI = dyn_cast<ConstantInt>(op))
      and CI->getValue().isPowerOf2()
      and not CI->isOne();
  };

  // Canonicalize: move the constant power-of-2 operand to Op2
  if (isConstPowOf2(Op1)) std::swap(Op1, Op2);
  if (not isConstPowOf2(Op2)) return false;

  errs() << "Triggered train_gpt2 optimization\n";

  // Shift amount calculation
  unsigned ShiftAmount = CI->getValue().logBase2();

  // Create a new shift instruction
  Instruction *ShiftInst = BinaryOperator::Create(
    Instruction::Shl,
    Op1, ConstantInt::get(CI->getType(), ShiftAmount)
  );

  ShiftInst->insertAfter(&I);
  I.replaceAllUsesWith(ShiftInst);

  return true;
}
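
Note that the pass deliberately does not erase the original mul: erasing an instruction while iterating over its basic block would invalidate the iteration, so the original is only rewired with replaceAllUsesWith, and removal is deferred to the caller below.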

and we need to add a call to the optimization in a runOnBasicBlock function, which also erases the instructions that were replaced:

bool runOnBasicBlock(BasicBlock &B) {
  bool globallyModified = false;
  std::set<Instruction*> toBeErased;

  for (auto &I : B) {
    bool locallyModified =
      // here you can add all your opt passes; || short-circuits,
      // so at most one pass rewrites a given instruction
      optBasicStrengthReduction(I)
        || optExample2(I)
        || optExample3(I)
        || optExample4(I);
        // ...

    // the replaced instruction is now dead; mark it for removal
    if (locallyModified) {
      toBeErased.insert(&I);
      globallyModified = true;
    }
  }

  for (auto *I : toBeErased) {
    I->eraseFromParent();
  }

  return globallyModified;
}
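
For reference, here is a minimal sketch of the plugin boilerplate these two functions could live in, assuming the legacy pass manager (which is what the opt -load / -local-opts invocation below expects); the LocalOpts class name and pass description are assumptions matching the plugin name, not code from the original comment:

#include "llvm/Pass.h"
#include "llvm/IR/Function.h"

using namespace llvm;

namespace {
// Hypothetical wrapper: drives runOnBasicBlock (defined above in the
// same file) over every basic block of every function.
struct LocalOpts : public FunctionPass {
  static char ID;
  LocalOpts() : FunctionPass(ID) {}

  bool runOnFunction(Function &F) override {
    bool Modified = false;
    for (auto &BB : F)
      Modified = runOnBasicBlock(BB) || Modified;
    return Modified;
  }
};
} // end anonymous namespace

char LocalOpts::ID = 0;
static RegisterPass<LocalOpts> X("local-opts", "llm.c local optimizations");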

to apply the passes we need to compile train_gpt2 to LLVM IR using clang:

$ clang -emit-llvm -c train_gpt2.c -o train_gpt2.bc
# apply the opt pass
$ opt -load ./build/LocalOpts.so -local-opts train_gpt2.bc -o train_gpt2_opt.bc
# compile the optimized bitcode into an executable
$ clang train_gpt2_opt.bc -o train_gpt2_opt
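
One version-dependent caveat: at -O0, recent clang releases attach the optnone attribute to every function, which makes opt skip them entirely, so if the pass never fires, recompiling with clang -O0 -Xclang -disable-O0-optnone -emit-llvm -c train_gpt2.c should help. Newer opt binaries also default to the new pass manager, so the legacy -load / -local-opts invocation may need -enable-new-pm=0 (or a port of the plugin to the new pass manager).
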
chadbrewbaker commented 7 months ago

I was discussing this yesterday with @jonmasters. Ideally this would be a script that takes llm.c and transforms it into specialized but still legible C code for a particular architecture. It could do buffer size tuning, etc., like Mojo šŸ”„

It would also be nice to have a memory/cache layout visualizer.

@blasty has some great human friendly inline assembler examples https://github.com/blasty/unwyze/blob/638e7d17e752a30a3e758f51e436f752954afbd4/exploit/src/main.c#L180

ent0n29 commented 7 months ago

looking into it!