pytorch / executorch

On-device AI across mobile, embedded and edge for PyTorch
https://pytorch.org/executorch/

How to link custom ops? #4510

Closed BlackSamorez closed 2 weeks ago

BlackSamorez commented 1 month ago

Hi!

I'm trying to integrate some quantized MatMul C++ kernels into ExecuTorch and I'm having a bad time: the documentation is very vague about what exactly I need to include/link for ATen to pick up my ops.

I would greatly appreciate any help in trying to make it work.

Overview:

The source code for the dynamic library containing the ops consists of three files: lut_kernel.h, lut_kernel.cpp, and lut_kernel_pytorch.cpp. They contain roughly this code:

// lut_kernel.h
#pragma once

#include <executorch/runtime/kernel/kernel_includes.h>

namespace torch {
namespace executor {

namespace native {

Tensor& code2x8_lut_matmat_out(
  RuntimeContext& ctx,
  const Tensor& input,
  const Tensor& codes,
  const Tensor& codebooks,
  const Tensor& scales,
  const optional<Tensor>& bias,
  Tensor& out
);
} // namespace native
} // namespace executor
} // namespace torch

// lut_kernel.cpp
#include "lut_kernel.h"

#include <executorch/extension/kernel_util/make_boxed_from_unboxed_functor.h>

namespace torch {
  namespace executor {
    namespace native {
      Tensor& code2x8_lut_matmat_out(
        RuntimeContext& ctx,
        const Tensor& input,
        const Tensor& codes,
        const Tensor& codebooks,
        const Tensor& scales,
        const optional<Tensor>& bias,
        Tensor& out
      ) {
        // CALCULATIONS
        return out;
      }
    } // namespace native
  } // namespace executor
} // namespace torch

EXECUTORCH_LIBRARY(aqlm, "code2x8_lut_matmat.out", torch::executor::native::code2x8_lut_matmat_out);

// lut_kernel_pytorch.cpp
#include "lut_kernel.h"

#include <executorch/extension/aten_util/make_aten_functor_from_et_functor.h>
#include <executorch/extension/kernel_util/make_boxed_from_unboxed_functor.h>

#include <torch/library.h>

namespace torch {
    namespace executor {
        namespace native {
            Tensor& code2x8_lut_matmat_out_no_context(
                ...
                Tensor& output
            ) {
                void* memory_pool = malloc(10000000 * sizeof(uint8_t));
                MemoryAllocator allocator(10000000, (uint8_t*)memory_pool);

                exec_aten::RuntimeContext context{nullptr, &allocator};
                return torch::executor::native::code2x8_lut_matmat_out(
                    context,
                    ...,
                    output
                );
            }

            at::Tensor code2x8_lut_matmat(
                ...
            ) {
                auto sizes = input.sizes().vec();
                sizes[sizes.size() - 1] = codes.size(1) * codebooks.size(2);
                auto out = at::empty(sizes,
                    at::TensorOptions()
                    .dtype(input.dtype())
                    .device(input.device())
                );

                WRAP_TO_ATEN(code2x8_lut_matmat_out_no_context, 5)(
                    ...,
                    out
                );
                return out;
            }
        } // namespace native
    } // namespace executor
} // namespace torch

TORCH_LIBRARY(aqlm, m) {
  m.def(
      "code2x8_lut_matmat(Tensor input, Tensor codes, "
      "Tensor codebooks, Tensor scales, Tensor? bias=None) -> Tensor"
  );
  m.def(
      "code2x8_lut_matmat.out(Tensor input, Tensor codes, "
      "Tensor codebooks, Tensor scales, Tensor? bias=None, *, Tensor(c!) out) -> Tensor(c!)"
  );
}

TORCH_LIBRARY_IMPL(aqlm, CompositeExplicitAutograd, m) {
  m.impl(
      "code2x8_lut_matmat", torch::executor::native::code2x8_lut_matmat
  );
  m.impl(
      "code2x8_lut_matmat.out",
      WRAP_TO_ATEN(torch::executor::native::code2x8_lut_matmat_out_no_context, 5)
    );
}

This closely follows the ExecuTorch custom SDPA code.

I build these as two standalone dynamic libraries: one from lut_kernel.cpp, depending only on executorch, and one from lut_kernel_pytorch.cpp, with an additional torch dependency. I load the latter into PyTorch with torch.ops.load_library("../libaqlm_bindings.dylib"), roughly as sketched below.
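
In case it helps, the loading step on the PyTorch side looks roughly like this (a minimal sketch; the path is from my local build):

# Illustrative only.
import torch

# Load the dynamic library with the ATen bindings so that
# torch.ops.aqlm.* becomes visible to PyTorch.
torch.ops.load_library("../libaqlm_bindings.dylib")

# Sanity check: the overloads declared in TORCH_LIBRARY should now resolve.
print(torch.ops.aqlm.code2x8_lut_matmat)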

The problem:

I wrote a small nn.Module that basically just calls the op, and in PyTorch it works well.
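
The module is roughly this (a minimal sketch; the class name and parameter handling are illustrative, not my exact code):

import torch
from torch import nn

class Code2x8LutMatmat(nn.Module):
    # Holds the quantized weights as non-trainable parameters and
    # forwards everything to the custom op.
    def __init__(self, codes, codebooks, scales, bias):
        super().__init__()
        self.codes = nn.Parameter(codes, requires_grad=False)
        self.codebooks = nn.Parameter(codebooks, requires_grad=False)
        self.scales = nn.Parameter(scales, requires_grad=False)
        self.bias = nn.Parameter(bias, requires_grad=False)

    def forward(self, input):
        return torch.ops.aqlm.code2x8_lut_matmat(
            input, self.codes, self.codebooks, self.scales, self.bias
        )

The aten_dialect for it looks like this: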

ExportedProgram:
    class GraphModule(torch.nn.Module):
        def forward(self, p_codes: "i8[3072, 128, 2]", p_codebooks: "f32[2, 256, 1, 8]", p_scales: "f32[3072, 1, 1, 1]", p_bias: "f32[3072]", input: "f32[s0, s1, 1024]"):
            input_1 = input

            # File: /Users/blacksamorez/reps/AQLM/inference_lib/src/aqlm/inference.py:74 in forward, code: return torch.ops.aqlm.code2x8_lut_matmat(
            code2x8_lut_matmat: "f32[s0, s1, 1024]" = torch.ops.aqlm.code2x8_lut_matmat.default(input_1, p_codes, p_codebooks, p_scales, p_bias);  input_1 = p_codes = p_codebooks = p_scales = p_bias = None
            return (code2x8_lut_matmat,)

Graph signature: ExportGraphSignature(input_specs=[InputSpec(kind=<InputKind.PARAMETER: 2>, arg=TensorArgument(name='p_codes'), target='codes', persistent=None), InputSpec(kind=<InputKind.PARAMETER: 2>, arg=TensorArgument(name='p_codebooks'), target='codebooks', persistent=None), InputSpec(kind=<InputKind.PARAMETER: 2>, arg=TensorArgument(name='p_scales'), target='scales', persistent=None), InputSpec(kind=<InputKind.PARAMETER: 2>, arg=TensorArgument(name='p_bias'), target='bias', persistent=None), InputSpec(kind=<InputKind.USER_INPUT: 1>, arg=TensorArgument(name='input'), target=None, persistent=None)], output_specs=[OutputSpec(kind=<OutputKind.USER_OUTPUT: 1>, arg=TensorArgument(name='code2x8_lut_matmat'), target=None)])
Range constraints: {s0: VR[1, 9223372036854775806], s1: VR[1, 9223372036854775806]}

But when calling to_edge I get an error saying that Operator torch._ops.aqlm.code2x8_lut_matmat.default is not Aten Canonical.
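
For completeness, the export and lowering calls are roughly this (a sketch; the placeholder tensor shapes just mirror the exported program above):

import torch
from torch.export import export
from executorch.exir import to_edge

# Assumes torch.ops.load_library(...) has already been called as above.
codes = torch.zeros(3072, 128, 2, dtype=torch.int8)
codebooks = torch.zeros(2, 256, 1, 8)
scales = torch.ones(3072, 1, 1, 1)
bias = torch.zeros(3072)
model = Code2x8LutMatmat(codes, codebooks, scales, bias)
example_input = torch.randn(1, 4, 1024)

# Export to ATen dialect -- this step succeeds.
aten_dialect = export(model, (example_input,))

# This call fails with "Operator torch._ops.aqlm.code2x8_lut_matmat.default
# is not Aten Canonical".
edge_program = to_edge(aten_dialect)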

I don't conceptually understand how the EXECUTORCH_LIBRARY macro from lut_kernel.cpp is supposed to make it Aten Canonical. Should I somehow recompile executorch to include my kernel?

Thank you!

BlackSamorez commented 1 month ago

I added compile_config=EdgeCompileConfig(_check_ir_validity=False) to the to_edge call (see the sketch below) and it appears to export now. After linking libaqlm.dylib into executor_runner (and replacing executorch with executorch_no_prim_ops in its libs) I'm able to compile it. However, when running it, I encounter an error like this:

E 00:00:00.001621 executorch:method.cpp:536] Missing operator: [0] aqlm::code2x8_lut_matmat.out
E 00:00:00.001623 executorch:method.cpp:724] There are 1 instructions don't have corresponding operator registered. See logs for details
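
For reference, the export side that produces the .pte I'm running now looks roughly like this (a sketch; the output filename is illustrative):

from executorch.exir import EdgeCompileConfig, to_edge

# Skip the ATen-canonical IR check so the custom op is allowed through.
edge_program = to_edge(
    aten_dialect,
    compile_config=EdgeCompileConfig(_check_ir_validity=False),
)

et_program = edge_program.to_executorch()
with open("aqlm_module.pte", "wb") as f:
    f.write(et_program.buffer)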

I'm on executorch v0.3.0.

digantdesai commented 2 weeks ago

@larryliu0820 any suggestions?

BlackSamorez commented 2 weeks ago

@digantdesai Hi! Thanks for the reply. I think we shifted the discussion to #4719. In light of that, I'm closing this issue.