
[Feature Request] Support matmul with pre-transposed weights. #9709

Open · marty1885 opened 6 days ago

marty1885 commented 6 days ago

Is your feature request related to a problem? Please describe.

I'm working on integrating TTNN/Metalium into GGML/llama.cpp. I'm currently in the very early stages of hooking everything up and seeing where it breaks, and for now I'm looking into integrating the matrix multiplication API.

And I ran into a problem: weights in GGML are stored pre-transposed, so during inference I have to transpose them before sending them to matrix multiplication. Neither tt::tt_metal::fully_connected nor ttnn::operations::matmul::matmul has a parameter that allows multiplying with a pre-transposed matrix, forcing me to transpose the weights on every inference run.

// In GGML, a is the weight and b is the activation.
auto aT = tt::tt_metal::transpose(a, -2, -1);  // extra transpose on every run
auto out = tt::tt_metal::fully_connected(b, aT);

Describe the solution you'd like

It would be great if TTNN natively supported multiplying with pre-transposed matrices.

I understand I can do tricks like tagging the tensors and transposing only once during inference (I can't do it at load time, since I won't yet know which operation will be used on the weight). But that's ugly and error prone, especially for cases where GGML wants to read the weights back (e.g. when generating a LoRA). A minimal sketch of this caching approach is shown below.
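
The sketch assumes a hypothetical WeightCache helper (not an existing TTNN facility) and reuses the tensor type and transpose call from the snippet above; it lazily transposes each weight on first use:

#include <map>

// Hypothetical cache holding one transposed copy per weight tensor, keyed by
// the weight's address. Illustrative only; not part of TTNN.
struct WeightCache {
    std::map<const void*, tt::tt_metal::Tensor> transposed;

    // Return the cached transposed weight, computing it on the first call so
    // the transpose cost is paid once rather than on every inference run.
    tt::tt_metal::Tensor& get_transposed(const tt::tt_metal::Tensor& w) {
        auto it = transposed.find(&w);
        if (it == transposed.end())
            it = transposed.emplace(&w, tt::tt_metal::transpose(w, -2, -1)).first;
        return it->second;
    }
};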

Describe alternatives you've considered

Anything is on the table as long as I can make GGML happy without additional compute.


TT-BrianLiu commented 6 days ago

I don't fully understand the specs and terminology of GGML (to me, the input order and transposed vs. non-transposed all look flipped), but this is what happens for our ops (applies to both fully_connected and matmul):

marty1885 commented 6 days ago

> What you need is a matmul that supports [M, K] x [N, K]. Is this correct?

Yes.

The term "pre-transposed" comes from traditional HPC. In a normal matmul the right-hand matrix is read in column order, which is slow because of DRAM and cache access patterns. If that matrix will be reused, a common solution is to store it transposed and perform all operations against the transposed copy, so memory access becomes linear and fast. GGML has this optimization built into its stored model files. A sketch of the two access patterns follows.
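
To make the access pattern concrete, here is a deliberately naive C++ sketch over row-major float arrays (no TTNN involved): in the normal loop B is walked down a column with stride N, while a pre-transposed Bt is walked along its rows with stride 1:

#include <cstddef>

// C[M,N] = A[M,K] * B[K,N]. B is read column-wise (stride N), which thrashes
// DRAM and cache lines for large N.
void matmul_naive(const float* A, const float* B, float* C,
                  std::size_t M, std::size_t N, std::size_t K) {
    for (std::size_t m = 0; m < M; ++m)
        for (std::size_t n = 0; n < N; ++n) {
            float acc = 0.0f;
            for (std::size_t k = 0; k < K; ++k)
                acc += A[m * K + k] * B[k * N + n];  // strided read of B
            C[m * N + n] = acc;
        }
}

// Same product with Bt stored as [N,K] (the pre-transposed weight). Both
// operands are now read contiguously in the inner loop.
void matmul_pretransposed(const float* A, const float* Bt, float* C,
                          std::size_t M, std::size_t N, std::size_t K) {
    for (std::size_t m = 0; m < M; ++m)
        for (std::size_t n = 0; n < N; ++n) {
            float acc = 0.0f;
            for (std::size_t k = 0; k < K; ++k)
                acc += A[m * K + k] * Bt[n * K + k];  // stride-1 read of Bt
            C[m * N + n] = acc;
        }
}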

TT-BrianLiu commented 6 days ago

We can build support for this into our matmul ops. What do you propose to help distinguish the regular matmul from the pre-transposed one, API-wise?

marty1885 commented 5 days ago

I think most projects ended up adding a parameter that denotes whether tensor A/B is pre-transposed. I'm only interested in a transposed B, but feel free to add transposed-A support as well.

Ref: the CBLAS and TensorFlow APIs:

void cblas_sgemm(CBLAS_LAYOUT layout,
                 CBLAS_TRANSPOSE TransA,
                 CBLAS_TRANSPOSE TransB,  /* <--- add this as an optional parameter */
                 const CBLAS_INT M,
                 const CBLAS_INT N,
                 const CBLAS_INT K,
                 const float alpha,
                 const float *A,
                 const CBLAS_INT lda,
                 const float *B,
                 const CBLAS_INT ldb,
                 const float beta,
                 float *C,
                 const CBLAS_INT ldc);

tf.compat.v1.linalg.matmul(
    a,
    b,
    transpose_a=False,
    transpose_b=False,  # <-- add this
    adjoint_a=False,
    adjoint_b=False,
    a_is_sparse=False,
    b_is_sparse=False,
    output_type=None,
    grad_a=False,
    grad_b=False,
    name=None
)
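
Applied to TTNN, the call might then look something like this; transpose_b here is only the proposed parameter from this issue, and the exact signature is hypothetical:

// Hypothetical future call: b is consumed directly in its stored [N, K]
// layout, with no separate transpose kernel beforehand.
auto out = ttnn::operations::matmul::matmul(
    a,                       // activations, [M, K]
    b,                       // pre-transposed weights, [N, K]
    /*transpose_a=*/false,   // proposed parameter
    /*transpose_b=*/true);   // proposed parameter
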
TT-BrianLiu commented 5 days ago

I will look into this and let you know when it's supported.

TT-BrianLiu commented 3 days ago

Take a look at this PR and my comments on what was done: https://github.com/tenstorrent/tt-metal/pull/9836. Once this is merged, it should unblock you. (I didn't rigorously test every API with this new feature, but it should work.)

marty1885 commented 3 days ago

@TT-BrianLiu Sorry, I caused confusion. I don't think the PR addresses the issue. I already manually transpose the weight matrix; the reason for this feature request is to get rid of that transpose for better performance, or at least to run it only once.

The tensor-tracking approach is an alternative I could use to get better performance, but it is very error prone and thus not desirable.

My bad.

TT-BrianLiu commented 2 hours ago

Ah, ok. I think it's still cleaner to have the transposes baked into our matmul API, and it will help us think about how to support the transposes natively in the matmul in the future.