huggingface / candle

Minimalist ML framework for Rust

Linear layer with same weights, biases, and inputs gives different output than Pytorch #2250

Open EricLBuehler opened 3 months ago

EricLBuehler commented 3 months ago

Hello all,

Thank you for your great work here. While testing the Phi 3 Vision model in mistral.rs, I traced an output mismatch down to a single linear layer. By exporting the tensors to NumPy and comparing them, I verified that the inputs, weights, and biases are identical, yet the outputs differ. I have attached the files needed to reproduce this, along with a Rust script that reproduces the error and a Python script that shows the inputs match.

Note: here is what each file name means:

Rust program to reproduce the error

This just loads the tensors from the NumPy files, converts them to BF16, and runs the Linear layer forward pass. Using the weights taken directly from Phi-3-vision-128k-instruct instead has no effect.

I ran this with the cuda feature enabled.

use candle_core::{Device, Tensor, DType, Module};
use candle_nn::Linear;

fn main() {
    let dev = Device::cuda_if_available(0).unwrap();

    // Load the exported weight and bias, then cast to BF16 to match the model.
    let weight = Tensor::read_npy("../mistral.rs/layerhiddenweight.npy").unwrap().to_device(&dev).unwrap().to_dtype(DType::BF16).unwrap();
    let bias = Tensor::read_npy("../mistral.rs/layerhiddenbias.npy").unwrap().to_device(&dev).unwrap().to_dtype(DType::BF16).unwrap();
    let layer = Linear::new(weight, Some(bias));

    // Forward pass on the exported input.
    let inp = Tensor::read_npy("../mistral.rs/inp.npy").unwrap().to_device(&dev).unwrap().to_dtype(DType::BF16).unwrap();
    let res = layer.forward(&inp).unwrap();
    dbg!(&res.to_dtype(DType::F32).unwrap().mean_all());

    // Reference output exported from the PyTorch side.
    let truth = Tensor::read_npy("../mistral.rs/xs.npy").unwrap().to_device(&dev).unwrap().to_dtype(DType::BF16).unwrap();
    dbg!(&truth.to_dtype(DType::F32).unwrap().mean_all());

    // Save the candle output so the Python script below can compare it.
    res.to_dtype(DType::F32).unwrap().write_npy("testingout.npy").unwrap();
    println!("Wrote output.");
}

Python script to compare outputs

import numpy as np

# Compare the tensors exported from mistral.rs (candle) against the ones
# exported from the PyTorch run of Phi-3-vision-128k-instruct.

# Input to the linear layer.
mistralrs = np.load("mistral.rs/inp.npy")
py = np.load("Phi-3-vision-128k-instruct/imp.npy")
print(mistralrs.shape, py.shape)
print("inp", np.allclose(mistralrs, py))

# Linear layer weight.
mistralrs = np.load("mistral.rs/layerhiddenweight.npy")
py = np.load("Phi-3-vision-128k-instruct/layerhiddenweight.npy")
print(mistralrs.shape, py.shape)
print("weight", np.allclose(mistralrs, py))

# Linear layer bias.
mistralrs = np.load("mistral.rs/layerhiddenbias.npy")
py = np.load("Phi-3-vision-128k-instruct/layerhiddenbias.npy")
print(mistralrs.shape, py.shape)
print("bias", np.allclose(mistralrs, py))

# Layer output produced inside mistral.rs vs. the PyTorch reference.
mistralrs = np.load("mistral.rs/xs.npy")
py = np.load("Phi-3-vision-128k-instruct/xs.npy")
print(mistralrs.shape, py.shape)
print("out1", np.allclose(mistralrs, py))
print(mistralrs[:, 5:10, :5] - py[:, 5:10, :5])

# Layer output produced by the standalone reproduction above vs. the PyTorch reference.
mistralrs = np.load("testing/testingout.npy")
py = np.load("Phi-3-vision-128k-instruct/xs.npy")
print(mistralrs.shape, py.shape)
print("out2", np.allclose(mistralrs, py))
print(mistralrs[:, 5:10, :5] - py[:, 5:10, :5])

Result of Python script

As the printout below shows, the inputs, weights, and biases match, but the outputs differ, both for mistral.rs (out1) and for the reproduction script above (out2).

(1, 1921, 4096) (1, 1921, 4096)
inp True
(3072, 4096) (3072, 4096)
weight True
(3072,) (3072,)
bias True
(1, 1921, 3072) (1, 1921, 3072)
out1 False
[[[0.       0.       0.       0.015625 0.      ]
  [0.       0.       0.       0.015625 0.      ]
  [0.       0.       0.       0.015625 0.      ]
  [0.       0.       0.       0.015625 0.      ]
  [0.       0.       0.       0.015625 0.      ]]]
(1, 1921, 3072) (1, 1921, 3072)
out2 False
[[[0.       0.       0.       0.015625 0.      ]
  [0.       0.       0.       0.015625 0.      ]
  [0.       0.       0.       0.015625 0.      ]
  [0.       0.       0.       0.015625 0.      ]
  [0.       0.       0.       0.015625 0.      ]]]
LaurentMazare commented 3 months ago

I don't think there are many guarantees that the values will be exactly equal on both sides. Especially when using bfloat16, which has only 7 mantissa bits (8 significand bits counting the implicit one), an error on the order of a percent would be somewhat expected. Things that could be worth checking:

- How much is the error when using f32 on both sides?
- What happens without biases? The error being in the same column might indicate that this is the culprit, and it may be caused by PyTorch fusing the add and mul whereas candle doesn't do it for now.
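
To put a number on the bfloat16 point above, here is a minimal sketch (not from the thread) using the `half` crate, which candle builds its BF16 type on. Assuming the affected outputs have magnitude in [2.0, 4.0), one bf16 ULP there is 2^-6 = 0.015625, which is exactly the per-element difference printed in the comparison above.

use half::bf16;

// Spacing between consecutive bf16 values around 3.0. With 7 explicit
// mantissa bits, the gap between neighbouring bf16 values in [2.0, 4.0)
// is 2^-6 = 0.015625.
fn main() {
    let x = bf16::from_f32(3.0);
    let next = bf16::from_bits(x.to_bits() + 1); // next representable bf16
    println!("x        = {}", x.to_f32());                  // 3
    println!("next(x)  = {}", next.to_f32());               // 3.015625
    println!("ulp at x = {}", next.to_f32() - x.to_f32());  // 0.015625
}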

EricLBuehler commented 3 months ago

Thank you for giving me those tips; I think I figured out what the problem is:

How much is the error when using f32 on both sides?

About the same; there is no change to the output.
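
For reference, a minimal sketch of that f32 check, reusing the same .npy paths as the reproduction above and assuming the exported files are already stored as F32; the helper name `f32_check` is just for illustration. The only change from the reproduction is skipping the BF16 casts.

use candle_core::{Device, Module, Result, Tensor};
use candle_nn::Linear;

// Same forward pass as the reproduction above, but staying in F32 throughout
// (no `.to_dtype(DType::BF16)` conversions).
fn f32_check(dev: &Device) -> Result<Tensor> {
    let weight = Tensor::read_npy("../mistral.rs/layerhiddenweight.npy")?.to_device(dev)?;
    let bias = Tensor::read_npy("../mistral.rs/layerhiddenbias.npy")?.to_device(dev)?;
    let inp = Tensor::read_npy("../mistral.rs/inp.npy")?.to_device(dev)?;
    Linear::new(weight, Some(bias)).forward(&inp)
}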

What happens without biases? The error being in the same column might indicate that this is the culprit, and it may be caused by PyTorch fusing the add and mul whereas candle doesn't do it for now.

It turns out that if I only do a matmul in PyTorch and then add the bias separately, the two output tensors are identical! So when the PyTorch side follows xW^T + b strictly (as Candle does), the output is the same.
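
A minimal sketch of what "following xW^T + b strictly" means here, written with standard candle ops (broadcast_matmul, t, broadcast_add); the helper `unfused_linear` and the shape comments are illustrative, not candle's exact Linear implementation. Mirroring this unfused sequence on the PyTorch side, i.e. a plain matmul followed by a separate bias add rather than the fused linear kernel, is what made the outputs match.

use candle_core::{Result, Tensor};

// Unfused path: matmul against the transposed weight, then a separate
// broadcast add of the bias.
fn unfused_linear(x: &Tensor, w: &Tensor, b: &Tensor) -> Result<Tensor> {
    // x: (1, 1921, 4096), w: (3072, 4096), b: (3072,)
    let y = x.broadcast_matmul(&w.t()?)?; // (1, 1921, 3072)
    y.broadcast_add(b)                    // bias added as a separate op
}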

Would it be possible to add this fusion of the add and mul to Candle too, since it would fix this? Alternatively, is there something I can do on my side to fix this? Thank you so much!

EricLBuehler commented 3 months ago

I wrote some code to test fusion using cuBLASLt with a FusedLinearBias layer: https://github.com/EricLBuehler/mistral.rs/blob/44e8a2291d6d53fa125907925c0a4cc613cb8855/mistralrs-core/src/layers.rs#L401-L451

This gets rid of the error. Would you be interested in me submitting a PR to add fused linear-bias support directly to Linear?
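
A rough sketch, not the linked mistral.rs implementation, of the shape such a layer could take if added to candle_nn: keep the Linear weights and, on CUDA, dispatch to a fused GEMM-with-bias-epilogue kernel, falling back to the existing unfused path elsewhere. The cuBLASLt call itself is left as a comment because its exact API depends on the bindings used.

use candle_core::{Module, Result, Tensor};

// Hypothetical fused linear layer: weight is (out_features, in_features),
// bias is (out_features,).
struct FusedLinearBias {
    weight: Tensor,
    bias: Tensor,
}

impl Module for FusedLinearBias {
    fn forward(&self, x: &Tensor) -> Result<Tensor> {
        // On CUDA devices, this is where a cuBLASLt matmul with a bias
        // epilogue would go (as in the mistral.rs layer linked above),
        // matching PyTorch's fused behaviour; the bindings are omitted from
        // this sketch. Everywhere else, fall back to the current unfused
        // path: matmul, then a separate broadcast add of the bias.
        x.broadcast_matmul(&self.weight.t()?)?.broadcast_add(&self.bias)
    }
}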