huggingface / candle

Minimalist ML framework for Rust
Apache License 2.0

Recent revision for contiguous check has problems #1991

Open guoqingbao opened 7 months ago

guoqingbao commented 7 months ago

The following code for the contiguous check in shape.rs triggers problems for squeezed tensors (n-dim to 1-dim) because of the "if dim > 1" condition (recently added).

    /// Returns true if the strides are C contiguous (aka row major).
    pub fn is_contiguous(&self, stride: &[usize]) -> bool {
        if self.0.len() != stride.len() {
            return false;
        }
        let mut acc = 1;
        for (&stride, &dim) in stride.iter().zip(self.0.iter()).rev() {
            if dim > 1 && stride != acc { // Need to remove the "dim > 1" condition
                return false;
            }
            acc *= dim;
        }
        true
    }

    /// Returns true if the strides are Fortran contiguous (aka column major).
    pub fn is_fortran_contiguous(&self, stride: &[usize]) -> bool {
        if self.0.len() != stride.len() {
            return false;
        }
        let mut acc = 1;
        for (&stride, &dim) in stride.iter().zip(self.0.iter()) {
            if dim > 1 && stride != acc { // Need to remove the "dim > 1" condition
                return false;
            }
            acc *= dim;
        }
        true
    }
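
For reference, here is a minimal standalone sketch (not the candle code itself) of how the two variants of the check disagree on a layout whose size-1 dims carry "wrong" strides, such as the broadcast layout [1, 1, 13, 13] with stride [0, 0, 13, 1] discussed further down this thread:

    // Old check: every stride must match the row-major accumulator.
    fn is_contiguous_old(dims: &[usize], stride: &[usize]) -> bool {
        if dims.len() != stride.len() {
            return false;
        }
        let mut acc = 1;
        for (&s, &d) in stride.iter().zip(dims.iter()).rev() {
            if s != acc {
                return false;
            }
            acc *= d;
        }
        true
    }

    // New check: dims of size 1 are skipped, so their stride no longer matters.
    fn is_contiguous_new(dims: &[usize], stride: &[usize]) -> bool {
        if dims.len() != stride.len() {
            return false;
        }
        let mut acc = 1;
        for (&s, &d) in stride.iter().zip(dims.iter()).rev() {
            if d > 1 && s != acc {
                return false;
            }
            acc *= d;
        }
        true
    }

    fn main() {
        let dims = [1usize, 1, 13, 13];
        let stride = [0usize, 0, 13, 1];
        // Prints "old: false, new: true": the two versions disagree on this layout.
        println!(
            "old: {}, new: {}",
            is_contiguous_old(&dims, &stride),
            is_contiguous_new(&dims, &stride)
        );
    }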
LaurentMazare commented 7 months ago

Yes, that's expected; on the positive side it also avoids some unnecessary copies in some cases. The idea is to explicitly call contiguous if you really need a contiguous tensor in the end.

guoqingbao commented 7 months ago

> Yes, that's expected; on the positive side it also avoids some unnecessary copies in some cases. The idea is to explicitly call contiguous if you really need a contiguous tensor in the end.

However, the check for a non-contiguous squeezed tensor (1-dim) now reports it as contiguous. Does that mean we need to explicitly and immediately call the contiguous method whenever a tensor has been squeezed into 1-dim?

LaurentMazare commented 7 months ago

Sorry, the tensor is actually contiguous, so there should be no need to call contiguous on it; you can force a copy of it, though, if for some reason you need a more "packed" version.

LaurentMazare commented 7 months ago

Related error: #1992. The annoying bit is that the matmul kernels consider that the layout is not contiguous whereas it actually is; I'll have to make a couple of tweaks there and hopefully this will avoid the unnecessary copy.

guoqingbao commented 7 months ago

> Related error: #1992. The annoying bit is that the matmul kernels consider that the layout is not contiguous whereas it actually is; I'll have to make a couple of tweaks there and hopefully this will avoid the unnecessary copy.

Can we consider removing the "if dim > 1" condition, as it seems to be causing issues in specific scenarios? In my case, I need to convert every tensor into a contiguous form because our custom DSA does not support non-contiguous kernels, unlike CUDA. The Candle framework had been performing very well on our hardware until this condition was introduced (I found that some 1-dim tensors squeezed from non-contiguous 2-dim tensors were falsely regarded as contiguous).

The reason Candle has not yet encountered issues on GPU, in my view, is that each CUDA kernel implementation within Candle includes a corresponding non-contiguous version, and these non-contiguous versions do not rely on the "if dim > 1" condition (they utilize the stride), which makes the problem trivial on GPU.
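
For context, here is a rough Rust rendition (an illustrative sketch, not the actual kernel code) of the strided index mapping that those non-contiguous kernel paths rely on: the flat element index is decomposed along the dims and recomposed with the source strides, so the kernel works whether or not the layout is contiguous.

    fn get_strided_index(mut idx: usize, dims: &[usize], stride: &[usize]) -> usize {
        let mut offset = 0;
        for (&d, &s) in dims.iter().zip(stride.iter()).rev() {
            offset += (idx % d) * s;
            idx /= d;
        }
        offset
    }

    fn main() {
        // A transposed 2x3 view: dims [3, 2], stride [1, 3].
        let (dims, stride) = ([3usize, 2], [1usize, 3]);
        let offsets: Vec<usize> = (0..6).map(|i| get_strided_index(i, &dims, &stride)).collect();
        println!("{offsets:?}"); // [0, 3, 1, 4, 2, 5]
    }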

guoqingbao commented 7 months ago

I found more problems with the revised versions of tensor::squeeze and tensor::unsqueeze introduced in https://github.com/huggingface/candle/pull/1884

I'm not sure about the benefits of the current revision, but it causes problems in certain cases, especially when dealing with non-contiguous tensors.

    pub fn squeeze<D: Dim>(&self, dim: D) -> Result<Self> {
        // The PyTorch semantics are to return the same tensor if the target dimension
        // does not have a size of 1.
        let dims = self.dims();
        let dim = dim.to_index(self.shape(), "squeeze")?;
        if dims[dim] == 1 {
            let mut dims = dims.to_vec();
            let mut strides = self.stride().to_vec();
            dims.remove(dim);
            strides.remove(dim);
            let tensor_ = Tensor_ {
                id: TensorId::new(),
                storage: self.storage.clone(),
                layout: Layout::new(dims.into(), strides, self.layout.start_offset()),
                op: BackpropOp::new1(self, Op::Reshape),
                is_variable: false,
                dtype: self.dtype,
                device: self.device.clone(),
            };
            Ok(Tensor(Arc::new(tensor_)))
        } else {
            Ok(self.clone())
        }
    }

    pub fn unsqueeze<D: Dim>(&self, dim: D) -> Result<Self> {
        let mut dims = self.dims().to_vec();
        let mut strides = self.stride().to_vec();
        let dim = dim.to_index_plus_one(self.shape(), "unsqueeze")?;
        // Cannot panic because to_index_plus_one already checks dimensions
        dims.insert(dim, 1);
        // Any stride would work here, but we pick one so as to maximize the probability to remain
        // C contiguous.
        let stride = if dim < strides.len() { strides[dim] } else { 1 };
        strides.insert(dim, stride);
        let tensor_ = Tensor_ {
            id: TensorId::new(),
            storage: self.storage.clone(),
            layout: Layout::new(dims.into(), strides, self.layout.start_offset()),
            op: BackpropOp::new1(self, Op::Reshape),
            is_variable: false,
            dtype: self.dtype,
            device: self.device.clone(),
        };
        Ok(Tensor(Arc::new(tensor_)))
    }

The previous version works really well:

    pub fn squeeze<D: Dim>(&self, dim: D) -> Result<Self> {
        // The PyTorch semantics are to return the same tensor if the target dimension
        // does not have a size of 1.
        let dims = self.dims();
        let dim = dim.to_index(self.shape(), "squeeze")?;
        if dims[dim] == 1 {
            let mut dims = dims.to_vec();
            dims.remove(dim);
            self.reshape(dims)
        } else {
            Ok(self.clone())
        }
    }

    pub fn unsqueeze<D: Dim>(&self, dim: D) -> Result<Self> {
        let mut dims = self.dims().to_vec();
        let dim = dim.to_index_plus_one(self.shape(), "unsqueeze")?;
        // Cannot panic because to_index_plus_one already checks dimensions
        dims.insert(dim, 1);
        self.reshape(dims)
    }
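
To make the difference concrete, here is a hedged usage sketch (assuming the candle Tensor API; the shapes are illustrative): the revised squeeze simply drops the size-1 dim and its stride, so the result is a strided view sharing storage with the source, whereas the previous reshape-based version produced a fresh contiguous copy whenever the source was not flagged as contiguous.

    use candle_core::{Device, Tensor};

    fn main() -> candle_core::Result<()> {
        let dev = Device::Cpu;
        let t = Tensor::arange(0f32, 12., &dev)?.reshape((3, 4))?;
        let col = t.narrow(1, 0, 1)?; // shape [3, 1], stride [4, 1]
        let v = col.squeeze(1)?;      // revised version: shape [3], stride [4] (a strided view)
        // With the previous reshape-based squeeze this would instead be a contiguous copy.
        println!("dims: {:?}, contiguous: {}", v.dims(), v.is_contiguous());
        Ok(())
    }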
LaurentMazare commented 7 months ago

But the point is that these tensors are actually contiguous, so making a copy for them when calling contiguous() is not necessary and can be avoided. If dim is 1 then the index in that dimension is always 0 and so the stride should never be used. Is that not the case for your backend? Could you report similar issues with the CPU backend? To me the main issue is that the matmul operator was doing contiguity checks that were not in sync with the changes, but I patched those yesterday and it seemed all good after that. I would be ok with removing the change but would like to see some actual breakage before this, to check whether there is anything I'm missing (and I'll remove the check from the cuda side too as it shouldn't be used anyway).

LaurentMazare commented 7 months ago

Also, I relaxed the contiguity check in the same way in the function you mentioned on the cuda side in #2000; all the tests passed fine, as did the couple of models I've tried. Certainly keen to see a breakage example on cuda too after this change.

guoqingbao commented 7 months ago

> Also, I relaxed the contiguity check in the same way in the function you mentioned on the cuda side in #2000; all the tests passed fine, as did the couple of models I've tried. Certainly keen to see a breakage example on cuda too after this change.

In my example, the introduction of the "if dim > 1" condition for the contiguity check causes some generation problems, e.g., the generated content is slightly different from what is expected. Have you checked the output of the model?

The expected output (without the "if dim > 1" condition):

loaded the model in 2.834793587s Please talk about deep learning in 1000 words.

Deep Learning: A Transformative Technology

Deep learning (DL) is a subfield of machine learning (ML) that enables computers to learn from data without explicit programming. By mimicking the structure and function of the human brain, DL algorithms can discover patterns and relationships in data, enabling them to make accurate predictions and decisions.

Architecture of a Deep Learning Model

A typical deep learning model consists of multiple layers of interconnected nodes or neurons. Each layer receives input data, passes it through a series of transformations, and produces an output. The model can have different architectures, but they typically share these key components:

Types of Deep Learning Models

There are various types of DL models, each with its strengths and weaknesses:

Applications of Deep Learning

Deep learning has numerous applications in various domains, including:

Advantages of Deep Learning

Disadvantages of Deep Learning

Conclusion

Deep learning is a transformative technology that has revolutionized various industries. Its ability to learn from data without explicit programming has opened up new possibilities for problem-solving. While DL models can achieve high accuracy, they also have limitations that need to be considered. As deep learning continues to evolve, it will likely play an increasingly important role in shaping the future of technology.

The current output (with the "if dim > 1" condition):

loaded the model in 2.801835715s Please talk about deep learning in 1000 words. increably, the field of artificial intelligence (AI) has witnessed a surge in the use of deep learning (DL). DL is a subfield of machine learning (ML) that allows computers to learn from data without explicit programming.

Key Concepts of Deep Learning:

Applications of Deep Learning:

Benefits of Deep Learning:

Challenges of Deep Learning:

Future of Deep Learning:

Conclusion:

Deep learning is a transformative technology that has revolutionized various industries. Its ability to learn from data without explicit programming has opened up new possibilities for solving complex problems. While DL presents challenges such as data requirements, bias, and explainability, its future holds immense potential to further advance AI and unlock unprecedented advancements in various fields.

LaurentMazare commented 7 months ago

I checked the outputs for mistral and llama2 (7b), both quantized and unquantized, and didn't see a change. I'll check again, but maybe you have a way for me to reproduce the issue? Also there might be some numerical stability changes, but nothing large, as we might be using the transposed matmul kernels vs the original ones in some cases. It would be very helpful if you have some more unitary failures too; the whole test suite is passing fine but it can certainly be expanded.

guoqingbao commented 7 months ago

> I checked the outputs for mistral and llama2 (7b), both quantized and unquantized, and didn't see a change. I'll check again, but maybe you have a way for me to reproduce the issue? Also there might be some numerical stability changes, but nothing large, as we might be using the transposed matmul kernels vs the original ones in some cases. It would be very helpful if you have some more unitary failures too; the whole test suite is passing fine but it can certainly be expanded.

I found where the problem is:

For a given tensor like [13, 1], if we perform a candle to_dtype on its broadcast version (something like shape [1, 1, 13, 13] with a stride of [0, 0, 13, 1]), the previous contiguity check regards it as non-contiguous, triggering the broadcast operation in my backend (making it contiguous implicitly), while the current check treats it as a contiguous tensor because the size-1 dims (with stride 0) in [0, 0, 13, 1] are skipped, and this causes a problem on my backend. In comparison, the cast CUDA kernel in candle is able to deal with non-contiguous tensors and can cast them into contiguous ones with get_strided_index.
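
A hedged sketch of that scenario (assuming the candle Tensor API, and taking a plain [13, 13] tensor as the broadcast source, which is what yields the [1, 1, 13, 13] layout with stride [0, 0, 13, 1] described above):

    use candle_core::{DType, Device, Tensor};

    fn main() -> candle_core::Result<()> {
        let dev = Device::Cpu;
        // Broadcasting [13, 13] to [1, 1, 13, 13] keeps strides [13, 1] and adds
        // leading size-1 dims with stride 0.
        let mask = Tensor::ones((13, 13), DType::F32, &dev)?;
        let b = mask.broadcast_as((1, 1, 13, 13))?;
        // The relaxed check reports this layout as contiguous, so a backend without
        // strided kernels never gets a chance to materialize it before the cast.
        println!("stride: {:?}, contiguous: {}", b.stride(), b.is_contiguous());
        let _casted = b.to_dtype(DType::U32)?; // the cast runs on the broadcast view
        Ok(())
    }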

Similarly, the tensor::transpose op does not change the stride of the original tensor, and the transposed tensor was also treated as contiguous in the revised version, causing problems in my backend (and also triggering get_strided_index in the underlying CUDA kernels to make it contiguous).

It seems that the copy was not actually avoided even though the tensor was marked as contiguous (the underlying CUDA kernels do the dirty work using get_strided_index).

LaurentMazare commented 7 months ago

Thanks for investigating and for the details. I'm not sure I get the full picture of it, but to me it feels like the tensor layout you mention actually is contiguous in memory and so could be handled by contiguous kernels (so it would work well in cuda post #2000), so the issue might be some assumptions on your backend side, but maybe I'm missing something?

guoqingbao commented 7 months ago

> Thanks for investigating and for the details. I'm not sure I get the full picture of it, but to me it feels like the tensor layout you mention actually is contiguous in memory and so could be handled by contiguous kernels (so it would work well in cuda post #2000), so the issue might be some assumptions on your backend side, but maybe I'm missing something?

No, the tensor was not contiguous, as I stated earlier: "the tensor::transpose operation does not change the stride of the original tensor, and the transposed tensor was treated as contiguous." The transposed tensor (with dimensions 1 and 2 swapped) should be considered non-contiguous because its memory layout would need to be altered. The CUDA kernels implicitly made these transposed tensors contiguous, which is why no obvious issues were detected in the candle GPU backend. However, the correct logic dictates that such a transpose results in a non-contiguous tensor, yet it was labeled as contiguous in the recent revision. I suspect that if all CUDA kernel contiguity checks adhere to this revision, it will lead to problems.

LaurentMazare commented 7 months ago

A tensor with shape [1, 1, 13, 13] and stride [0, 0, 13, 1] is contiguous in memory with a C layout. I've removed the bit you mentioned in the contiguous detection in the cuda kernels in #2000 and tested a whole lot of models; all of them are running fine. Again, I'd be very happy to get an example of something that doesn't work properly on cpu or gpu. I've merged #2000 and am now running all the tests and models from our internal codebase, and so far it's all good (it will take a bit of time to run everything though).
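
A quick standalone check of that claim: enumerating every index of a [1, 1, 13, 13] tensor with stride [0, 0, 13, 1] in C order visits offsets 0..=168 exactly once and in order, i.e. the layout is row-major contiguous even though the size-1 dims carry a stride of 0.

    fn main() {
        let (dims, stride) = ([1usize, 1, 13, 13], [0usize, 0, 13, 1]);
        let mut offsets = Vec::new();
        for a in 0..dims[0] {
            for b in 0..dims[1] {
                for i in 0..dims[2] {
                    for j in 0..dims[3] {
                        offsets.push(a * stride[0] + b * stride[1] + i * stride[2] + j * stride[3]);
                    }
                }
            }
        }
        assert_eq!(offsets, (0..169).collect::<Vec<_>>());
        println!("row-major contiguous: offsets are 0..=168 in order");
    }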