huggingface / candle

Minimalist ML framework for Rust
Apache License 2.0

Recent revision for contiguous check has problems #1991

Open guoqingbao opened 7 months ago

guoqingbao commented 7 months ago

The following code for the contiguous check in shape.rs triggers problems for squeezed tensors (n-dim to 1-dim) because of the "if dim > 1" condition (recently added).

    /// Returns true if the strides are C contiguous (aka row major).
    pub fn is_contiguous(&self, stride: &[usize]) -> bool {
        if self.0.len() != stride.len() {
            return false;
        }
        let mut acc = 1;
        for (&stride, &dim) in stride.iter().zip(self.0.iter()).rev() {
            if dim > 1 && stride != acc { // Need to remove the "dim > 1" condition
                return false;
            }
            acc *= dim;
        }
        true
    }

    /// Returns true if the strides are Fortran contiguous (aka column major).
    pub fn is_fortran_contiguous(&self, stride: &[usize]) -> bool {
        if self.0.len() != stride.len() {
            return false;
        }
        let mut acc = 1;
        for (&stride, &dim) in stride.iter().zip(self.0.iter()) {
            if dim > 1 && stride != acc { // Need to remove the "dim > 1" condition
                return false;
            }
            acc *= dim;
        }
        true
    }
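
For reference, here is a minimal standalone sketch (not the candle code itself) of how the two variants of the check disagree on a layout whose size-1 dims carry "wrong" strides, such as the broadcast layout [1, 1, 13, 13] with stride [0, 0, 13, 1] discussed further down this thread:

    // Old check: every stride must match the row-major accumulator.
    fn is_contiguous_old(dims: &[usize], stride: &[usize]) -> bool {
        if dims.len() != stride.len() {
            return false;
        }
        let mut acc = 1;
        for (&s, &d) in stride.iter().zip(dims.iter()).rev() {
            if s != acc {
                return false;
            }
            acc *= d;
        }
        true
    }

    // New check: dims of size 1 are skipped, so their stride no longer matters.
    fn is_contiguous_new(dims: &[usize], stride: &[usize]) -> bool {
        if dims.len() != stride.len() {
            return false;
        }
        let mut acc = 1;
        for (&s, &d) in stride.iter().zip(dims.iter()).rev() {
            if d > 1 && s != acc {
                return false;
            }
            acc *= d;
        }
        true
    }

    fn main() {
        let dims = [1usize, 1, 13, 13];
        let stride = [0usize, 0, 13, 1];
        // Prints "old: false, new: true": the two versions disagree on this layout.
        println!(
            "old: {}, new: {}",
            is_contiguous_old(&dims, &stride),
            is_contiguous_new(&dims, &stride)
        );
    }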
LaurentMazare commented 7 months ago

Yes, that's expected; on the positive side it also avoids some unnecessary copies in some cases. The idea is to explicitly call contiguous if you really need a contiguous tensor in the end.

guoqingbao commented 7 months ago

> Yes, that's expected; on the positive side it also avoids some unnecessary copies in some cases. The idea is to explicitly call contiguous if you really need a contiguous tensor in the end.

However, the check for a non-contiguous squeezed tensor (1-dim) now reports it as contiguous. Does that mean we need to explicitly and immediately call the contiguous method whenever a tensor has been squeezed into 1-dim?

LaurentMazare commented 7 months ago

Sorry, the tensor is actually contiguous, so there should be no need to call contiguous on it; you can force a copy of it, though, if for some reason you need a more "packed" version.

LaurentMazare commented 7 months ago

Related error: #1992. The annoying bit is that the matmul kernels consider that the layout is not contiguous whereas it actually is; I'll have to make a couple of tweaks there and hopefully this will avoid the unnecessary copy.

guoqingbao commented 7 months ago

> Related error: #1992. The annoying bit is that the matmul kernels consider that the layout is not contiguous whereas it actually is; I'll have to make a couple of tweaks there and hopefully this will avoid the unnecessary copy.

Can we consider removing the "if dim > 1" condition, as it seems to be causing issues in specific scenarios? In my case, I need to convert every tensor into a contiguous form because our custom DSA does not support non-contiguous kernels, unlike CUDA. The Candle framework had been performing very well on our hardware until this condition was introduced (I found that some 1-dim tensors squeezed from non-contiguous 2-dim tensors were falsely regarded as contiguous).

The reason Candle has not yet encountered issues on GPU, in my view, is that each CUDA kernel implementation within Candle includes a corresponding non-contiguous version, and these non-contiguous versions do not rely on the "if dim > 1" condition (they utilize the stride), which makes the problem trivial on GPU.
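
For context, here is a rough Rust rendition (an illustrative sketch, not the actual kernel code) of the strided index mapping that those non-contiguous kernel paths rely on: the flat element index is decomposed along the dims and recomposed with the source strides, so the kernel works whether or not the layout is contiguous.

    fn get_strided_index(mut idx: usize, dims: &[usize], stride: &[usize]) -> usize {
        let mut offset = 0;
        for (&d, &s) in dims.iter().zip(stride.iter()).rev() {
            offset += (idx % d) * s;
            idx /= d;
        }
        offset
    }

    fn main() {
        // A transposed 2x3 view: dims [3, 2], stride [1, 3].
        let (dims, stride) = ([3usize, 2], [1usize, 3]);
        let offsets: Vec<usize> = (0..6).map(|i| get_strided_index(i, &dims, &stride)).collect();
        println!("{offsets:?}"); // [0, 3, 1, 4, 2, 5]
    }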

guoqingbao commented 7 months ago

I found more problems with the revised versions of tensor::squeeze and tensor::unsqueeze introduced in https://github.com/huggingface/candle/pull/1884

I'm not sure about the benefits of the current revision, but it causes problems in certain cases, especially when dealing with non-contiguous tensors.

    pub fn squeeze<D: Dim>(&self, dim: D) -> Result<Self> {
        // The PyTorch semantics are to return the same tensor if the target dimension
        // does not have a size of 1.
        let dims = self.dims();
        let dim = dim.to_index(self.shape(), "squeeze")?;
        if dims[dim] == 1 {
            let mut dims = dims.to_vec();
            let mut strides = self.stride().to_vec();
            dims.remove(dim);
            strides.remove(dim);
            let tensor_ = Tensor_ {
                id: TensorId::new(),
                storage: self.storage.clone(),
                layout: Layout::new(dims.into(), strides, self.layout.start_offset()),
                op: BackpropOp::new1(self, Op::Reshape),
                is_variable: false,
                dtype: self.dtype,
                device: self.device.clone(),
            };
            Ok(Tensor(Arc::new(tensor_)))
        } else {
            Ok(self.clone())
        }
    }

    pub fn unsqueeze<D: Dim>(&self, dim: D) -> Result<Self> {
        let mut dims = self.dims().to_vec();
        let mut strides = self.stride().to_vec();
        let dim = dim.to_index_plus_one(self.shape(), "unsqueeze")?;
        // Cannot panic because to_index_plus_one already checks dimensions
        dims.insert(dim, 1);
        // Any stride would work here, but we pick one so as to maximize the probability to remain
        // C contiguous.
        let stride = if dim < strides.len() { strides[dim] } else { 1 };
        strides.insert(dim, stride);
        let tensor_ = Tensor_ {
            id: TensorId::new(),
            storage: self.storage.clone(),
            layout: Layout::new(dims.into(), strides, self.layout.start_offset()),
            op: BackpropOp::new1(self, Op::Reshape),
            is_variable: false,
            dtype: self.dtype,
            device: self.device.clone(),
        };
        Ok(Tensor(Arc::new(tensor_)))
    }

The previous version works really well:

    pub fn squeeze<D: Dim>(&self, dim: D) -> Result<Self> {
        // The PyTorch semantics are to return the same tensor if the target dimension
        // does not have a size of 1.
        let dims = self.dims();
        let dim = dim.to_index(self.shape(), "squeeze")?;
        if dims[dim] == 1 {
            let mut dims = dims.to_vec();
            dims.remove(dim);
            self.reshape(dims)
        } else {
            Ok(self.clone())
        }
    }

    pub fn unsqueeze<D: Dim>(&self, dim: D) -> Result<Self> {
        let mut dims = self.dims().to_vec();
        let dim = dim.to_index_plus_one(self.shape(), "unsqueeze")?;
        // Cannot panic because to_index_plus_one already checks dimensions
        dims.insert(dim, 1);
        self.reshape(dims)
    }
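
To make the difference concrete, here is a hedged usage sketch (assuming the candle Tensor API; the shapes are illustrative): the revised squeeze simply drops the size-1 dim and its stride, so the result is a strided view sharing storage with the source, whereas the previous reshape-based version produced a fresh contiguous copy whenever the source was not flagged as contiguous.

    use candle_core::{Device, Tensor};

    fn main() -> candle_core::Result<()> {
        let dev = Device::Cpu;
        let t = Tensor::arange(0f32, 12., &dev)?.reshape((3, 4))?;
        let col = t.narrow(1, 0, 1)?; // shape [3, 1], stride [4, 1]
        let v = col.squeeze(1)?;      // revised version: shape [3], stride [4] (a strided view)
        // With the previous reshape-based squeeze this would instead be a contiguous copy.
        println!("dims: {:?}, contiguous: {}", v.dims(), v.is_contiguous());
        Ok(())
    }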
LaurentMazare commented 7 months ago

But the point is that these tensors are actually contiguous, so making a copy for them when calling contiguous() is not necessary and can be avoided. If dim is 1 then the index in that dimension is always 0 and so the stride should never be used. Is that not the case for your backend? Could you report similar issues with the CPU backend? To me the main issue is that the matmul operator was doing contiguity checks that were not in sync with the changes, but I patched those yesterday and it seemed all good after that. I would be ok with removing the change but would like to see some actual breakage before this, to check whether there is anything I'm missing (and I'll remove the check from the cuda side too as it shouldn't be used anyway).

LaurentMazare commented 7 months ago

Also, I relaxed the contiguity check in the same way in the function you mentioned on the cuda side in #2000; all the tests passed fine, as did the couple of models I've tried. Certainly keen to see a breakage example on cuda too after this change.

guoqingbao commented 7 months ago

> Also, I relaxed the contiguity check in the same way in the function you mentioned on the cuda side in #2000; all the tests passed fine, as did the couple of models I've tried. Certainly keen to see a breakage example on cuda too after this change.

In my example, the introduction of the "if dim > 1" condition for the contiguity check causes some generation problems, e.g., the generated content is slightly different from what is expected. Have you checked the output of the model?

The expected output (without the "if dim > 1" condition):

loaded the model in 2.834793587s Please talk about deep learning in 1000 words.

Deep Learning: A Transformative Technology

Deep learning (DL) is a subfield of machine learning (ML) that enables computers to learn from data without explicit programming. By mimicking the structure and function of the human brain, DL algorithms can discover patterns and relationships in data, enabling them to make accurate predictions and decisions.

Architecture of a Deep Learning Model

A typical deep learning model consists of multiple layers of interconnected nodes or neurons. Each layer receives input data, passes it through a series of transformations, and produces an output. The model can have different architectures, but they typically share these key components:

Types of Deep Learning Models

There are various types of DL models, each with its strengths and weaknesses:

Applications of Deep Learning

Deep learning has numerous applications in various domains, including:

Advantages of Deep Learning

Disadvantages of Deep Learning

Conclusion

Deep learning is a transformative technology that has revolutionized various industries. Its ability to learn from data without explicit programming has opened up new possibilities for problem-solving. While DL models can achieve high accuracy, they also have limitations that need to be considered. As deep learning continues to evolve, it will likely play an increasingly important role in shaping the future of technology.

The current output (with the "if dim > 1" condition):

loaded the model in 2.801835715s Please talk about deep learning in 1000 words. increably, the field of artificial intelligence (AI) has witnessed a surge in the use of deep learning (DL). DL is a subfield of machine learning (ML) that allows computers to learn from data without explicit programming.

Key Concepts of Deep Learning:

Applications of Deep Learning:

Benefits of Deep Learning:

Challenges of Deep Learning:

Future of Deep Learning:

Conclusion:

Deep learning is a transformative technology that has revolutionized various industries. Its ability to learn from data without explicit programming has opened up new possibilities for solving complex problems. While DL presents challenges such as data requirements, bias, and explainability, its future holds immense potential to further advance AI and unlock unprecedented advancements in various fields.

LaurentMazare commented 7 months ago

I checked the outputs for mistral and llama2 (7b), both quantized and unquantized, and didn't see a change. I'll check again, but maybe you have a way for me to reproduce the issue? Also there might be some numerical stability changes, but nothing large, as we might be using the transposed matmul kernels vs the original ones in some cases. It would be very helpful if you have some more unitary failures too; the whole test suite is passing fine but it can certainly be expanded.

guoqingbao commented 7 months ago

> I checked the outputs for mistral and llama2 (7b), both quantized and unquantized, and didn't see a change. I'll check again, but maybe you have a way for me to reproduce the issue? Also there might be some numerical stability changes, but nothing large, as we might be using the transposed matmul kernels vs the original ones in some cases. It would be very helpful if you have some more unitary failures too; the whole test suite is passing fine but it can certainly be expanded.

I found where the problem is:

For a given tensor like [13, 1], if we perform a candle to_dtype on its broadcast version (something like shape [1, 1, 13, 13] with a stride of [0, 0, 13, 1]), the previous contiguity check regards it as non-contiguous, triggering the broadcast operation in my backend (making it contiguous implicitly), while the current check treats it as a contiguous tensor because the size-1 dims (with stride 0) in [0, 0, 13, 1] are skipped, and this causes a problem on my backend. In comparison, the cast CUDA kernel in candle is able to deal with non-contiguous tensors and can cast them into contiguous ones with get_strided_index.
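
A hedged sketch of that scenario (assuming the candle Tensor API, and taking a plain [13, 13] tensor as the broadcast source, which is what yields the [1, 1, 13, 13] layout with stride [0, 0, 13, 1] described above):

    use candle_core::{DType, Device, Tensor};

    fn main() -> candle_core::Result<()> {
        let dev = Device::Cpu;
        // Broadcasting [13, 13] to [1, 1, 13, 13] keeps strides [13, 1] and adds
        // leading size-1 dims with stride 0.
        let mask = Tensor::ones((13, 13), DType::F32, &dev)?;
        let b = mask.broadcast_as((1, 1, 13, 13))?;
        // The relaxed check reports this layout as contiguous, so a backend without
        // strided kernels never gets a chance to materialize it before the cast.
        println!("stride: {:?}, contiguous: {}", b.stride(), b.is_contiguous());
        let _casted = b.to_dtype(DType::U32)?; // the cast runs on the broadcast view
        Ok(())
    }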

Similarly, the tensor::transpose op does not change the stride of the original tensor, and the transposed tensor was also treated as contiguous in the revised version, causing problems in my backend (and also triggering get_strided_index in the underlying CUDA kernels to make it contiguous).

It seems that the copy was not actually avoided even though the tensor was marked as contiguous (the underlying CUDA kernels do the dirty work using get_strided_index).

LaurentMazare commented 7 months ago

Thanks for investigating and for the details. I'm not sure I get the full picture of it, but to me it feels like the tensor layout you mention actually is contiguous in memory and so could be handled by contiguous kernels (so it would work well in cuda post #2000), so the issue might be some assumptions on your backend side, but maybe I'm missing something?

guoqingbao commented 7 months ago

> Thanks for investigating and for the details. I'm not sure I get the full picture of it, but to me it feels like the tensor layout you mention actually is contiguous in memory and so could be handled by contiguous kernels (so it would work well in cuda post #2000), so the issue might be some assumptions on your backend side, but maybe I'm missing something?

No, the tensor was not contiguous, as I stated earlier: "the tensor::transpose operation does not change the stride of the original tensor, and the transposed tensor was treated as contiguous." The transposed tensor (with dimensions 1 and 2 swapped) should be considered non-contiguous because its memory layout would need to be altered. The CUDA kernels implicitly made these transposed tensors contiguous, which is why no obvious issues were detected in the candle GPU backend. However, the correct logic dictates that such a transpose results in a non-contiguous tensor, yet it was labeled as contiguous in the recent revision. I suspect that if all CUDA kernel contiguity checks adhere to this revision, it will lead to problems.

LaurentMazare commented 7 months ago

A tensor with shape [1, 1, 13, 13] and stride [0, 0, 13, 1] is contiguous in memory with a C layout. I've removed the bit you mentioned in the contiguous detection in the cuda kernels in #2000 and tested a whole lot of models; all of them are running fine. Again, I'd be very happy to get an example of something that doesn't work properly on cpu or gpu. I've merged #2000 and am now running all the tests and models from our internal codebase, and so far it's all good (it will take a bit of time to run everything though).
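
A quick standalone check of that claim: enumerating every index of a [1, 1, 13, 13] tensor with stride [0, 0, 13, 1] in C order visits offsets 0..=168 exactly once and in order, i.e. the layout is row-major contiguous even though the size-1 dims carry a stride of 0.

    fn main() {
        let (dims, stride) = ([1usize, 1, 13, 13], [0usize, 0, 13, 1]);
        let mut offsets = Vec::new();
        for a in 0..dims[0] {
            for b in 0..dims[1] {
                for i in 0..dims[2] {
                    for j in 0..dims[3] {
                        offsets.push(a * stride[0] + b * stride[1] + i * stride[2] + j * stride[3]);
                    }
                }
            }
        }
        assert_eq!(offsets, (0..169).collect::<Vec<_>>());
        println!("row-major contiguous: offsets are 0..=168 in order");
    }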