ultralytics / yolov5

YOLOv5 🚀 in PyTorch > ONNX > CoreML > TFLite
https://docs.ultralytics.com
GNU Affero General Public License v3.0

Get last_hidden_state of a prediction in YOLO #11719

Closed Alberto1404 closed 1 year ago

Alberto1404 commented 1 year ago

Search before asking

Question

Hello there. I am attempting to get the last hidden state of a prediction, that is, the equivalent of what is done in HuggingFace when doing inference on an image as follows:

new_batch = {"pixel_values": image_batch_transformed.to(device)} # Apply torchvision.transforms to read PIL image and convert to Tensor
with torch.no_grad():
    embeddings = model(**new_batch).last_hidden_state[:, 0]

I read the model as follows:

model = torch.hub.load('ultralytics/yolov5', 'custom', path = './best.pt')

No matter whether I instantiate the model as an AutoShape object or a DetectMultiBackend object, I am unable to do the equivalent. Any help would be highly appreciated. Thank you

Additional

No response

glenn-jocher commented 1 year ago

@Alberto1404 hi there! Thanks for reaching out. To get the last hidden state of a prediction in YOLOv5, you can use the following approach:

import torch

model = torch.hub.load('ultralytics/yolov5', 'custom', path='./best.pt')

with torch.no_grad():
    output = model(model.model.model[0](image_tensor))  # Perform inference on the image
    last_hidden_state = model.model.model[-1].forward(output.squeeze())
    # last_hidden_state is now the equivalent of HuggingFace's last_hidden_state[:, 0] in your example

Please note that model.model.model[-1] represents the last layer of the model, and output.squeeze() removes any extra dimensions from the output tensor.
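
If the attribute chain is unclear, here is a quick way to inspect the wrapper hierarchy and list the indexable layers (a minimal sketch; the exact nesting depth depends on how the model was loaded, e.g. AutoShape wraps DetectMultiBackend, which wraps DetectionModel):

import torch

model = torch.hub.load('ultralytics/yolov5', 'yolov5s')  # AutoShape wrapper by default

# Walk down the .model attributes until we reach the nn.Sequential of YOLO layers
m = model
while hasattr(m, 'model'):
    print(type(m).__name__)
    m = m.model
for i, layer in enumerate(m):
    print(i, layer.__class__.__name__)  # indices usable for subscripting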

Let me know if there's anything else I can assist you with.

Alberto1404 commented 1 year ago

@glenn-jocher thank you for your fast reply. Nevertheless, it still does not work for me. Let's reproduce it with a pretrained model:

(yolo) alberto@sr-arg:~/Repo/image-similarity$ python
Python 3.9.16 (main, Mar  8 2023, 14:00:05) 
[GCC 11.2.0] :: Anaconda, Inc. on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import torch
>>> from PIL import Image
>>> model = torch.hub.load('ultralytics/yolov5', 'yolov5x', device='cpu')
Using cache found in /home/alberto/.cache/torch/hub/ultralytics_yolov5_master
YOLOv5 🚀 2023-4-25 Python-3.9.16 torch-2.0.1+cu117 CPU

Fusing layers... 
YOLOv5x summary: 444 layers, 86705005 parameters, 0 gradients
Adding AutoShape... 
>>> from torchvision import transforms as T
>>> ima = Image.open('/home/alberto/Documents/s3_aws_/batch_ptz_pruebas/1682677336926260624ptz000897.jpeg')
>>> tr = T.Compose(
...     [
...             T.Resize((640,640)),
...             T.ToTensor()
...     ]
... )
>>> ima_tensor = tr(ima)
>>> ima_tensor.shape
torch.Size([3, 640, 640])
>>> with torch.no_grad():
...     output = model(model.model.model[0](image_tensor)) # <-- AUTOSHAPE TRUE
...     last_hidden_state = model.model.model[-1].forward(output.squeeze())
... 
Traceback (most recent call last):
  File "<stdin>", line 2, in <module>
TypeError: 'DetectionModel' object is not subscriptable  # ERROR DURING INFERENCE
>>> 
>>> model = torch.hub.load('ultralytics/yolov5', 'yolov5x', device='cpu', autoshape = False) # AUTOSHAPE FALSE (MULTIBACKEND OBJECT)
Using cache found in /home/alberto/.cache/torch/hub/ultralytics_yolov5_master
YOLOv5 🚀 2023-4-25 Python-3.9.16 torch-2.0.1+cu117 CPU
>>> with torch.no_grad():
...     output = model(model.model.model[0](ima_tensor))
...     last_hidden_state = model.model.model[-1].forward(output.squeeze())
... 
Traceback (most recent call last):
  File "<stdin>", line 2, in <module>
  File "/home/alberto/anaconda3/envs/yolo/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/alberto/.cache/torch/hub/ultralytics_yolov5_master/models/common.py", line 56, in forward
    return self.act(self.bn(self.conv(x)))
  File "/home/alberto/anaconda3/envs/yolo/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/alberto/anaconda3/envs/yolo/lib/python3.9/site-packages/torch/nn/modules/batchnorm.py", line 138, in forward
    self._check_input_dim(input)
  File "/home/alberto/anaconda3/envs/yolo/lib/python3.9/site-packages/torch/nn/modules/batchnorm.py", line 410, in _check_input_dim
    raise ValueError("expected 4D input (got {}D input)".format(input.dim()))
ValueError: expected 4D input (got 3D input) # IMA TENSOR MUST INCLUDE BATCH DIMENSION (B,C,H,W)
>>> with torch.no_grad():
...     output = model(model.model.model[0](ima_tensor.unsqueeze(0)))
...     last_hidden_state = model.model.model[-1].forward(output.squeeze())
... 
Traceback (most recent call last):
  File "<stdin>", line 2, in <module>
  File "/home/alberto/anaconda3/envs/yolo/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/alberto/.cache/torch/hub/ultralytics_yolov5_master/models/common.py", line 514, in forward
    y = self.model(im, augment=augment, visualize=visualize) if augment or visualize else self.model(im)
  File "/home/alberto/anaconda3/envs/yolo/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/alberto/.cache/torch/hub/ultralytics_yolov5_master/models/yolo.py", line 209, in forward
    return self._forward_once(x, profile, visualize)  # single-scale inference, train
  File "/home/alberto/.cache/torch/hub/ultralytics_yolov5_master/models/yolo.py", line 121, in _forward_once
    x = m(x)  # run
  File "/home/alberto/anaconda3/envs/yolo/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/alberto/.cache/torch/hub/ultralytics_yolov5_master/models/common.py", line 56, in forward
    return self.act(self.bn(self.conv(x)))
  File "/home/alberto/anaconda3/envs/yolo/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/alberto/anaconda3/envs/yolo/lib/python3.9/site-packages/torch/nn/modules/conv.py", line 463, in forward
    return self._conv_forward(input, self.weight, self.bias)
  File "/home/alberto/anaconda3/envs/yolo/lib/python3.9/site-packages/torch/nn/modules/conv.py", line 459, in _conv_forward
    return F.conv2d(input, weight, bias, self.stride,
RuntimeError: Given groups=1, weight of size [80, 3, 6, 6], expected input[1, 80, 320, 320] to have 3 channels, but got 80 channels instead

github-actions[bot] commented 1 year ago

👋 Hello there! We wanted to give you a friendly reminder that this issue has not had any recent activity and may be closed soon, but don't worry - you can always reopen it if needed. If you still have any questions or concerns, please feel free to let us know how we can help.

Feel free to inform us of any other issues you discover or feature requests that come to mind in the future. Pull Requests (PRs) are also always welcomed!

Thank you for your contributions to YOLO 🚀 and Vision AI ⭐

glenn-jocher commented 12 months ago

@Alberto1404 Two separate problems are visible in your tracebacks. First, when a raw tensor is passed, the model expects a batch dimension, hence the "expected 4D input (got 3D input)" error. Second, pre-applying model.model.model[0] and then calling the full model runs the first layer twice: that first conv outputs 80 channels, while the model's first layer expects a 3-channel image, which produces the channel-mismatch RuntimeError.

Instead, pass the batched image tensor directly to the model, like this:

with torch.no_grad():
    output = model(ima_tensor.unsqueeze(0))  # Add batch dimension
    last_hidden_state = model.model.model[-1].forward(output.squeeze())

This modification should resolve the "expected 4D input" error.

Let me know if you encounter any further issues.

wentao-uw commented 8 months ago

with torch.no_grad():
    output = model(ima_tensor.unsqueeze(0))  # Add batch dimension
    last_hidden_state = model.model.model[-1].forward(output.squeeze())

I got this error:

---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
Cell In[33], line 12
     10 with torch.no_grad():
     11     output = model(input_tensor_1.unsqueeze(0))  # Add batch dimension
---> 12     last_hidden_state = model.model.model[-1].forward(output.squeeze())

AttributeError: 'list' object has no attribute 'squeeze'

glenn-jocher commented 8 months ago

@wentao-uw I apologize for the confusion earlier. It seems there was a misunderstanding in accessing the model's components and handling its outputs. YOLOv5's architecture and output processing are quite different from models like those in HuggingFace, which directly provide embeddings or hidden states.

For YOLOv5, the model outputs are primarily designed for object detection tasks, providing bounding boxes, object classes, and confidence scores rather than embeddings or hidden states directly usable for other tasks.
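
For example, with the standard AutoShape hub model, inference returns a Detections object rather than a hidden-state tensor (a small sketch using the stock demo image):

import torch

model = torch.hub.load('ultralytics/yolov5', 'yolov5s')
results = model('https://ultralytics.com/images/zidane.jpg')
print(results.xyxy[0])  # per-image tensor of [x1, y1, x2, y2, confidence, class]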

If you're looking to work with the internal states or features of the YOLOv5 model, you might need to modify the model's source code or access specific layers directly for your use case. However, this requires a deep understanding of the model's architecture and might not directly provide a "last_hidden_state" as you would expect from models designed for NLP tasks.

For advanced customization or accessing specific layers' outputs, you might consider directly working with the model's source code or consulting the documentation for more insights on the model's architecture. Unfortunately, due to the complexity and the specific nature of your request, providing a simple code snippet might not be feasible without a clear understanding of what you aim to achieve with the "last hidden state" in the context of an object detection model like YOLOv5.

If your goal is to extract features for image similarity or another task not directly related to object detection, you might need to look into feature extraction techniques or models specifically designed for such purposes.

Again, I apologize for any confusion and hope this clarifies the situation.

wentao-uw commented 8 months ago

Hi, thank you for your clarifications. I'm currently addressing an issue where the detection model fails to identify out-of-distribution samples. To illustrate: I've developed a model to identify documents (such as driver licenses, passports, etc.) from different countries. However, when presented with a fake document, the model always assigns it a label. My goal is to identify out-of-distribution samples while still classifying in-distribution samples correctly.

To achieve this, I'm considering using the features extracted by YOLO and feeding them into KNN or other post-hoc models designed to recognize out-of-distribution instances.

Here is an example that uses resnet18 to extract features and KNN to recognize out-of-distribution instances; what I want to do is replace the resnet18 with yolov5: https://blog.munhou.com/2022/12/01/Detecting%20Out-of-Distribution%20Samples%20with%20Knn/
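
To make the KNN step concrete, here is a minimal sketch of the post-hoc scoring I have in mind (random placeholder features stand in for whatever the YOLOv5 extraction would produce; sklearn does the neighbor search):

import numpy as np
from sklearn.neighbors import NearestNeighbors

# Placeholder features; in practice these come from the feature extractor,
# flattened and L2-normalized as in the linked blog post
train_feats = np.random.rand(1000, 512).astype(np.float32)  # in-distribution
test_feats = np.random.rand(10, 512).astype(np.float32)     # samples to score

knn = NearestNeighbors(n_neighbors=5).fit(train_feats)
dists, _ = knn.kneighbors(test_feats)
ood_score = dists[:, -1]  # distance to the 5th nearest neighbor
print(ood_score)          # larger score = more likely out-of-distribution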

glenn-jocher commented 8 months ago

@wentao-uw I understand your goal now, and it's a fascinating application of YOLOv5 for identifying out-of-distribution samples, such as distinguishing genuine documents from fake ones. To replace ResNet18 with YOLOv5 for feature extraction, you'll need to access intermediate layers of the YOLOv5 model, as these can provide the rich feature representations you're looking for.

Here's a simplified approach to extract features from an intermediate layer of YOLOv5. This example assumes you're interested in features from one of the later layers, but you can adjust the layer index based on your needs:

import torch

# Load the YOLOv5 model
model = torch.hub.load('ultralytics/yolov5', 'yolov5s', pretrained=True)  # Example with yolov5s model

# Function to extract features from a specific layer
def get_features_from_layer(model, image_tensor, layer_idx=-2):
    """
    Extract features from a specific layer.

    Parameters:
    - model: The loaded YOLOv5 model.
    - image_tensor: The input image tensor with shape [C, H, W]; a batch dimension is added inside this function.
    - layer_idx: Index of the layer from which to extract features. Default is -2, the second to last layer.

    Returns:
    - features: The extracted features from the specified layer.
    """
    # Ensure model is in evaluation mode
    model.eval()

    # Hook to capture the output of the specified layer
    features = []
    def hook_fn(module, input, output):
        features.append(output)

    # Register the hook to the desired layer
    handle = model.model.model[layer_idx].register_forward_hook(hook_fn)

    # Perform a forward pass to get the features
    with torch.no_grad():
        _ = model(image_tensor.unsqueeze(0))  # add the batch dimension

    # Remove the hook
    handle.remove()

    # Return the features
    return features[0]

# Example usage
image_tensor = torch.rand((3, 640, 640))  # Example image tensor
features = get_features_from_layer(model, image_tensor)
print(features.shape)  # Print the shape of the extracted features

This code snippet demonstrates how to extract features from a specific layer of the YOLOv5 model. You can adjust layer_idx to target different layers depending on which features you find most useful for your out-of-distribution detection task.

Once you have these features, you can feed them into a KNN classifier or any other model you choose to distinguish between in-distribution and out-of-distribution samples. Remember, the choice of layer and the subsequent model for out-of-distribution detection might require some experimentation to optimize performance for your specific use case.
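
One practical detail: the hooked output of a convolutional layer is a 4D feature map, so you will likely want to pool and flatten it into a vector before the KNN step. A small sketch (the random tensor stands in for the features captured above):

import torch
import torch.nn.functional as F

features = torch.rand(1, 512, 20, 20)  # stand-in for a hooked [B, C, H, W] feature map
pooled = F.adaptive_avg_pool2d(features, 1).flatten(1)  # global average pool -> [B, C]
pooled = F.normalize(pooled, dim=1)  # L2-normalize so Euclidean KNN behaves like cosine
print(pooled.shape)  # torch.Size([1, 512])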

I hope this helps you move forward with your project. If you have further questions or need more assistance, feel free to ask.

wentao-uw commented 7 months ago

# load the model and image
import PIL.Image
from torchvision import transforms
from ultralytics import YOLO

image_1 = PIL.Image.open('image_path')
transform = transforms.Compose([
            transforms.Resize((320, 320)),
            transforms.ToTensor()
        ])
image_tensor = transform(image_1)
model_path = "yolo_v8_model_path"
model = YOLO(model_path)

When I run the code multiple times in an environment with the GPU version of torch,

res = get_features_from_layer(model, image_tensor, layer_idx=-2)

the result from the first run is always different from the result of the second run. However, running the same code in an environment with the CPU version of torch gives identical results.

glenn-jocher commented 7 months ago

@wentao-uw hello! It looks like you're observing different outputs from the same input when using a GPU environment compared to a CPU environment. This behavior can occur due to several factors specific to how GPUs handle floating-point operations, which can lead to non-deterministic results in some scenarios.
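
As a tiny illustration of why this happens, floating-point addition is not associative, so summing the same numbers in a different order can give a slightly different result:

import torch

x = torch.rand(1_000_000)
print((x.sum() - x.flip(0).sum()).item())  # usually a small non-zero difference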

To ensure consistency across runs in a GPU environment, you could try setting the model and its tensors to deterministic mode by adding these lines at the beginning of your script:

torch.backends.cudnn.deterministic = True
torch.backends.cudnn.benchmark = False

These lines configure PyTorch to use deterministic algorithms where possible. Note, however, that this might impact performance due to the loss of optimization opportunities.
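
If you want to go further, PyTorch also offers a stricter global switch that raises an error whenever an operation without a deterministic implementation is used (optional, and separate from the cuDNN flags above):

import torch

# Raises a RuntimeError if a non-deterministic op runs; on CUDA you may also
# need to set the CUBLAS_WORKSPACE_CONFIG environment variable
torch.use_deterministic_algorithms(True)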

Additionally, ensure that you are using the same device (CPU or GPU) for both the model and the tensor:

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
image_tensor = image_tensor.to(device)
model = model.to(device)

This ensures that both the input tensor and the model are on the same device, avoiding unnecessary data movement that could affect results.

Let me know if this helps or if you have any more questions. Happy coding! 🚀

wentao-uw commented 7 months ago

Hello, I put these lines at the beginning of my script:

# part-1
import torch
import PIL.Image
from torchvision import transforms
from ultralytics import YOLO

torch.backends.cudnn.deterministic = True
torch.backends.cudnn.benchmark = False

# load the model and image
image_1 = PIL.Image.open('image_path')
transform = transforms.Compose([
            transforms.Resize((320, 320)),
            transforms.ToTensor()
        ])
image_tensor = transform(image_1)
model_path = "yolo_v8_model_path"
model = YOLO(model_path)

#  make sure on the same device
device = torch.device('cuda:0')
model = model.to(device)
image_tensor = image_tensor.to(device)
# part-2
# extract features
res = get_features_from_layer(model, image_tensor, layer_idx=-2)
res

I run the part-1 code once and the part-2 code twice, and get different results; however, I receive the same results when setting device = torch.device('cpu').

glenn-jocher commented 7 months ago

@wentao-uw hello there! 👋 It seems like the issue you're encountering with getting different results on GPU despite setting torch.backends.cudnn.deterministic = True and torch.backends.cudnn.benchmark = False is quite peculiar. Generally, these settings should help ensure deterministic results on the GPU, similar to the CPU.

To troubleshoot, ensure that all random number generators are also set to a fixed seed right after setting determinism flags for both PyTorch and CUDA. This can help in making experiments reproducible, especially when randomness is involved in the model initialization, data augmentation, or any other process:

import torch
import random
import numpy as np

# Set seeds for reproducibility
seed = 42
torch.manual_seed(seed)
torch.cuda.manual_seed(seed)
torch.cuda.manual_seed_all(seed)  # if you are using multi-GPU.
random.seed(seed)
np.random.seed(seed)
# Ensure determinism
torch.backends.cudnn.deterministic = True
torch.backends.cudnn.benchmark = False

Remember, while these settings can help achieve deterministic results, they might come at the cost of reduced performance due to the restriction of certain optimizations.

If after setting the seeds as suggested you still face inconsistencies, double-check that all inputs to the model during the feature extraction (res = get_features_from_layer(model, image_tensor, layer_idx=-2)) are strictly the same and that there's no underlying randomness in the model behavior itself.
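
As a quick sanity check, you can compare two back-to-back extractions directly (assuming the get_features_from_layer helper, model, and image_tensor from earlier in this thread):

import torch

res1 = get_features_from_layer(model, image_tensor, layer_idx=-2)
res2 = get_features_from_layer(model, image_tensor, layer_idx=-2)
print(torch.equal(res1, res2))           # True only if extraction is fully deterministic
print((res1 - res2).abs().max().item())  # magnitude of any discrepancy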

I hope this helps! Feel free to reach back if there's anything else we can assist you with. Happy coding! 😊