huggingface / transformers

🤗 Transformers: State-of-the-art Machine Learning for Pytorch, TensorFlow, and JAX.
https://huggingface.co/transformers
Apache License 2.0

Inefficient Tensor Conversion in convert_to_tensors #33855

Closed akshit397a closed 2 weeks ago

akshit397a commented 1 month ago

System Info

systeminformation version: 5.23.5

Operating System: Microsoft Windows 11 Home Single Language, release 10.0.22631 (build 22631, kernel 10.0.22631), x64, hostname DESKTOP-FFR0VG0, codepage 437, hypervisor: true

System: Dell Inc. Inspiron 15 3525, version 1.19.0

CPU: AMD Ryzen 5 5500U with Radeon Graphics (family 23, model 104, stepping 1), 2.1 GHz, 6 physical cores / 12 logical cores, 1 processor

Who can help?

No response

Information

Tasks

Reproduction

import copy
import json
import os
import warnings
from collections import UserDict
from typing import TYPE_CHECKING, Any, Dict, Optional, Tuple, Union

import numpy as np

from .dynamic_module_utils import custom_object_save
from .utils import (
    FEATURE_EXTRACTOR_NAME,
    PushToHubMixin,
    TensorType,
    add_model_info_to_auto_map,
    add_model_info_to_custom_pipelines,
    cached_file,
    copy_func,
    download_url,
    is_flax_available,
    is_jax_tensor,
    is_numpy_array,
    is_offline_mode,
    is_remote_url,
    is_tf_available,
    is_torch_available,
    is_torch_device,
    is_torch_dtype,
    logging,
    requires_backends,
)

if TYPE_CHECKING:
    if is_torch_available():
        import torch  # noqa

logger = logging.get_logger(__name__)

PreTrainedFeatureExtractor = Union["SequenceFeatureExtractor"] # noqa: F821

class BatchFeature(UserDict):
r"""
Holds the output of the [`~SequenceFeatureExtractor.pad`] and feature extractor specific `__call__` methods.

This class is derived from a python dictionary and can be used as a dictionary.

Args:
    data (dict, *optional*):
        Dictionary of lists/arrays/tensors returned by the __call__/pad methods ('input_values', 'attention_mask',
        etc.).
    tensor_type (Union[None, str, TensorType], *optional*):
        You can give a tensor_type here to convert the lists of integers in PyTorch/TensorFlow/Numpy Tensors at
        initialization.
"""

def __init__(self, data: Optional[Dict[str, Any]] = None, tensor_type: Union[None, str, TensorType] = None):
    super().__init__(data)
    self.convert_to_tensors(tensor_type=tensor_type)

def __getitem__(self, item: str) -> Union[Any]:
    """
    If the key is a string, returns the value of the dict associated to key ('input_values', 'attention_mask',
    etc.).
    """
    if isinstance(item, str):
        return self.data[item]
    else:
        raise KeyError("Indexing with integers is not available when using Python based feature extractors")

def __getattr__(self, item: str):
    try:
        return self.data[item]
    except KeyError:
        raise AttributeError

def __getstate__(self):
    return {"data": self.data}

def __setstate__(self, state):
    if "data" in state:
        self.data = state["data"]

# Copied from transformers.tokenization_utils_base.BatchEncoding.keys
def keys(self):
    return self.data.keys()

# Copied from transformers.tokenization_utils_base.BatchEncoding.values
def values(self):
    return self.data.values()

# Copied from transformers.tokenization_utils_base.BatchEncoding.items
def items(self):
    return self.data.items()

def _get_is_as_tensor_fns(self, tensor_type: Optional[Union[str, TensorType]] = None):
    if tensor_type is None:
        return None, None

    # Convert to TensorType
    if not isinstance(tensor_type, TensorType):
        tensor_type = TensorType(tensor_type)

    # Get a function reference for the correct framework
    if tensor_type == TensorType.TENSORFLOW:
        if not is_tf_available():
            raise ImportError(
                "Unable to convert output to TensorFlow tensors format, TensorFlow is not installed."
            )
        import tensorflow as tf

        as_tensor = tf.constant
        is_tensor = tf.is_tensor
    elif tensor_type == TensorType.PYTORCH:
        if not is_torch_available():
            raise ImportError("Unable to convert output to PyTorch tensors format, PyTorch is not installed.")
        import torch  # noqa

        def as_tensor(value):
            if isinstance(value, (list, tuple)) and len(value) > 0:
                if isinstance(value[0], np.ndarray):
                    value = np.array(value)
                elif (
                    isinstance(value[0], (list, tuple))
                    and len(value[0]) > 0
                    and isinstance(value[0][0], np.ndarray)
                ):
                    value = np.array(value)
            if isinstance(value, np.ndarray):
                return torch.from_numpy(value)
            else:
                return torch.tensor(value)

        is_tensor = torch.is_tensor
    elif tensor_type == TensorType.JAX:
        if not is_flax_available():
            raise ImportError("Unable to convert output to JAX tensors format, JAX is not installed.")
        import jax.numpy as jnp  # noqa: F811

        as_tensor = jnp.array
        is_tensor = is_jax_tensor
    else:

        def as_tensor(value, dtype=None):
            if isinstance(value, (list, tuple)) and isinstance(value[0], (list, tuple, np.ndarray)):
                value_lens = [len(val) for val in value]
                if len(set(value_lens)) > 1 and dtype is None:
                    # we have a ragged list so handle explicitly
                    value = as_tensor([np.asarray(val) for val in value], dtype=object)
            return np.asarray(value, dtype=dtype)

        is_tensor = is_numpy_array
    return is_tensor, as_tensor

def convert_to_tensors(self, tensor_type: Optional[Union[str, TensorType]] = None):
    """
    Convert the inner content to tensors.

    Args:
        tensor_type (str or [~utils.TensorType], *optional*):
            The type of tensors to use. If str, should be one of the values of the enum [~utils.TensorType]. If
            None, no modification is done.
    """
    if tensor_type is None:
        return self

    is_tensor, as_tensor = self._get_is_as_tensor_fns(tensor_type)

    # Do the tensor conversion in batch
    for key, value in self.items():
        try:
            if not is_tensor(value):
                tensor = as_tensor(value)

                self[key] = tensor
        except:  # noqa E722
            if key == "overflowing_values":
                raise ValueError("Unable to create tensor returning overflowing values of different lengths. ")
            raise ValueError(
                "Unable to create tensor, you should probably activate padding "
                "with 'padding=True' to have batched tensors with the same length."
            )

    return self

def to(self, *args, **kwargs) -> "BatchFeature":
    """
    Send all values to device by calling v.to(*args, **kwargs) (PyTorch only). This should support casting in
    different dtypes and sending the BatchFeature to a different device.

    Args:
        args (Tuple):
            Will be passed to the to(...) function of the tensors.
        kwargs (Dict, *optional*):
            Will be passed to the to(...) function of the tensors.

    Returns:
        [BatchFeature]: The same instance after modification.
    """
    requires_backends(self, ["torch"])
    import torch  # noqa

    new_data = {}
    device = kwargs.get("device")
    # Check if the args are a device or a dtype
    if device is None and len(args) > 0:
        # device should be always the first argument
        arg = args[0]
        if is_torch_dtype(arg):
            # The first argument is a dtype
            pass
        elif isinstance(arg, str) or is_torch_device(arg) or isinstance(arg, int):
            device = arg
        else:
            # it's something else
            raise ValueError(f"Attempting to cast a BatchFeature to type {str(arg)}. This is not supported.")
    # We cast only floating point tensors to avoid issues with tokenizers casting LongTensor to FloatTensor
    for k, v in self.items():
        # check if v is a floating point
        if torch.is_floating_point(v):
            # cast and send to device
            new_data[k] = v.to(*args, **kwargs)
        elif device is not None:
            new_data[k] = v.to(device=device)
        else:
            new_data[k] = v
    self.data = new_data
    return self

class FeatureExtractionMixin(PushToHubMixin):
"""
This is a feature extraction mixin used to provide saving/loading functionality for sequential and image feature
extractors.
"""

_auto_class = None

def __init__(self, **kwargs):
    """Set elements of kwargs as attributes."""
    # Pop "processor_class" as it should be saved as private attribute
    self._processor_class = kwargs.pop("processor_class", None)
    # Additional attributes without default values
    for key, value in kwargs.items():
        try:
            setattr(self, key, value)
        except AttributeError as err:
            logger.error(f"Can't set {key} with value {value} for {self}")
            raise err

def _set_processor_class(self, processor_class: str):
    """Sets processor class as an attribute."""
    self._processor_class = processor_class

@classmethod
def from_pretrained(
    cls,
    pretrained_model_name_or_path: Union[str, os.PathLike],
    cache_dir: Optional[Union[str, os.PathLike]] = None,
    force_download: bool = False,
    local_files_only: bool = False,
    token: Optional[Union[str, bool]] = None,
    revision: str = "main",
    **kwargs,
):
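For reference, here is a minimal, hypothetical snippet showing how constructing a BatchFeature triggers the conversion path under discussion (the data values are made up for illustration):

from transformers import BatchFeature

# Passing tensor_type to the constructor immediately calls
# convert_to_tensors on every key/value pair in data.
features = BatchFeature(
    data={"input_values": [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]]},
    tensor_type="np",  # or "pt"/"tf"/"jax" if the framework is installed
)
print(type(features["input_values"]))  # <class 'numpy.ndarray'>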

Expected behavior

Actual behavior

Redundant checks: for every element in the dictionary, the loop body calls is_tensor(value) and, for non-tensor values, as_tensor(value). When the dictionary holds many entries, these function lookups are repeated on every iteration of the loop.

Slower performance: as the loop grows (with more items to process), the repeated lookups for is_tensor and as_tensor can add noticeable overhead, especially for large datasets or frequent calls to this function.
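As a rough illustration of the kind of overhead being described, here is a standalone micro-benchmark sketch (not transformers code; it only contrasts a per-iteration attribute lookup with a lookup hoisted into a local variable):

import timeit

import numpy as np

data = {f"key_{i}": [1.0, 2.0, 3.0] for i in range(10_000)}

def convert_repeated_lookup():
    # np.asarray is resolved through the module attribute on every iteration.
    return {k: np.asarray(v) for k, v in data.items()}

def convert_hoisted_lookup():
    # The conversion function is bound to a local name once, before the loop.
    as_tensor = np.asarray
    return {k: as_tensor(v) for k, v in data.items()}

print("repeated:", timeit.timeit(convert_repeated_lookup, number=100))
print("hoisted: ", timeit.timeit(convert_hoisted_lookup, number=100))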

Expected behavior

The functions is_tensor (to check whether an item is already a tensor) and as_tensor (to convert an item into a tensor) are retrieved once, at the start of the function, and reused throughout the loop. This eliminates redundant lookups inside the loop, reducing overhead and improving performance. The conversion still occurs for each non-tensor value, but with less function-call overhead.

For each key-value pair in the dictionary, if the value is not already a tensor, the function attempts to convert it using as_tensor. If conversion fails (e.g., because the value is not convertible to a tensor), it raises a ValueError with a descriptive message that includes the problematic key.
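A minimal sketch of the pattern described above, reusing the existing _get_is_as_tensor_fns helper (illustrative only; an actual PR may differ in details such as error chaining and the exact messages):

def convert_to_tensors(self, tensor_type: Optional[Union[str, TensorType]] = None):
    if tensor_type is None:
        return self

    # Retrieve both functions once, before the loop, so every iteration
    # reuses the same local references instead of repeating the lookup.
    is_tensor, as_tensor = self._get_is_as_tensor_fns(tensor_type)

    for key, value in self.items():
        if is_tensor(value):
            continue
        try:
            self[key] = as_tensor(value)
        except Exception as exc:
            if key == "overflowing_values":
                raise ValueError(
                    "Unable to create tensor returning overflowing values of different lengths."
                ) from exc
            raise ValueError(
                f"Unable to create tensor for key '{key}'; you should probably activate "
                "padding with 'padding=True' to have batched tensors with the same length."
            ) from exc

    return self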

LysandreJik commented 1 month ago

Hey @akshit397a, do you have a proposal for how to do it better? Feel free to open a PR doing so

akshit397a commented 1 month ago

Hey @LysandreJik, I will open a PR soon after verifying things.

LysandreJik commented 1 month ago

Thank you!

github-actions[bot] commented 3 weeks ago

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.