akshit397a closed this issue 2 weeks ago
Hey @akshit397a, do you have a proposal of how to do it better? Feel free to open a PR doing so
Hey @LysandreJik, I will open a PR soon after verifying things.
Thank you!
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.
Please note that issues that do not follow the contributing guidelines are likely to be ignored.
System Info
SYSTEMINFORMATION Version: 5.23.5
Operating System:
  Platform: Windows
  Distro: Microsoft Windows 11 Home Single Language
  Release: 10.0.22631
  Codename:
  Kernel: 10.0.22631
  Arch: x64
  Hostname: DESKTOP-FFR0VG0
  Codepage: 437
  Build: 22631
  Hypervisor: true
  RemoteSession:
System:
  Manufacturer: Dell Inc.
  Model: Inspiron 15 3525
  Version: 1.19.0
  Virtual:
CPU:
  Manufacturer: AMD
  Brand: Ryzen 5 5500U with Radeon Graphics
  Family: 23
  Model: 104
  Stepping: 1
  Speed: 2.1
  Cores: 12
  PhysicalCores: 6
  PerformanceCores: 12
  EfficiencyCores:
  Processors: 1
  Socket: None
Who can help?
No response
Information
Tasks
examples folder (such as GLUE/SQuAD, ...)
Reproduction
    import copy
    import json
    import os
    import warnings
    from collections import UserDict
    from typing import TYPE_CHECKING, Any, Dict, Optional, Tuple, Union

    import numpy as np

    from .dynamic_module_utils import custom_object_save
    from .utils import (
        FEATURE_EXTRACTOR_NAME,
        PushToHubMixin,
        TensorType,
        add_model_info_to_auto_map,
        add_model_info_to_custom_pipelines,
        cached_file,
        copy_func,
        download_url,
        is_flax_available,
        is_jax_tensor,
        is_numpy_array,
        is_offline_mode,
        is_remote_url,
        is_tf_available,
        is_torch_available,
        is_torch_device,
        is_torch_dtype,
        logging,
        requires_backends,
    )

    if TYPE_CHECKING:
        if is_torch_available():
            import torch  # noqa

    logger = logging.get_logger(__name__)

    PreTrainedFeatureExtractor = Union["SequenceFeatureExtractor"]  # noqa: F821

    class BatchFeature(UserDict):
        r"""
        Holds the output of the [`~SequenceFeatureExtractor.pad`] and feature extractor specific `__call__` methods.
        """

    class FeatureExtractionMixin(PushToHubMixin):
        """
        This is a feature extraction mixin used to provide saving/loading functionality for sequential and image
        feature extractors.
        """
Expected behavior
Actual Behavior
Redundant Checks:
For every element in the dictionary, the loop looks up and calls self._is_tensor(value). On a large batch, that attribute lookup is repeated on every iteration even though it always resolves to the same function. Similarly, self._as_tensor(value) is looked up inside the loop each time a value needs converting.
Slower Performance:
As the number of items grows, these repeated lookups of is_tensor and as_tensor add overhead, which can become noticeable for large datasets or frequent calls to this function.
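To make the repeated lookups described above concrete, here is a minimal sketch that counts them. CountingBatch, count_per_iteration and count_hoisted are hypothetical names invented for this illustration; they are not part of transformers, and the sketch only counts lookups rather than measuring time, so it says nothing about how large the saving is in practice.

```python
class CountingBatch(dict):
    """Hypothetical container that counts instance lookups of `_is_tensor`."""

    lookups = 0  # class-level counter, reset by the caller between runs

    def __getattribute__(self, name):
        # Record every attribute lookup of `_is_tensor` on an instance.
        if name == "_is_tensor":
            CountingBatch.lookups += 1
        return super().__getattribute__(name)

    def _is_tensor(self, value):
        # Stand-in check: treat tuples as "already converted".
        return isinstance(value, tuple)

    def count_per_iteration(self):
        # `self._is_tensor` is looked up again on every pass through the loop.
        total = 0
        for value in self.values():
            if self._is_tensor(value):
                total += 1
        return total

    def count_hoisted(self):
        # The bound method is looked up once and reused inside the loop.
        is_tensor = self._is_tensor
        total = 0
        for value in self.values():
            if is_tensor(value):
                total += 1
        return total
```

With 100 items, the first method performs 100 lookups of _is_tensor and the second performs one; both return the same result.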
Expected behaviour
The functions is_tensor (which checks whether an item is already a tensor) and as_tensor (which converts an item into a tensor) are retrieved once, at the start of the function, and reused throughout the loop. This removes the redundant lookups inside the loop and reduces overhead. Each non-tensor value is still converted, but without repeating the lookup on every iteration.
For each key-value pair in the dictionary, if the value is not already a tensor, the function attempts to convert it with as_tensor. If the conversion fails (e.g., because the value cannot be converted to a tensor), it raises a ValueError with a descriptive message that includes the problematic key.
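The expected behaviour could be sketched roughly as follows. This is an illustration under stated assumptions, not the library's implementation: TinyBatch and _resolve_fns are hypothetical names, and the standard library's array.array stands in for a real tensor type so the example runs without extra dependencies.

```python
from array import array


class TinyBatch(dict):
    """Hypothetical stand-in for a batch container; not the transformers class."""

    def _resolve_fns(self):
        # Assumed helper: returns the (check, convert) pair for this backend.
        def is_tensor(value):
            return isinstance(value, array)

        def as_tensor(value):
            return array("f", value)  # float32 array as a toy "tensor"

        return is_tensor, as_tensor

    def convert_to_tensors(self):
        # Resolve both callables once, before the loop starts.
        is_tensor, as_tensor = self._resolve_fns()
        for key, value in self.items():
            if is_tensor(value):
                continue  # already converted, leave it untouched
            try:
                # Reassigning an existing key does not resize the dict,
                # so this is safe while iterating over items().
                self[key] = as_tensor(value)
            except (TypeError, ValueError) as exc:
                # Surface the offending key in the error message.
                raise ValueError(
                    f"Unable to create tensor for key {key!r}: {exc}"
                ) from exc
        return self
```

Calling TinyBatch(a=[1, 2, 3]).convert_to_tensors() converts the list in place, while a value such as a plain string raises a ValueError that names the key.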