huggingface / datasets

🤗 The largest hub of ready-to-use datasets for ML models with fast, easy-to-use and efficient data manipulation tools
https://huggingface.co/docs/datasets
Apache License 2.0

array is too big; `arr.size * arr.dtype.itemsize` is larger than the maximum possible size in Datasets #5914

Open ravenouse opened 1 year ago

ravenouse commented 1 year ago

Describe the bug

When using the `filter` or `map` function to preprocess a dataset, a ValueError is raised with the message "array is too big; `arr.size * arr.dtype.itemsize` is larger than the maximum possible size."

Detailed error message:

Traceback (most recent call last):
  File "data_processing.py", line 26, in <module>
    processed_dataset[split] = samromur_children[split].map(prepare_dataset, cache_file_name=cache_dict[split], writer_batch_size=50)
  File "/projects/zhwa3087/software/anaconda/envs/mycustomenv/lib/python3.7/site-packages/datasets/arrow_dataset.py", line 2405, in map
    desc=desc,
  File "/projects/zhwa3087/software/anaconda/envs/mycustomenv/lib/python3.7/site-packages/datasets/arrow_dataset.py", line 557, in wrapper
    out: Union["Dataset", "DatasetDict"] = func(self, *args, **kwargs)
  File "/projects/zhwa3087/software/anaconda/envs/mycustomenv/lib/python3.7/site-packages/datasets/arrow_dataset.py", line 524, in wrapper
    out: Union["Dataset", "DatasetDict"] = func(self, *args, **kwargs)
  File "/projects/zhwa3087/software/anaconda/envs/mycustomenv/lib/python3.7/site-packages/datasets/fingerprint.py", line 480, in wrapper
    out = func(self, *args, **kwargs)
  File "/projects/zhwa3087/software/anaconda/envs/mycustomenv/lib/python3.7/site-packages/datasets/arrow_dataset.py", line 2756, in _map_single
    example = apply_function_on_filtered_inputs(example, i, offset=offset)
  File "/projects/zhwa3087/software/anaconda/envs/mycustomenv/lib/python3.7/site-packages/datasets/arrow_dataset.py", line 2655, in apply_function_on_filtered_inputs
    processed_inputs = function(*fn_args, *additional_args, **fn_kwargs)
  File "/projects/zhwa3087/software/anaconda/envs/mycustomenv/lib/python3.7/site-packages/datasets/arrow_dataset.py", line 2347, in decorated
    result = f(decorated_item, *args, **kwargs)
  File "data_processing.py", line 11, in prepare_dataset
    audio = batch["audio"]
  File "/projects/zhwa3087/software/anaconda/envs/mycustomenv/lib/python3.7/site-packages/datasets/arrow_dataset.py", line 123, in __getitem__
    value = decode_nested_example(self.features[key], value) if value is not None else None
  File "/projects/zhwa3087/software/anaconda/envs/mycustomenv/lib/python3.7/site-packages/datasets/features/features.py", line 1260, in decode_nested_example
    return schema.decode_example(obj, token_per_repo_id=token_per_repo_id) if obj is not None else None
  File "/projects/zhwa3087/software/anaconda/envs/mycustomenv/lib/python3.7/site-packages/datasets/features/audio.py", line 156, in decode_example
    array, sampling_rate = self._decode_non_mp3_path_like(path, token_per_repo_id=token_per_repo_id)
  File "/projects/zhwa3087/software/anaconda/envs/mycustomenv/lib/python3.7/site-packages/datasets/features/audio.py", line 257, in _decode_non_mp3_path_like
    array, sampling_rate = librosa.load(f, sr=self.sampling_rate, mono=self.mono)
  File "/projects/zhwa3087/software/anaconda/envs/mycustomenv/lib/python3.7/site-packages/librosa/core/audio.py", line 176, in load
    y, sr_native = __soundfile_load(path, offset, duration, dtype)
  File "/projects/zhwa3087/software/anaconda/envs/mycustomenv/lib/python3.7/site-packages/librosa/core/audio.py", line 222, in __soundfile_load
    y = sf_desc.read(frames=frame_duration, dtype=dtype, always_2d=False).T
  File "/projects/zhwa3087/software/anaconda/envs/mycustomenv/lib/python3.7/site-packages/soundfile.py", line 891, in read
    out = self._create_empty_array(frames, always_2d, dtype)
  File "/projects/zhwa3087/software/anaconda/envs/mycustomenv/lib/python3.7/site-packages/soundfile.py", line 1323, in _create_empty_array
    return np.empty(shape, dtype, order='C')
ValueError: array is too big; `arr.size * arr.dtype.itemsize` is larger than the maximum possible size.
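
Note that the last frame shows the failure originates in NumPy rather than in datasets: soundfile pre-allocates the decode buffer with np.empty, and NumPy raises exactly this ValueError when the requested byte count exceeds the largest possible array size, which points to at least one audio file whose header reports a nonsensical frame count. A minimal sketch of the underlying NumPy behavior (the shape below is hypothetical, chosen only to overflow a 64-bit size limit):

import numpy as np

# Requesting 2**62 float32 values (~2**64 bytes) exceeds NumPy's maximum
# allocation size on a 64-bit build and reproduces the error in the traceback.
np.empty((2 ** 62,), dtype=np.float32)
# ValueError: array is too big; `arr.size * arr.dtype.itemsize` is larger
# than the maximum possible size.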

Steps to reproduce the bug

from datasets import load_dataset, DatasetDict
from transformers import WhisperFeatureExtractor
from transformers import WhisperTokenizer

samromur_children = load_dataset("language-and-voice-lab/samromur_children")
feature_extractor = WhisperFeatureExtractor.from_pretrained("openai/whisper-small")
tokenizer = WhisperTokenizer.from_pretrained("openai/whisper-small", language="icelandic", task="transcribe")

def prepare_dataset(batch):
    # load and resample the audio data from 48kHz to 16kHz
    audio = batch["audio"]

    # compute log-Mel input features from input audio array 
    batch["input_features"] = feature_extractor(audio["array"], sampling_rate=16000).input_features[0]

    # encode target text to label ids 
    batch["labels"] = tokenizer(batch["normalized_text"]).input_ids
    return batch

cache_dict = {"train": "./cache/audio_train.cache",
              "validation": "./cache/audio_validation.cache",
              "test": "./cache/audio_test.cache"}
filter_cache_dict = {"train": "./cache/filter_train.arrow",
                     "validation": "./cache/filter_validation.arrow",
                     "test": "./cache/filter_test.arrow"}

print("before filtering")
print(samromur_children)
# filter the dataset to keep only examples with more than 2 seconds of audio (16000 samples/s * 2 s)
samromur_children = samromur_children.filter(lambda example: example["audio"]["array"].shape[0] > 16000 * 2, cache_file_names=filter_cache_dict)
print("after filtering")
print(samromur_children)
processed_dataset = DatasetDict()
# processed_dataset = samromur_children.map(prepare_dataset, cache_file_names=cache_dict, num_proc=10,)
for split in ["train", "validation", "test"]:
    processed_dataset[split] = samromur_children[split].map(prepare_dataset, cache_file_name=cache_dict[split])
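
One way to locate the offending file (a hypothetical diagnostic sketch, not part of the original report, assuming the installed datasets version supports Audio(decode=False)) is to disable automatic decoding so the "audio" column yields raw paths/bytes instead of calling librosa, then probe each file header with soundfile.info; the one-hour threshold below is an arbitrary assumption for this corpus:

import io

import soundfile as sf
from datasets import Audio, load_dataset

raw = load_dataset("language-and-voice-lab/samromur_children")
# decode=False makes the "audio" column return {"path": ..., "bytes": ...}
# instead of decoding, so a corrupt file no longer aborts iteration
raw = raw.cast_column("audio", Audio(decode=False))

for split in ["train", "validation", "test"]:
    for idx, example in enumerate(raw[split]):
        audio = example["audio"]
        source = io.BytesIO(audio["bytes"]) if audio.get("bytes") else audio["path"]
        try:
            info = sf.info(source)
        except RuntimeError as err:
            # soundfile could not even parse the header: almost certainly corrupt
            print(f"{split}[{idx}]: unreadable ({err})")
            continue
        if info.frames > info.samplerate * 60 * 60:
            # more than an hour of audio in this corpus suggests a garbage
            # frame count from a damaged header
            print(f"{split}[{idx}]: {info.frames} frames at {info.samplerate} Hz")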

Expected behavior

The dataset is processed successfully and can then be used to train the model.

Environment info

Python version: 3.7.13
datasets package version: 2.4.0
librosa package version: 0.10.0.post2

pranav-sridhar commented 4 weeks ago

Was a fix for this identified?

ravenouse commented 4 weeks ago

> Was a fix for this identified?

Hi @pranav-sridhar, have you run into the same issue with this dataset? I modified the dataset construction script to fix the problem; feel free to use the updated version below to avoid the issue.

Ericwang/samromur_children_test
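
For anyone hitting the same error: assuming the repo id above is still available, loading the patched copy should be a drop-in replacement:

from datasets import load_dataset

# swap in the patched repo id from the link above
samromur_children = load_dataset("Ericwang/samromur_children_test")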