Open ravenouse opened 1 year ago
Was a fix for this identified?
Hi @pranav-sridhar, have you run into a similar issue with this dataset? I’ve modified the dataset construction script to address the problem; feel free to use the updated version to avoid the issue.
Describe the bug
When using the `filter` or `map` function to preprocess a dataset, a ValueError is raised with the error message "array is too big; `arr.size * arr.dtype.itemsize` is larger than the maximum possible size."

Detailed error message:
Traceback (most recent call last):
File "data_processing.py", line 26, in <module>
processed_dataset[split] = samromur_children[split].map(prepare_dataset, cache_file_name=cache_dict[split],writer_batch_size = 50)
File "/projects/zhwa3087/software/anaconda/envs/mycustomenv/lib/python3.7/site-packages/datasets/arrow_dataset.py", line 2405, in map
desc=desc,
File "/projects/zhwa3087/software/anaconda/envs/mycustomenv/lib/python3.7/site-packages/datasets/arrow_dataset.py", line 557, in wrapper
out: Union["Dataset", "DatasetDict"] = func(self, *args, **kwargs)
File "/projects/zhwa3087/software/anaconda/envs/mycustomenv/lib/python3.7/site-packages/datasets/arrow_dataset.py", line 524, in wrapper
out: Union["Dataset", "DatasetDict"] = func(self, *args, **kwargs)
File "/projects/zhwa3087/software/anaconda/envs/mycustomenv/lib/python3.7/site-packages/datasets/fingerprint.py", line 480, in wrapper
out = func(self, *args, **kwargs)
File "/projects/zhwa3087/software/anaconda/envs/mycustomenv/lib/python3.7/site-packages/datasets/arrow_dataset.py", line 2756, in _map_single
example = apply_function_on_filtered_inputs(example, i, offset=offset)
File "/projects/zhwa3087/software/anaconda/envs/mycustomenv/lib/python3.7/site-packages/datasets/arrow_dataset.py", line 2655, in apply_function_on_filtered_inputs
processed_inputs = function(*fn_args, *additional_args, **fn_kwargs)
File "/projects/zhwa3087/software/anaconda/envs/mycustomenv/lib/python3.7/site-packages/datasets/arrow_dataset.py", line 2347, in decorated
result = f(decorated_item, *args, **kwargs)
File "data_processing.py", line 11, in prepare_dataset
audio = batch["audio"]
File "/projects/zhwa3087/software/anaconda/envs/mycustomenv/lib/python3.7/site-packages/datasets/arrow_dataset.py", line 123, in __getitem__
value = decode_nested_example(self.features[key], value) if value is not None else None
File "/projects/zhwa3087/software/anaconda/envs/mycustomenv/lib/python3.7/site-packages/datasets/features/features.py", line 1260, in decode_nested_example
return schema.decode_example(obj, token_per_repo_id=token_per_repo_id) if obj is not None else None
File "/projects/zhwa3087/software/anaconda/envs/mycustomenv/lib/python3.7/site-packages/datasets/features/audio.py", line 156, in decode_example
array, sampling_rate = self._decode_non_mp3_path_like(path, token_per_repo_id=token_per_repo_id)
File "/projects/zhwa3087/software/anaconda/envs/mycustomenv/lib/python3.7/site-packages/datasets/features/audio.py", line 257, in _decode_non_mp3_path_like
array, sampling_rate = librosa.load(f, sr=self.sampling_rate, mono=self.mono)
File "/projects/zhwa3087/software/anaconda/envs/mycustomenv/lib/python3.7/site-packages/librosa/core/audio.py", line 176, in load
y, sr_native = __soundfile_load(path, offset, duration, dtype)
File "/projects/zhwa3087/software/anaconda/envs/mycustomenv/lib/python3.7/site-packages/librosa/core/audio.py", line 222, in __soundfile_load
y = sf_desc.read(frames=frame_duration, dtype=dtype, always_2d=False).T
File "/projects/zhwa3087/software/anaconda/envs/mycustomenv/lib/python3.7/site-packages/soundfile.py", line 891, in read
out = self._create_empty_array(frames, always_2d, dtype)
File "/projects/zhwa3087/software/anaconda/envs/mycustomenv/lib/python3.7/site-packages/soundfile.py", line 1323, in _create_empty_array
return np.empty(shape, dtype, order='C')
ValueError: array is too big; `arr.size * arr.dtype.itemsize` is larger than the maximum possible size.
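For anyone reading the last frames of the traceback: `soundfile` asks NumPy for an empty buffer sized by the frame count it read from the file, and NumPy refuses because the byte count is not representable. The snippet below is only a rough sketch of that arithmetic, not the library's actual check. It assumes the limit is a signed pointer-sized integer (`sys.maxsize`) and 4-byte float32 samples (librosa's default dtype). The helper names `requested_bytes` and `is_too_big` are made up for illustration. A plausible, though unconfirmed, cause is a file whose header reports an absurd frame count.

```python
import sys

# Assumption: an allocation fails once its byte count no longer fits in a
# signed pointer-sized integer, which is sys.maxsize on CPython.
MAX_BYTES = sys.maxsize


def requested_bytes(frames: int, channels: int = 1, itemsize: int = 4) -> int:
    """Bytes needed for a (frames, channels) buffer of itemsize-byte samples."""
    return frames * channels * itemsize


def is_too_big(frames: int, channels: int = 1, itemsize: int = 4) -> bool:
    """True if the buffer would exceed the assumed allocation limit."""
    return requested_bytes(frames, channels, itemsize) > MAX_BYTES


# One hour of mono audio at 16 kHz stays far below the limit.
one_hour = is_too_big(16_000 * 3600)

# A frame count near the 64-bit range, as a damaged header might report,
# pushes the requested size past the limit.
damaged = is_too_big(2**62)
```

If this is what is happening, the oversized `frames` value would come from a specific file rather than from `map` or `filter` themselves, so checking the reported frame count of each audio file may help narrow it down.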
Steps to reproduce the bug
Expected behavior
The dataset is successfully processed and ready to train the model.
Environment info
Python version: 3.7.13
datasets package version: 2.4.0
librosa package version: 0.10.0.post2