Open Coolcoder45 opened 6 months ago
Hi, thank you for reporting! This is definitely a bug.
Workaround: add the following argument to your tfds.load call:
tfds.load(..., download_and_prepare_kwargs={'file_format': tfds.core.FileFormat.ARRAY_RECORD})
We'll look into how to fix the code and post an update on this bug.
It's still giving an error.
import tensorflow_datasets as tfds
plant_leaves_data, plant_leaves_info = tfds.load('plant_leaves', split='train', shuffle_files=True, download_and_prepare_kwargs={'file_format': tfds.core.FileFormat.ARRAY_RECORD})
Gives
Downloading and preparing dataset 6.56 GiB (download: 6.56 GiB, generated: 6.81 GiB, total: 13.37 GiB) to /root/tensorflow_datasets/plant_leaves/0.1.1...
---------------------------------------------------------------------------
RuntimeError Traceback (most recent call last)
<ipython-input-3-608b46b22c6c> in <cell line: 4>()
2 #plant_leaves = tfds.load('plant_leaves', split='train', shuffle_files=True)
3 #plant_leaves_data, plant_leaves_info = tfds.load('plant_leaves', split='train', shuffle_files=True, as_data_source=True)
----> 4 plant_leaves_data, plant_leaves_info = tfds.load('plant_leaves', split='train', shuffle_files=True, download_and_prepare_kwargs={'file_format': tfds.core.FileFormat.ARRAY_RECORD})
5 frames
/usr/local/lib/python3.10/dist-packages/tensorflow_datasets/core/logging/__init__.py in __call__(self, function, instance, args, kwargs)
167 metadata = self._start_call()
168 try:
--> 169 return function(*args, **kwargs)
170 except Exception:
171 metadata.mark_error()
/usr/local/lib/python3.10/dist-packages/tensorflow_datasets/core/load.py in load(name, split, data_dir, batch_size, shuffle_files, download, as_supervised, decoders, read_config, with_info, builder_kwargs, download_and_prepare_kwargs, as_dataset_kwargs, try_gcs)
645 try_gcs,
646 )
--> 647 _download_and_prepare_builder(dbuilder, download, download_and_prepare_kwargs)
648
649 if as_dataset_kwargs is None:
/usr/local/lib/python3.10/dist-packages/tensorflow_datasets/core/load.py in _download_and_prepare_builder(dbuilder, download, download_and_prepare_kwargs)
504 if download:
505 download_and_prepare_kwargs = download_and_prepare_kwargs or {}
--> 506 dbuilder.download_and_prepare(**download_and_prepare_kwargs)
507
508
/usr/local/lib/python3.10/dist-packages/tensorflow_datasets/core/logging/__init__.py in __call__(self, function, instance, args, kwargs)
167 metadata = self._start_call()
168 try:
--> 169 return function(*args, **kwargs)
170 except Exception:
171 metadata.mark_error()
/usr/local/lib/python3.10/dist-packages/tensorflow_datasets/core/dataset_builder.py in download_and_prepare(self, download_dir, download_config, file_format)
679 # to generate the files.
680 if file_format:
--> 681 self.info.set_file_format(file_format, override=True)
682
683 # Create a tmp dir and rename to self.data_dir on successful exit.
/usr/local/lib/python3.10/dist-packages/tensorflow_datasets/core/dataset_info.py in set_file_format(self, file_format, override)
470 )
471 if override and self._fully_initialized:
--> 472 raise RuntimeError(
473 "Cannot override the file format "
474 "when the DatasetInfo is already fully initialized!"
RuntimeError: Cannot override the file format when the DatasetInfo is already fully initialized!
Same errors on refcoco dataset.
NotImplementedError: `.as_dataset()` not implemented for ArrayRecord files. Please, use `.as_data_source()`.
Anyway, one thing I do to work around this is to add the following lines:
builder = tfds.builder('ref_coco/refcocog_umd')
builder.info.set_file_format(tfds.core.FileFormat.PARQUET, override=True, override_if_initialized=True)
builder.download_and_prepare()
ref_ds = tfds.load('ref_coco/refcocog_umd', split='validation')
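For what it's worth, the NotImplementedError above suggests that the read method has to match the file format the data was prepared in. Here is a minimal sketch of that idea, assuming `builder.info.file_format` reflects the prepared files. `reader_method_for` and `load_split` are hypothetical helper names, not part of TFDS, and this has not been checked against the datasets in this thread:

```python
from typing import Any


def reader_method_for(file_format: str) -> str:
    """Return the builder method name suited to a prepared file format.

    Assumption: ArrayRecord data is read with `as_data_source()`, as the
    NotImplementedError in this thread states; other formats use `as_dataset()`.
    """
    if file_format.lower() == "array_record":
        return "as_data_source"
    return "as_dataset"


def load_split(name: str, split: str) -> Any:
    """Prepare `name` if needed, then read `split` with the matching method."""
    # Imported lazily so the pure helper above works without TFDS installed.
    import tensorflow_datasets as tfds

    builder = tfds.builder(name)
    builder.download_and_prepare()
    # `info.file_format` is a FileFormat enum (or None); fall back to tfrecord.
    fmt = builder.info.file_format
    fmt_name = fmt.value if fmt is not None else "tfrecord"
    return getattr(builder, reader_method_for(fmt_name))(split=split)
```

With this, something like `load_split('ref_coco/refcocog_umd', 'validation')` would return either a `tf.data.Dataset` or a data source, depending on how the files on disk were written.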
builder = tfds.builder('oxford_iiit_pet')
builder.info.set_file_format(tfds.core.FileFormat.PARQUET, override=True, override_if_initialized=True)
builder.download_and_prepare()
dataset, info = tfds.load('oxford_iiit_pet:4.0.0', download_and_prepare_kwargs={'file_format': tfds.core.FileFormat.ARRAY_RECORD})
also errors:
NotImplementedError Traceback (most recent call last)
Cell In[34], line 5
2 builder.info.set_file_format(tfds.core.FileFormat.PARQUET, override=True, override_if_initialized=True)
3 builder.download_and_prepare()
----> 5 dataset, info = tfds.load('oxford_iiit_pet:4.0.0', download_and_prepare_kwargs={'file_format': tfds.core.FileFormat.ARRAY_RECORD})
File /usr/local/lib/python3.12/dist-packages/tensorflow_datasets/core/logging/__init__.py:176, in _FunctionDecorator.__call__(self, function, instance, args, kwargs)
174 metadata = self._start_call()
175 try:
--> 176 return function(*args, **kwargs)
177 except Exception:
178 metadata.mark_error()
File /usr/local/lib/python3.12/dist-packages/tensorflow_datasets/core/load.py:673, in load(name, split, data_dir, batch_size, shuffle_files, download, as_supervised, decoders, read_config, with_info, builder_kwargs, download_and_prepare_kwargs, as_dataset_kwargs, try_gcs)
670 as_dataset_kwargs.setdefault('shuffle_files', shuffle_files)
671 as_dataset_kwargs.setdefault('read_config', read_config)
--> 673 ds = dbuilder.as_dataset(**as_dataset_kwargs)
674 if with_info:
675 return ds, dbuilder.info
File /usr/local/lib/python3.12/dist-packages/tensorflow_datasets/core/logging/__init__.py:176, in _FunctionDecorator.__call__(self, function, instance, args, kwargs)
174 metadata = self._start_call()
175 try:
--> 176 return function(*args, **kwargs)
177 except Exception:
178 metadata.mark_error()
File /usr/local/lib/python3.12/dist-packages/tensorflow_datasets/core/dataset_builder.py:1026, in DatasetBuilder.as_dataset(self, split, batch_size, shuffle_files, decoders, read_config, as_supervised)
1017 # Create a dataset for each of the given splits
1018 build_single_dataset = functools.partial(
1019 self._build_single_dataset,
1020 shuffle_files=shuffle_files,
(...)
1024 as_supervised=as_supervised,
1025 )
-> 1026 all_ds = tree.map_structure(build_single_dataset, split)
1027 return all_ds
File /usr/local/lib/python3.12/dist-packages/tree/__init__.py:428, in map_structure(func, *structures, **kwargs)
425 for other in structures[1:]:
426 assert_same_structure(structures[0], other, check_types=check_types)
427 return unflatten_as(structures[0],
--> 428 [func(*args) for args in zip(*map(flatten, structures))])
File /usr/local/lib/python3.12/dist-packages/tensorflow_datasets/core/dataset_builder.py:1044, in DatasetBuilder._build_single_dataset(self, split, batch_size, shuffle_files, decoders, read_config, as_supervised)
1041 batch_size = self.info.splits.total_num_examples or sys.maxsize
1043 # Build base dataset
-> 1044 ds = self._as_dataset(
1045 split=split,
1046 shuffle_files=shuffle_files,
1047 decoders=decoders,
1048 read_config=read_config,
1049 )
1050 # Auto-cache small datasets which are small enough to fit in memory.
1051 if self._should_cache_ds(
1052 split=split, shuffle_files=shuffle_files, read_config=read_config
1053 ):
File /usr/local/lib/python3.12/dist-packages/tensorflow_datasets/core/dataset_builder.py:1498, in FileReaderBuilder._as_dataset(self, split, decoders, read_config, shuffle_files)
1492 reader = reader_lib.Reader(
1493 self.data_dir,
1494 example_specs=example_specs,
1495 file_format=self.info.file_format,
1496 )
1497 decode_fn = functools.partial(features.decode_example, decoders=decoders)
-> 1498 return reader.read(
1499 instructions=split,
1500 split_infos=self.info.splits.values(),
1501 decode_fn=decode_fn,
1502 read_config=read_config,
1503 shuffle_files=shuffle_files,
1504 disable_shuffling=self.info.disable_shuffling,
1505 )
File /usr/local/lib/python3.12/dist-packages/tensorflow_datasets/core/reader.py:430, in Reader.read(self, instructions, split_infos, read_config, shuffle_files, disable_shuffling, decode_fn)
421 file_instructions = splits_dict[instruction].file_instructions
422 return self.read_files(
423 file_instructions,
424 read_config=read_config,
(...)
427 decode_fn=decode_fn,
428 )
--> 430 return tree.map_structure(_read_instruction_to_ds, instructions)
File /usr/local/lib/python3.12/dist-packages/tree/__init__.py:428, in map_structure(func, *structures, **kwargs)
425 for other in structures[1:]:
426 assert_same_structure(structures[0], other, check_types=check_types)
427 return unflatten_as(structures[0],
--> 428 [func(*args) for args in zip(*map(flatten, structures))])
File /usr/local/lib/python3.12/dist-packages/tensorflow_datasets/core/reader.py:422, in Reader.read.
File /usr/local/lib/python3.12/dist-packages/tensorflow_datasets/core/reader.py:462, in Reader.read_files(self, file_instructions, read_config, shuffle_files, disable_shuffling, decode_fn)
459 raise ValueError(msg)
461 # Read serialized example (eventually with `tfds_id`)
--> 462 ds = _read_files(
463 file_instructions=file_instructions,
464 read_config=read_config,
465 shuffle_files=shuffle_files,
466 disable_shuffling=disable_shuffling,
467 file_format=self._file_format,
468 )
470 # Parse and decode
471 def parse_and_decode(ex: Tensor) -> TreeDict[Tensor]:
472 # TODO(pierrot): `parse_example` uses
473 # `tf.io.parse_single_example`. It might be faster to use `parse_example`,
474 # after batching.
475 # https://www.tensorflow.org/api_docs/python/tf/io/parse_example
File /usr/local/lib/python3.12/dist-packages/tensorflow_datasets/core/reader.py:302, in _read_files(file_instructions, read_config, shuffle_files, disable_shuffling, file_format)
295 if (
296 shuffle_files
297 and read_config.shuffle_seed is None
298 and tf_compat.get_option_deterministic(read_config.options) is None
299 ):
300 deterministic = False
--> 302 ds = instruction_ds.interleave(
303 functools.partial(
304 _get_dataset_from_filename,
305 do_skip=do_skip,
306 do_take=do_take,
307 file_format=file_format,
308 add_tfds_id=read_config.add_tfds_id,
309 override_buffer_size=read_config.override_buffer_size,
310 ),
311 cycle_length=cycle_length,
312 block_length=block_length,
313 num_parallel_calls=read_config.num_parallel_calls_for_interleave_files,
314 deterministic=deterministic,
315 )
317 return assert_cardinality_and_apply_options(ds)
File /usr/local/lib/python3.12/dist-packages/tensorflow/python/data/ops/dataset_ops.py:2534, in DatasetV2.interleave(self, map_func, cycle_length, block_length, num_parallel_calls, deterministic, name)
2530 # Loaded lazily due to a circular dependency (
2531 # dataset_ops -> interleave_op -> dataset_ops).
2532 # pylint: disable=g-import-not-at-top,protected-access
2533 from tensorflow.python.data.ops import interleave_op
-> 2534 return interleave_op._interleave(self, map_func, cycle_length, block_length,
2535 num_parallel_calls, deterministic, name)
File /usr/local/lib/python3.12/dist-packages/tensorflow/python/data/ops/interleave_op.py:49, in _interleave(input_dataset, map_func, cycle_length, block_length, num_parallel_calls, deterministic, name)
46 return _InterleaveDataset(
47 input_dataset, map_func, cycle_length, block_length, name=name)
48 else:
---> 49 return _ParallelInterleaveDataset(
50 input_dataset,
51 map_func,
52 cycle_length,
53 block_length,
54 num_parallel_calls,
55 deterministic=deterministic,
56 name=name)
File /usr/local/lib/python3.12/dist-packages/tensorflow/python/data/ops/interleave_op.py:119, in _ParallelInterleaveDataset.__init__(self, input_dataset, map_func, cycle_length, block_length, num_parallel_calls, buffer_output_elements, prefetch_input_elements, deterministic, name)
117 """See `Dataset.interleave()` for details."""
118 self._input_dataset = input_dataset
--> 119 self._map_func = structured_function.StructuredFunctionWrapper(
120 map_func, self._transformation_name(), dataset=input_dataset)
121 if not isinstance(self._map_func.output_structure, dataset_ops.DatasetSpec):
122 raise TypeError(
123 "The `map_func` argument must return a `Dataset` object. Got "
124 f"{dataset_ops.get_type(self._map_func.output_structure)!r}.")
File /usr/local/lib/python3.12/dist-packages/tensorflow/python/data/ops/structured_function.py:265, in StructuredFunctionWrapper.__init__(self, func, transformation_name, dataset, input_classes, input_shapes, input_types, input_structure, add_to_graph, use_legacy_function, defun_kwargs)
258 warnings.warn(
259 "Even though the `tf.config.experimental_run_functions_eagerly` "
260 "option is set, this option does not apply to tf.data functions. "
261 "To force eager execution of tf.data functions, please use "
262 "`tf.data.experimental.enable_debug_mode()`.")
263 fn_factory = trace_tf_function(defun_kwargs)
--> 265 self._function = fn_factory()
266 # There is no graph to add in eager mode.
267 add_to_graph &= not context.executing_eagerly()
File /usr/local/lib/python3.12/dist-packages/tensorflow/python/eager/polymorphic_function/polymorphic_function.py:1251, in Function.get_concrete_function(self, *args, **kwargs)
1249 def get_concrete_function(self, *args, **kwargs):
1250 # Implements PolymorphicFunction.get_concrete_function.
-> 1251 concrete = self._get_concrete_function_garbage_collected(*args, **kwargs)
1252 concrete._garbage_collector.release() # pylint: disable=protected-access
1253 return concrete
File /usr/local/lib/python3.12/dist-packages/tensorflow/python/eager/polymorphic_function/polymorphic_function.py:1221, in Function._get_concrete_function_garbage_collected(self, *args, **kwargs)
1219 if self._variable_creation_config is None:
1220 initializers = []
-> 1221 self._initialize(args, kwargs, add_initializers_to=initializers)
1222 self._initialize_uninitialized_variables(initializers)
1224 if self._created_variables:
1225 # In this case we have created variables on the first call, so we run the
1226 # version which is guaranteed to never create variables.
File /usr/local/lib/python3.12/dist-packages/tensorflow/python/eager/polymorphic_function/polymorphic_function.py:696, in Function._initialize(self, args, kwds, add_initializers_to)
691 self._variable_creation_config = self._generate_scoped_tracing_options(
692 variable_capturing_scope,
693 tracing_compilation.ScopeType.VARIABLE_CREATION,
694 )
695 # Force the definition of the function for these arguments
--> 696 self._concrete_variable_creation_fn = tracing_compilation.trace_function(
697 args, kwds, self._variable_creation_config
698 )
700 def invalid_creator_scope(*unused_args, **unused_kwds):
701 """Disables variable creation."""
File /usr/local/lib/python3.12/dist-packages/tensorflow/python/eager/polymorphic_function/tracing_compilation.py:178, in trace_function(args, kwargs, tracing_options)
175 args = tracing_options.input_signature
176 kwargs = {}
--> 178 concrete_function = _maybe_define_function(
179 args, kwargs, tracing_options
180 )
182 if not tracing_options.bind_graph_to_function:
183 concrete_function._garbage_collector.release() # pylint: disable=protected-access
File /usr/local/lib/python3.12/dist-packages/tensorflow/python/eager/polymorphic_function/tracing_compilation.py:283, in _maybe_define_function(args, kwargs, tracing_options)
281 else:
282 target_func_type = lookup_func_type
--> 283 concrete_function = _create_concrete_function(
284 target_func_type, lookup_func_context, func_graph, tracing_options
285 )
287 if tracing_options.function_cache is not None:
288 tracing_options.function_cache.add(
289 concrete_function, current_func_context
290 )
File /usr/local/lib/python3.12/dist-packages/tensorflow/python/eager/polymorphic_function/tracing_compilation.py:310, in _create_concrete_function(function_type, type_context, func_graph, tracing_options)
303 placeholder_bound_args = function_type.placeholder_arguments(
304 placeholder_context
305 )
307 disable_acd = tracing_options.attributes and tracing_options.attributes.get(
308 attributes_lib.DISABLE_ACD, False
309 )
--> 310 traced_func_graph = func_graph_module.func_graph_from_py_func(
311 tracing_options.name,
312 tracing_options.python_function,
313 placeholder_bound_args.args,
314 placeholder_bound_args.kwargs,
315 None,
316 func_graph=func_graph,
317 add_control_dependencies=not disable_acd,
318 arg_names=function_type_utils.to_arg_names(function_type),
319 create_placeholders=False,
320 )
322 transform.apply_func_graph_transforms(traced_func_graph)
324 graph_capture_container = traced_func_graph.function_captures
File /usr/local/lib/python3.12/dist-packages/tensorflow/python/framework/func_graph.py:1059, in func_graph_from_py_func(name, python_func, args, kwargs, signature, func_graph, add_control_dependencies, arg_names, op_return_value, collections, capture_by_value, create_placeholders)
1056 return x
1058 _, original_func = tf_decorator.unwrap(python_func)
-> 1059 func_outputs = python_func(*func_args, **func_kwargs)
1061 # invariant: `func_outputs` contains only Tensors, CompositeTensors,
1062 # TensorArrays and `None`s.
1063 func_outputs = variable_utils.convert_variables_to_tensors(func_outputs)
File /usr/local/lib/python3.12/dist-packages/tensorflow/python/eager/polymorphic_function/polymorphic_function.py:599, in Function._generate_scoped_tracing_options.
File /usr/local/lib/python3.12/dist-packages/tensorflow/python/data/ops/structured_function.py:231, in StructuredFunctionWrapper.init.
File /usr/local/lib/python3.12/dist-packages/tensorflow/python/data/ops/structured_function.py:161, in StructuredFunctionWrapper.init.
File /usr/local/lib/python3.12/dist-packages/tensorflow/python/autograph/impl/api.py:690, in convert.
File /usr/local/lib/python3.12/dist-packages/tensorflow/python/autograph/impl/api.py:352, in converted_call(f, args, kwargs, caller_fn_scope, options)
349 new_args = f.args + args
350 logging.log(3, 'Forwarding call of partial %s with\n%s\n%s\n', f, new_args,
351 new_kwargs)
--> 352 return converted_call(
353 f.func,
354 new_args,
355 new_kwargs,
356 caller_fn_scope=caller_fn_scope,
357 options=options)
359 if inspect_utils.isbuiltin(f):
360 if f is eval:
File /usr/local/lib/python3.12/dist-packages/tensorflow/python/autograph/impl/api.py:331, in converted_call(f, args, kwargs, caller_fn_scope, options)
329 if conversion.is_in_allowlist_cache(f, options):
330 logging.log(2, 'Allowlisted %s: from cache', f)
--> 331 return _call_unconverted(f, args, kwargs, options, False)
333 if ag_ctx.control_status_ctx().status == ag_ctx.Status.DISABLED:
334 logging.log(2, 'Allowlisted: %s: AutoGraph is disabled in context', f)
File /usr/local/lib/python3.12/dist-packages/tensorflow/python/autograph/impl/api.py:459, in _call_unconverted(f, args, kwargs, options, update_cache)
456 return f.__self__.call(args, kwargs)
458 if kwargs is not None:
--> 459 return f(*args, **kwargs)
460 return f(*args)
File /usr/local/lib/python3.12/dist-packages/tensorflow_datasets/core/reader.py:69, in _get_dataset_from_filename(instruction, do_skip, do_take, file_format, add_tfds_id, override_buffer_size)
60 def _get_dataset_from_filename(
61 instruction: _Instruction,
62 do_skip: bool,
(...)
66 override_buffer_size: Optional[int] = None,
67 ) -> tf.data.Dataset:
68 """Returns a tf.data.Dataset instance from given instructions."""
---> 69 ds = file_adapters.ADAPTER_FOR_FORMAT[file_format].make_tf_data(
70 instruction.filepath, buffer_size=override_buffer_size
71 )
72 if do_skip:
73 ds = ds.skip(instruction.skip)
File /usr/local/lib/python3.12/dist-packages/tensorflow_datasets/core/file_adapters.py:267, in ArrayRecordFileAdapter.make_tf_data(cls, filename, buffer_size)
260 @classmethod
261 def make_tf_data(
262 cls,
263 filename: epath.PathLike,
264 buffer_size: int | None = None,
265 ) -> tf.data.Dataset:
266 """Returns TensorFlow Dataset comprising given array record file."""
--> 267 raise NotImplementedError(
268 '`.as_dataset()` not implemented for ArrayRecord files. Please, use'
269 ' `.as_data_source()`.'
270 )
NotImplementedError: `.as_dataset()` not implemented for ArrayRecord files. Please, use `.as_data_source()`.
Can you try with the following instead?
builder = tfds.builder('oxford_iiit_pet')
builder.info.set_file_format(tfds.core.FileFormat.PARQUET, override=True, override_if_initialized=True)
builder.download_and_prepare()
dataset, info = tfds.load('oxford_iiit_pet:4.0.0', download_and_prepare_kwargs={'file_format': tfds.core.FileFormat.PARQUET})
Hi @pierrot0, I tried this both on my local system and on Colab, using the PARQUET format as you suggested. I get something like the following:
I also tried using only the builder. Calling builder.as_data_source() gives us this result:
{'train': ArrayRecordDataSource(name=oxford_iiit_pet, split='train', decoders=None),
'test': ArrayRecordDataSource(name=oxford_iiit_pet, split='test', decoders=None)}
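The result above maps each split name to a data source. Assuming these sources behave like Python sequences (they support `len()` and integer indexing, which is my understanding of TFDS data sources), a quick way to peek at a few examples could look like the sketch below. `head` is a hypothetical helper, and the list is only a stand-in so the snippet runs without downloading anything:

```python
from typing import Any, List, Sequence


def head(source: Sequence[Any], n: int = 3) -> List[Any]:
    """Return the first `n` examples of a random-access source."""
    return [source[i] for i in range(min(n, len(source)))]


# Stand-in for `builder.as_data_source()['train']`; a real data source would
# be passed in the same way, since only len() and indexing are used.
fake_train = [{"label": 0}, {"label": 1}, {"label": 2}, {"label": 3}]
print(head(fake_train, 2))  # prints [{'label': 0}, {'label': 1}]
```

Because only `len()` and indexing are used, the helper never calls `.as_dataset()`, which is the call that raises for ArrayRecord files in the traceback above.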
Short description
tfds plant_leaves is not loading successfully; it throws NotImplementedError. Tried on May 16, 2024.
Environment information
Operating System: Windows 11
Python version: 3.10.12
tensorflow-datasets/tfds-nightly version: 4.9.4
tensorflow/tf-nightly version: 2.15.0
Does the issue still exist with the latest tfds-nightly package (pip install --upgrade tfds-nightly)? Yup
Reproduction instructions
Gives:
Expected behavior
To load dataset successfully.