tensorflow / datasets

TFDS is a collection of datasets ready to use with TensorFlow, Jax, ...
https://www.tensorflow.org/datasets
Apache License 2.0

Higgs Dataset - ValueError on download_and_prepare() #5428

Open zwouter opened 3 months ago

zwouter commented 3 months ago

Short description

The Higgs dataset cannot be used, probably because the source data contains unexpected missing values.

Environment information

Reproduction instructions

ds_builder = tfds.builder('higgs')
ds_builder.download_and_prepare()

Logs

Downloading and preparing dataset Unknown size (download: Unknown size, generated: Unknown size, total: Unknown size) to C:\Users\WSUIDGEE\tensorflow_datasets\higgs\2.0.0...
Extraction completed...: 0 file [00:00, ? file/s]████████████████████████████████████████| 1/1 [00:00<00:00, 157.03 url/s] 
Dl Size...: 100%|█████████████████████████████████████████████| 2816407858/2816407858 [00:00<00:00, 300620199629.49 MiB/s] 
Dl Completed...: 100%|████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 96.44 url/s] 
Generating splits...:   0%|                                                                    | 0/1 [00:00<?, ? splits/s] 
Traceback (most recent call last):
  File "c:\Users\WSUIDGEE\Documents\FP\AutoSparse\main.py", line 105, in <module>
    evaluate_configuration(
  File "c:\Users\WSUIDGEE\Documents\FP\AutoSparse\main.py", line 87, in evaluate_configuration
    ds = Dataset(dataset)
         ^^^^^^^^^^^^^^^^
  File "c:\Users\WSUIDGEE\Documents\FP\AutoSparse\datasets.py", line 17, in __init__
    trains_ds, vals_ds, test_ds = self.__load_dataset(dataset_name, k_folds)
                                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "c:\Users\WSUIDGEE\Documents\FP\AutoSparse\datasets.py", line 46, in __load_dataset
    ds_builder.download_and_prepare()
  File "C:\Users\WSUIDGEE\Documents\FP\AutoSparse\venv\Lib\site-packages\tensorflow_datasets\core\logging\__init__.py", line 168, in __call__
    return function(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\WSUIDGEE\Documents\FP\AutoSparse\venv\Lib\site-packages\tensorflow_datasets\core\dataset_builder.py", line 691, in download_and_prepare
    self._download_and_prepare(
  File "C:\Users\WSUIDGEE\Documents\FP\AutoSparse\venv\Lib\site-packages\tensorflow_datasets\core\dataset_builder.py", line 1584, in _download_and_prepare
    future = split_builder.submit_split_generation(
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\WSUIDGEE\Documents\FP\AutoSparse\venv\Lib\site-packages\tensorflow_datasets\core\split_builder.py", line 341, in submit_split_generation
    return self._build_from_generator(**build_kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\WSUIDGEE\Documents\FP\AutoSparse\venv\Lib\site-packages\tensorflow_datasets\core\split_builder.py", line 417, in _build_from_generator
    utils.reraise(e, prefix=f'Failed to encode example:\n{example}\n')
  File "C:\Users\WSUIDGEE\Documents\FP\AutoSparse\venv\Lib\site-packages\tensorflow_datasets\core\split_builder.py", line 415, in _build_from_generator
    example = self._features.encode_example(example)
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\WSUIDGEE\Documents\FP\AutoSparse\venv\Lib\site-packages\tensorflow_datasets\core\features\features_dict.py", line 243, in encode_example
    utils.reraise(
  File "C:\Users\WSUIDGEE\Documents\FP\AutoSparse\venv\Lib\site-packages\tensorflow_datasets\core\features\features_dict.py", line 241, in encode_example
    example[k] = feature.encode_example(example_value)
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\WSUIDGEE\Documents\FP\AutoSparse\venv\Lib\site-packages\tensorflow_datasets\core\features\tensor_feature.py", line 175, in encode_example
    example_data = np.array(example_data, dtype=np_dtype)
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ValueError: Failed to encode example:
{'class_label': '1.000000000000000000e+00', 'lepton_pT': '3.647371232509613037e-01', 'lepton_eta': '1.489144206047058105e+00', 'lepton_phi': '3.394368290901184082e-01', 'missing_energy_magnitude': '1.493860602378845215e+00', 'missing_energy_phi': '-1.723330497741699219e+00', 'jet_1_pt': '7.524616718292236328e-01', 'jet_1_eta': '-2.802605032920837402e-01', 'jet_1_phi': '-4.207125604152679443e-01', 'jet_1_b-tag': '2.173076152801513672e+00', 'jet_2_pt': '', 'jet_2_eta': None, 'jet_2_phi': None, 'jet_2_b-tag': None, 'jet_3_pt': None, 'jet_3_eta': None, 'jet_3_phi': None, 'jet_3_b-tag': None, 'jet_4_pt': None, 'jet_4_eta': None, 'jet_4_phi': None, 'jet_4_b-tag': None, 'm_jj': None, 'm_jjj': None, 'm_lv': None, 'm_jlv': None, 'm_bb': None, 'm_wbb': None, 'm_wwbb': None}
In <Tensor> with name "jet_2_pt":
could not convert string to float: ''

Expected behavior

I expect the dataset to be downloaded and prepared so that I can quickly load it in the future.

Additional context

I am new to using tfds, but other datasets (e.g. MNIST, CIFAR10) work as intended. The dataset is not supposed to have missing values, according to https://archive.ics.uci.edu/dataset/280/higgs
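For context, the final frame of the traceback (tensor_feature.py calling np.array(example_data, dtype=np_dtype)) suggests the error can be reproduced in isolation: NumPy cannot parse an empty string as a float. A minimal sketch:

```python
import numpy as np

# TFDS encodes each scalar field via np.array(value, dtype=np_dtype)
# (see tensor_feature.py in the traceback). An empty string, i.e. a
# missing value in the CSV row, cannot be converted:
try:
    np.array('', dtype=np.float32)
except ValueError as e:
    print(e)  # could not convert string to float: ''
```

This matches the "jet_2_pt": '' field in the failing example above; the trailing None values are likely the rest of the same short CSV row.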

marcenacp commented 3 months ago

Could this be a Windows-specific issue? I can't reproduce it locally, and download_and_prepare succeeds for me on this dataset. If the problem persists, you could also try filtering out the missing values (example).

If you find a fix for Windows, please feel free to push a PR that fixes the issue :) Thanks!
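One possible workaround, a minimal sketch (not the linked example, whose target is elided above): filter out CSV rows with empty fields before they reach the feature encoder. The function name and sample data here are illustrative only.

```python
import csv
import io

def complete_rows(lines):
    """Yield only CSV rows in which every field is non-empty."""
    for row in csv.reader(lines):
        if all(field.strip() for field in row):
            yield row

# Illustrative sample: the second row has a missing (empty) field,
# mimicking the truncated Higgs row from the traceback.
sample = io.StringIO("1.0,0.36,1.48\n1.0,,-1.72\n0.0,0.75,0.28\n")
for row in complete_rows(sample):
    print(row)
```

Whether dropping incomplete rows is acceptable depends on the use case; imputing a sentinel value such as 0.0 would be an alternative.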