skyflynil / stylegan2

StyleGAN2 - Official TensorFlow Implementation with practical improvements
http://arxiv.org/abs/1912.04958

Unable to use run_training.py with custom dataset #7

Open cyrilzakka opened 4 years ago

cyrilzakka commented 4 years ago

Trying to use run_training.py like so: !python run_training.py --num-gpus=1 --data-dir=dataset --config=config-f --dataset=blows --mirror-augment=false --metric=none --total-kimg=20000 --result-dir="/content/drive/My Drive/stylegan2/results" gives me the following error:

Local submit - run_dir: /content/drive/My Drive/stylegan2/results/00026-stylegan2-blows-1gpu-config-f
dnnlib: Running training.training_loop.training_loop() on localhost...
Streaming data using training.dataset.TFRecordDataset...
Traceback (most recent call last):
  File "run_training.py", line 209, in <module>
    main()
  File "run_training.py", line 204, in main
    run(**vars(args))
  File "run_training.py", line 129, in run
    dnnlib.submit_run(**kwargs)
  File "/content/stylegan2/dnnlib/submission/submit.py", line 343, in submit_run
    return farm.submit(submit_config, host_run_dir)
  File "/content/stylegan2/dnnlib/submission/internal/local.py", line 22, in submit
    return run_wrapper(submit_config)
  File "/content/stylegan2/dnnlib/submission/submit.py", line 280, in run_wrapper
    run_func_obj(**submit_config.run_func_kwargs)
  File "/content/stylegan2/training/training_loop.py", line 156, in training_loop
    training_set = dataset.load_dataset(data_dir=dnnlib.convert_path(data_dir), verbose=True, **dataset_args)
  File "/content/stylegan2/training/dataset.py", line 239, in load_dataset
    dataset = dnnlib.util.get_obj_by_name(class_name)(**adjusted_kwargs)
  File "/content/stylegan2/training/dataset.py", line 167, in __init__
    dset = dset.map(parse_tfrecord_tf_raw, num_parallel_calls=num_threads)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/data/ops/dataset_ops.py", line 1913, in map
    self, map_func, num_parallel_calls, preserve_cardinality=False))
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/data/ops/dataset_ops.py", line 3472, in __init__
    use_legacy_function=use_legacy_function)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/data/ops/dataset_ops.py", line 2713, in __init__
    self._function = wrapper_fn._get_concrete_function_internal()
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/eager/function.py", line 1853, in _get_concrete_function_internal
    *args, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/eager/function.py", line 1847, in _get_concrete_function_internal_garbage_collected
    graph_function, _, _ = self._maybe_define_function(args, kwargs)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/eager/function.py", line 2147, in _maybe_define_function
    graph_function = self._create_graph_function(args, kwargs)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/eager/function.py", line 2038, in _create_graph_function
    capture_by_value=self._capture_by_value),
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/func_graph.py", line 915, in func_graph_from_py_func
    func_outputs = python_func(*func_args, **func_kwargs)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/data/ops/dataset_ops.py", line 2707, in wrapper_fn
    ret = _wrapper_helper(*args)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/data/ops/dataset_ops.py", line 2652, in _wrapper_helper
    ret = autograph.tf_convert(func, ag_ctx)(*nested_args)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/autograph/impl/api.py", line 237, in wrapper
    raise e.ag_error_metadata.to_exception(e)
TypeError: in converted code:

    /content/stylegan2/training/dataset.py:27 parse_tfrecord_tf_raw  *
        features = tf.parse_single_example(
    /usr/local/lib/python3.6/dist-packages/tensorflow_core/python/ops/parsing_ops.py:1019 parse_single_example
        serialized, features, example_names, name
    /usr/local/lib/python3.6/dist-packages/tensorflow_core/python/ops/parsing_ops.py:1063 parse_single_example_v2_unoptimized
        return parse_single_example_v2(serialized, features, name)
    /usr/local/lib/python3.6/dist-packages/tensorflow_core/python/ops/parsing_ops.py:2093 parse_single_example_v2
        dense_defaults, dense_shapes, name)
    /usr/local/lib/python3.6/dist-packages/tensorflow_core/python/ops/parsing_ops.py:2210 _parse_single_example_v2_raw
        name=name)
    /usr/local/lib/python3.6/dist-packages/tensorflow_core/python/ops/gen_parsing_ops.py:1201 parse_single_example
        dense_shapes=dense_shapes, name=name)
    /usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/op_def_library.py:551 _apply_op_helper
        (prefix, dtypes.as_dtype(input_arg.type).name))

    TypeError: Input 'serialized' of 'ParseSingleExample' Op has type uint8 that does not match expected type of string.

Yet running the following returns no errors:

features = tf.parse_single_example(
        '/content/stylegan2/dataset/blows/blows-r07.tfrecords',
        features={
            "shape": tf.FixedLenFeature([3], tf.int64),
            "img": tf.FixedLenFeature([], tf.string),
        },
    )
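(For context: the standalone call above type-checks because its serialized argument is a plain Python string, which already has dtype tf.string, whereas the traceback says the training pipeline is handing ParseSingleExample a uint8 tensor. A minimal sketch of that dtype difference, assuming TF 1.x as in the traceback:)

import numpy as np
import tensorflow as tf

# A filename string converts to a scalar tf.string tensor, so the dtype check passes.
print(tf.convert_to_tensor('/content/stylegan2/dataset/blows/blows-r07.tfrecords').dtype)  # tf.string
# The training map instead receives uint8 image data, which ParseSingleExample rejects.
print(tf.convert_to_tensor(np.zeros(4, np.uint8)).dtype)  # tf.uint8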
skyflynil commented 4 years ago

What is your custom data format? JPEG images or array files?

cyrilzakka commented 4 years ago

@skyflynil (1024, 1024, 3) JPEG images in a folder

skyflynil commented 4 years ago

I assume you created the tfrecords using !python dataset_tool.py create_from_images_raw --res_log2=8 ./dataset/dataset_name untared_raw_image_dir and added the --min-h=4 --min-w=4 --res-log2=8 parameters for run_training.py?

cyrilzakka commented 4 years ago

I didn't add the extra arguments; I left them at their defaults: !python dataset_tool.py create_from_images_raw dataset/blows blows, since I also used the defaults for !python run_training.py. Should I add them and try again?

skyflynil commented 4 years ago

The default for run_training.py is to train on 512*512 images (min-h=4, min-w=4, res-log2=7), but the error does not seem to be related to that. Usually a shape mismatch shows up when feeding images during actual training, whereas the error you got complains about the tfrecord itself, which I don't quite understand. And you are using TensorFlow 1.15.0, right?
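(For reference, a quick sketch of the resolution arithmetic implied by these flags, assuming the output size is the minimum dimension times 2**res_log2 as the numbers above suggest; resolution() is just a hypothetical helper:)

# Assumed relationship between the flags and the trained resolution.
def resolution(min_dim, res_log2):
    return min_dim * 2 ** res_log2

print(resolution(4, 7))  # 512  -> the run_training.py defaults (min-h=4, min-w=4, res-log2=7)
print(resolution(4, 8))  # 1024 -> matches the (1024, 1024, 3) images and --res-log2=8 mentioned above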

cyrilzakka commented 4 years ago

Yes 😔 I've also noticed another user with the issue in a fork of your repo: https://github.com/pbaylies/stylegan2/issues/2#issuecomment-586640167

skyflynil commented 4 years ago

> Yes 😔 I've also noticed another user with the issue in a fork of your repo: pbaylies#2 (comment)

It seems to me the issue there was that the record was created using "create_from_images_images", while training by default uses the decoded format (probably pbaylies changed the default behavior).

The fix mentioned there is basically to match the reading side to how the records were created.
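(For illustration, a rough sketch of the two read paths being contrasted here; this is not the exact dataset.py code. The 'shape'/'img' feature names follow the snippet earlier in the thread, while 'data' and the decode_raw path are assumed from the original NVIDIA record format:)

import tensorflow as tf

# Raw-style records (create_from_images_raw): the encoded JPEG/PNG bytes are stored
# as a string feature and decoded on the fly during training.
def parse_raw(record):
    feats = tf.parse_single_example(record, features={
        'shape': tf.FixedLenFeature([3], tf.int64),
        'img': tf.FixedLenFeature([], tf.string)})
    return tf.image.decode_image(feats['img'])

# Decoded-style records (create_from_images): the image was already decoded to a
# uint8 array at creation time, so reading is just decode_raw + reshape.
def parse_decoded(record):
    feats = tf.parse_single_example(record, features={
        'shape': tf.FixedLenFeature([3], tf.int64),
        'data': tf.FixedLenFeature([], tf.string)})
    return tf.reshape(tf.decode_raw(feats['data'], tf.uint8), feats['shape'])

# "Matching the reading part to the creation stage" means using the parser that
# corresponds to whichever tool wrote the records.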

cyrilzakka commented 4 years ago

There's a second part of that issue, mentioned by user @pender, which seems to have the exact same stack trace as mine.

Also, what is the difference between create_from_images_raw and create_from_images? If create_from_images_raw also expects a directory of images, what are the benefits of using create_from_images_raw? I get that it reads the images as bytes, but should I use create_from_images to see if that helps with the problem?

skyflynil commented 4 years ago

create_from_images_raw puts the JPEG/PNG files directly into the tfrecord without decoding, while create_from_images first decodes each image into a numpy array and then puts that into the tfrecord. The tradeoff is that create_from_images_raw reduces the record size, but during training you pay the penalty of decoding the images again and again. For my repo, the default behavior is to train from tfrecords made with create_from_images_raw, while pbaylies's presumably assumes create_from_images.
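(To make the tradeoff concrete, a hypothetical, simplified sketch of what each tool stores per image; raw_example and decoded_example are made-up names, and the real dataset_tool.py also handles multiple resolution shards and channel ordering:)

import numpy as np
import PIL.Image
import tensorflow as tf

def raw_example(path):
    # create_from_images_raw style: keep the compressed JPEG/PNG bytes -> smaller
    # records, but training pays the decode cost for every image on every pass.
    shape = np.asarray(PIL.Image.open(path)).shape
    data = open(path, 'rb').read()
    return tf.train.Example(features=tf.train.Features(feature={
        'shape': tf.train.Feature(int64_list=tf.train.Int64List(value=shape)),
        'img': tf.train.Feature(bytes_list=tf.train.BytesList(value=[data]))}))

def decoded_example(path):
    # create_from_images style: store the decoded uint8 array -> much larger
    # records, but no per-step decode cost during training.
    img = np.asarray(PIL.Image.open(path))
    return tf.train.Example(features=tf.train.Features(feature={
        'shape': tf.train.Feature(int64_list=tf.train.Int64List(value=img.shape)),
        'data': tf.train.Feature(bytes_list=tf.train.BytesList(value=[img.tobytes()]))}))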

cyrilzakka commented 4 years ago

I've discovered that the error only happens when using create_from_images_raw and not create_from_images. Will investigate further after my exams on Monday.

Is there also a way of changing the output resolution of StyleGAN? I'm currently getting the following error when running run_training.py:

ValueError: Dimension 2 in both shapes must be equal, but are 1024 and 64. Shapes are [?,3,1024,1024] and [?,3,64,64].

I know you said the default image size is 512, but even after resizing all my images to 512*512, I still get the error with unequal dimensions of 512 and 64. What am I doing wrong? Is there any way of changing the GAN input/output resolution to 1024?