How can I train and test my image dataset that is not a face?

logicmixtape commented 4 years ago

Thank you for the wonderful post! I have a question.
I want to detect anomalies in pens using ArcFace.
I already have the image data.

peteryuX commented 4 years ago

Hi @reeen115, you can organize your own dataset with the structure below (the same layout as the original training dataset, MS-Celeb-1M).

/your/path/to/dataset/
    -> 0
        -> image_1.jpg
        -> image_2.jpg
        -> ...
    -> 1
        -> ...
    -> 2
        -> ...
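
A quick way to check that a folder follows this layout before converting it is sketched below. The helper name `summarize_dataset` is hypothetical and not part of this repository; it only assumes one sub-folder per class, each named with an integer ID.

```python
import os


def summarize_dataset(root):
    """Return {class_folder_name: number_of_files} for a dataset root.

    Raises ValueError if a class folder name is not an integer,
    because the layout above uses integer IDs (0, 1, 2, ...) as folder names.
    """
    counts = {}
    for name in sorted(os.listdir(root)):
        class_dir = os.path.join(root, name)
        if not os.path.isdir(class_dir):
            continue  # ignore stray files next to the class folders
        if not name.isdigit():
            raise ValueError(f"class folder {name!r} is not an integer ID")
        counts[name] = sum(
            1 for f in os.listdir(class_dir)
            if os.path.isfile(os.path.join(class_dir, f))
        )
    return counts
```

Calling `summarize_dataset("/your/path/to/dataset")` gives the number of classes as `len(counts)` and the total number of samples as `sum(counts.values())`.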

However, I think ArcFace loss might not be well suited to your task (pen anomaly detection), because anomaly detection is usually applied to unlabeled data, whereas ArcFace needs class labels. Papers that focus specifically on anomaly detection may be more helpful for you.

logicmixtape commented 4 years ago

Thanks for your advice. I edited the ./configs/*.yaml files:

sub_name = arc_res50_pen
train_dataset = ./data/train_pen
num_classes = 2 (OK, NG)
num_samples = 1000 (total of OK images and NG images)
test_dataset = ./data/test_pen

Train:
python train.py --mode 'eager_tf' --cfg_path "./configs/arc_res50_pen.yaml"

Test:
python test.py --cfg_path "./configs/arc_res50_pen.yaml"

Is this all I should do?

peteryuX commented 4 years ago

@reeen115 The training part looks okay, but take care to tune your hyperparameters. (BTW, the number of samples is really small, so training might not be effective.)

For the testing part, you additionally need to modify lines 50~70 in test.py to fit your needs. The original testing datasets contain samples structured as (img1, img2, is_same). For your case, you would probably compute the distance between the embedding vectors of the OK and NG samples (see the related code in ./modules/evaluations.py), which helps you understand the model's performance.
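
To make the pair-distance idea concrete, here is a minimal NumPy sketch. It is not the code from ./modules/evaluations.py; the function names, the squared Euclidean distance on L2-normalized embeddings, and the threshold grid are all assumptions chosen for illustration.

```python
import numpy as np


def l2_normalize(x, eps=1e-10):
    """Scale each row of x to unit Euclidean length."""
    return x / np.maximum(np.linalg.norm(x, axis=1, keepdims=True), eps)


def pair_distances(emb1, emb2):
    """Squared Euclidean distance between row-wise pairs of normalized embeddings."""
    diff = l2_normalize(emb1) - l2_normalize(emb2)
    return np.sum(diff ** 2, axis=1)


def best_threshold(dists, is_same, thresholds=None):
    """Pick the threshold that maximizes accuracy of 'dist < threshold means same'."""
    if thresholds is None:
        # Squared distances between unit vectors lie in [0, 4].
        thresholds = np.arange(0.0, 4.0, 0.01)
    is_same = np.asarray(is_same, dtype=bool)
    accs = [np.mean((dists < t) == is_same) for t in thresholds]
    best = int(np.argmax(accs))
    return float(thresholds[best]), float(accs[best])
```

Here `emb1` and `emb2` would hold the embeddings of the first and second image of each pair, and `is_same` marks pairs from the same class (for example OK/OK) versus different classes (OK/NG). If OK and NG embeddings separate well, a single threshold should reach high accuracy.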

logicmixtape commented 4 years ago

I ran python train.py --cfg_path "./configs/arc_res50_pen.yaml"

and got this output:

2019-11-13 16:41:36.668726: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cudart64_100.dll
2019-11-13 16:41:38.464480: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library nvcuda.dll
2019-11-13 16:41:38.501135: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1618] Found device 0 with properties:
name: GeForce RTX 2060 major: 7 minor: 5 memoryClockRate(GHz): 1.71
pciBusID: 0000:01:00.0
2019-11-13 16:41:38.507440: I tensorflow/stream_executor/platform/default/dlopen_checker_stub.cc:25] GPU libraries are statically linked, skip dlopen check.
2019-11-13 16:41:38.511371: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1746] Adding visible gpu devices: 0
2019-11-13 16:41:38.514595: I tensorflow/core/platform/cpu_feature_guard.cc:142] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX AVX2
2019-11-13 16:41:38.519785: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1618] Found device 0 with properties:
name: GeForce RTX 2060 major: 7 minor: 5 memoryClockRate(GHz): 1.71
pciBusID: 0000:01:00.0
2019-11-13 16:41:38.526468: I tensorflow/stream_executor/platform/default/dlopen_checker_stub.cc:25] GPU libraries are statically linked, skip dlopen check.
2019-11-13 16:41:38.532074: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1746] Adding visible gpu devices: 0
2019-11-13 16:41:39.134768: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1159] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-11-13 16:41:39.139410: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1165]      0
2019-11-13 16:41:39.141551: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1178] 0:   N
2019-11-13 16:41:39.144632: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1304] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 4608 MB memory) -> physical GPU (device: 0, name: GeForce RTX 2060, pci bus id: 0000:01:00.0, compute capability: 7.5)
Model: "arcface_model"
________________________________________________________________________________
Layer (type)              Output Shape      Param #  Connected to
================================================================================
input_image (InputLayer)  [(None, 112, 112, 0
________________________________________________________________________________
resnet50 (Model)          (None, 4, 4, 2048 23587712 input_image[0][0]
________________________________________________________________________________
OutputLayer (Model)       (None, 512)       16787968 resnet50[1][0]
________________________________________________________________________________
label (InputLayer)        [(None,)]         0
________________________________________________________________________________
ArcHead (Model)           (None, 2)         1024     OutputLayer[1][0]
                                                     label[0][0]
================================================================================
Total params: 40,376,704
Trainable params: 40,318,464
Non-trainable params: 58,240
________________________________________________________________________________
I1113 16:41:44.670018  6156 train.py:42] load my dataset.
2019-11-13 16:49:02.024762: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cudart64_100.dll
2019-11-13 16:49:03.756048: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library nvcuda.dll
2019-11-13 16:49:03.786320: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1618] Found device 0 with properties:
name: GeForce RTX 2060 major: 7 minor: 5 memoryClockRate(GHz): 1.71
pciBusID: 0000:01:00.0
2019-11-13 16:49:03.793860: I tensorflow/stream_executor/platform/default/dlopen_checker_stub.cc:25] GPU libraries are statically linked, skip dlopen check.
2019-11-13 16:49:03.798482: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1746] Adding visible gpu devices: 0
2019-11-13 16:49:03.801963: I tensorflow/core/platform/cpu_feature_guard.cc:142] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX AVX2
2019-11-13 16:49:03.808455: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1618] Found device 0 with properties:
name: GeForce RTX 2060 major: 7 minor: 5 memoryClockRate(GHz): 1.71
pciBusID: 0000:01:00.0
2019-11-13 16:49:03.814553: I tensorflow/stream_executor/platform/default/dlopen_checker_stub.cc:25] GPU libraries are statically linked, skip dlopen check.
2019-11-13 16:49:03.819148: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1746] Adding visible gpu devices: 0
2019-11-13 16:49:04.410137: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1159] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-11-13 16:49:04.414202: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1165]      0
2019-11-13 16:49:04.416258: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1178] 0:   N
2019-11-13 16:49:04.419851: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1304] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 4608 MB memory) -> physical GPU (device: 0, name: GeForce RTX 2060, pci bus id: 0000:01:00.0, compute capability: 7.5)
Model: "arcface_model"
________________________________________________________________________________
Layer (type)              Output Shape      Param #  Connected to
================================================================================
input_image (InputLayer)  [(None, 112, 112, 0
________________________________________________________________________________
resnet50 (Model)          (None, 4, 4, 2048 23587712 input_image[0][0]
________________________________________________________________________________
OutputLayer (Model)       (None, 512)       16787968 resnet50[1][0]
________________________________________________________________________________
label (InputLayer)        [(None,)]         0
________________________________________________________________________________
ArcHead (Model)           (None, 2)         1024     OutputLayer[1][0]
                                                     label[0][0]
================================================================================
Total params: 40,376,704
Trainable params: 40,318,464
Non-trainable params: 58,240
________________________________________________________________________________
I1113 16:49:09.734443 18968 train.py:42] load ms1m dataset.
[*] training from scratch.
Train for 59 steps
Epoch 1/5
2019-11-13 16:49:22.018330: W tensorflow/core/common_runtime/base_collective_executor.cc:216] BaseCollectiveExecutor::StartAbort Unknown: NewRandomAccessFile failed to Create/Open: ./data/train_data : Access denied.
; Input/output error
         [[{{node IteratorGetNext}}]]
2019-11-13 16:49:22.549401: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cublas64_100.dll
2019-11-13 16:49:24.644105: W tensorflow/core/common_runtime/base_collective_executor.cc:216] BaseCollectiveExecutor::StartAbort Unknown: NewRandomAccessFile failed to Create/Open: ./data/train_data :Access denied.
; Input/output error
         [[{{node IteratorGetNext}}]]
         [[IteratorGetNext/_8]]
 1/59 [..............................] - ETA: 11:22Traceback (most recent call last):
  File "train.py", line 136, in <module>
    app.run(main)
  File "C:\Users\user\Anaconda3\envs\arcface-tf2\lib\site-packages\absl\app.py", line 299, in run
    _run_main(main, args)
  File "C:\Users\user\Anaconda3\envs\arcface-tf2\lib\site-packages\absl\app.py", line 250, in _run_main
    sys.exit(main(argv))
  File "train.py", line 132, in main
    initial_epoch=epochs - 1)
  File "C:\Users\user\Anaconda3\envs\arcface-tf2\lib\site-packages\tensorflow_core\python\keras\engine\training.py", line 728, in fit
    use_multiprocessing=use_multiprocessing)
  File "C:\Users\user\Anaconda3\envs\arcface-tf2\lib\site-packages\tensorflow_core\python\keras\engine\training_v2.py", line 324, in fit
    total_epochs=epochs)
  File "C:\Users\user\Anaconda3\envs\arcface-tf2\lib\site-packages\tensorflow_core\python\keras\engine\training_v2.py", line 123, in run_one_epoch
    batch_outs = execution_function(iterator)
  File "C:\Users\user\Anaconda3\envs\arcface-tf2\lib\site-packages\tensorflow_core\python\keras\engine\training_v2_utils.py", line 86, in execution_function
    distributed_function(input_fn))
  File "C:\Users\user\Anaconda3\envs\arcface-tf2\lib\site-packages\tensorflow_core\python\eager\def_function.py", line 457, in __call__
    result = self._call(*args, **kwds)
  File "C:\Users\user\Anaconda3\envs\arcface-tf2\lib\site-packages\tensorflow_core\python\eager\def_function.py", line 520, in _call
    return self._stateless_fn(*args, **kwds)
  File "C:\Users\user\Anaconda3\envs\arcface-tf2\lib\site-packages\tensorflow_core\python\eager\function.py", line 1823, in __call__
    return graph_function._filtered_call(args, kwargs)  # pylint: disable=protected-access
  File "C:\Users\user\Anaconda3\envs\arcface-tf2\lib\site-packages\tensorflow_core\python\eager\function.py", line 1141, in _filtered_call
    self.captured_inputs)
  File "C:\Users\user\Anaconda3\envs\arcface-tf2\lib\site-packages\tensorflow_core\python\eager\function.py", line 1224, in _call_flat
    ctx, args, cancellation_manager=cancellation_manager)
  File "C:\Users\user\Anaconda3\envs\arcface-tf2\lib\site-packages\tensorflow_core\python\eager\function.py", line 511, in call
    ctx=ctx)
  File "C:\Users\user\Anaconda3\envs\arcface-tf2\lib\site-packages\tensorflow_core\python\eager\execute.py", line 61, in quick_execute
    num_outputs)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x83 in position 120: invalid start byte


peteryuX commented 4 years ago

Did you run the conversion script to turn your data into a tfrecord file for training, as in the original implementation in this repository?

# Binary Image: conversion is really slow, but loading is faster during training.
python data/convert_train_binary_tfrecord.py --dataset_path "/path/to/ms1m_align_112/imgs" --output_path "./data/ms1m_bin.tfrecord"

# Online Decoding: conversion is really fast, but loading is slower during training.
python data/convert_train_tfrecord.py --dataset_path "/path/to/ms1m_align_112/imgs" --output_path "./data/ms1m.tfrecord"
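
The "Access denied" message in the log above names ./data/train_data, which appears to be the image folder rather than a converted file. As a small pre-flight check, one could verify that the configured training path is a regular, non-empty file before starting training. This is only a sketch; the helper name `check_tfrecord_path` is hypothetical and not part of the repository.

```python
import os


def check_tfrecord_path(path):
    """Return a short diagnosis for a configured training-data path."""
    if os.path.isdir(path):
        return "directory: convert it to a tfrecord file first"
    if not os.path.isfile(path):
        return "missing: no file at this path"
    if os.path.getsize(path) == 0:
        return "empty: the conversion may have failed"
    return "ok"
```

For example, `check_tfrecord_path("./data/train_data")` would report a directory if that path is still the raw image folder.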

logicmixtape commented 4 years ago

I forgot to mention that I ran python data/convert_train_binary_tfrecord.py --dataset_path "./data/train_data" --output_path "./data/pen.tfrecord" and got the following:

2019-11-13 17:49:17.610647: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cudart64_100.dll
I1113 17:49:19.314568 10432 convert_train_binary_tfrecord.py:48] Loading ./data/train_data
I1113 17:49:19.315593 10432 convert_train_binary_tfrecord.py:51] Reading data list...
100%|████████████████████████████████████████████████████████████████████████| 2/2 [00:00<00:00, 167.52it/s]
I1113 17:49:19.333978 10432 convert_train_binary_tfrecord.py:59] Writing tfrecord file...
  0%|                                                                               | 0/950 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "data/convert_train_binary_tfrecord.py", line 70, in <module>
    app.run(main)
  File "C:\Users\user\Anaconda3\envs\arcface-tf2\lib\site-packages\absl\app.py", line 299, in run
    _run_main(main, args)
  File "C:\Users\user\Anaconda3\envs\arcface-tf2\lib\site-packages\absl\app.py", line 250, in _run_main
    sys.exit(main(argv))
  File "data/convert_train_binary_tfrecord.py", line 63, in main
    source_id=int(id_name),
ValueError: invalid literal for int() with base 10: 'OK'

peteryuX commented 4 years ago

Rename the class folders from

/your/path/to/dataset/
    -> OK
        -> image_1.jpg
        -> image_2.jpg
        -> ...
    -> NG
        -> ...

to

/your/path/to/dataset/
    -> 0
        -> image_1.jpg
        -> image_2.jpg
        -> ...
    -> 1
        -> ...

This bug is an int() conversion error: the script calls int() on each class folder name, so a name like 'OK' fails. You can find more details by searching for the error message.
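
The renaming can also be scripted. The sketch below uses a hypothetical helper, `rename_class_folders`, which is not part of the repository. Which label receives which integer is a free choice; the example mapping (OK to 0, NG to 1) simply follows the order shown above.

```python
import os


def rename_class_folders(root, mapping):
    """Rename class folders under root, e.g. {"OK": "0", "NG": "1"}.

    Refuses to overwrite an existing folder and returns the list of
    (old_name, new_name) pairs that were actually renamed.
    """
    renamed = []
    for old, new in mapping.items():
        src = os.path.join(root, old)
        dst = os.path.join(root, new)
        if not os.path.isdir(src):
            continue  # nothing to rename for this label
        if os.path.exists(dst):
            raise FileExistsError(f"target folder already exists: {dst}")
        os.rename(src, dst)
        renamed.append((old, new))
    return renamed
```

A call such as `rename_class_folders("/your/path/to/dataset", {"OK": "0", "NG": "1"})` leaves the images untouched and changes only the folder names, so int() on each folder name succeeds afterwards.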

logicmixtape commented 4 years ago

After various trials, I always end up with the error below whenever I try to train the model, and I cannot get past it. Do you know of any solutions? I would also like to know the details of your execution environment.

`2019-11-14 19:00:40.234637: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cudart64_100.dll
2019-11-14 19:00:42.276039: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library nvcuda.dll
2019-11-14 19:00:42.305700: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1618] Found device 0 with properties:
name: GeForce RTX 2060 major: 7 minor: 5 memoryClockRate(GHz): 1.71
pciBusID: 0000:01:00.0
2019-11-14 19:00:42.312380: I tensorflow/stream_executor/platform/default/dlopen_checker_stub.cc:25] GPU libraries are statically linked, skip dlopen check.
2019-11-14 19:00:42.317878: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1746] Adding visible gpu devices: 0
2019-11-14 19:00:42.320481: I tensorflow/core/platform/cpu_feature_guard.cc:142] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2
2019-11-14 19:00:42.326245: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1618] Found device 0 with properties:
name: GeForce RTX 2060 major: 7 minor: 5 memoryClockRate(GHz): 1.71
pciBusID: 0000:01:00.0
2019-11-14 19:00:42.331967: I tensorflow/stream_executor/platform/default/dlopen_checker_stub.cc:25] GPU libraries are statically linked, skip dlopen check.
2019-11-14 19:00:42.337288: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1746] Adding visible gpu devices: 0
2019-11-14 19:00:42.927794: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1159] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-11-14 19:00:42.931382: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1165]      0
2019-11-14 19:00:42.933716: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1178] 0:   N
2019-11-14 19:00:42.936699: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1304] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 4606 MB memory) -> physical GPU (device: 0, name: GeForce RTX 2060, pci bus id: 0000:01:00.0, compute capability: 7.5)
Model: "arcface_model"
________________________________________________________________________________
Layer (type)              Output Shape      Param #  Connected to
================================================================================
input_image (InputLayer)  [(None, 112, 112, 0
________________________________________________________________________________
resnet50 (Model)          (None, 4, 4, 2048 23587712 input_image[0][0]
________________________________________________________________________________
OutputLayer (Model)       (None, 512)       16787968 resnet50[1][0]
________________________________________________________________________________
label (InputLayer)        [(None,)]         0
________________________________________________________________________________
ArcHead (Model)           (None, 2)         1024     OutputLayer[1][0]
                                                     label[0][0]
================================================================================
Total params: 40,376,704
Trainable params: 40,318,464
Non-trainable params: 58,240
________________________________________________________________________________
I1114 19:00:47.251486 12776 train.py:42] load ms1m dataset.
[*] training from scratch.
2019-11-14 19:00:47.767252: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cudnn64_7.dll
2019-11-14 19:00:49.075699: E tensorflow/stream_executor/cuda/cuda_dnn.cc:329] Could not create cudnn handle: CUDNN_STATUS_ALLOC_FAILED
2019-11-14 19:00:49.079945: E tensorflow/stream_executor/cuda/cuda_dnn.cc:329] Could not create cudnn handle: CUDNN_STATUS_ALLOC_FAILED
Traceback (most recent call last):
  File "train.py", line 136, in <module>
    app.run(main)
  File "C:\Users\user\Anaconda3\envs\ac-tf2\lib\site-packages\absl\app.py", line 299, in run
    _run_main(main, args)
  File "C:\Users\user\Anaconda3\envs\ac-tf2\lib\site-packages\absl\app.py", line 250, in _run_main
    sys.exit(main(argv))
  File "train.py", line 78, in main
    logist = model(inputs, training=True)
  File "C:\Users\user\Anaconda3\envs\ac-tf2\lib\site-packages\tensorflow_core\python\keras\engine\base_layer.py", line 891, in __call__
    outputs = self.call(cast_inputs, *args, **kwargs)
  File "C:\Users\user\Anaconda3\envs\ac-tf2\lib\site-packages\tensorflow_core\python\keras\engine\network.py", line 708, in call
    convert_kwargs_to_constants=base_layer_utils.call_context().saving)
  File "C:\Users\user\Anaconda3\envs\ac-tf2\lib\site-packages\tensorflow_core\python\keras\engine\network.py", line 860, in _run_internal_graph
    output_tensors = layer(computed_tensors, **kwargs)
  File "C:\Users\user\Anaconda3\envs\ac-tf2\lib\site-packages\tensorflow_core\python\keras\engine\base_layer.py", line 891, in __call__
    outputs = self.call(cast_inputs, *args, **kwargs)
  File "C:\Users\user\Anaconda3\envs\ac-tf2\lib\site-packages\tensorflow_core\python\keras\engine\network.py", line 708, in call
    convert_kwargs_to_constants=base_layer_utils.call_context().saving)
  File "C:\Users\user\Anaconda3\envs\ac-tf2\lib\site-packages\tensorflow_core\python\keras\engine\network.py", line 860, in _run_internal_graph
    output_tensors = layer(computed_tensors, **kwargs)
  File "C:\Users\user\Anaconda3\envs\ac-tf2\lib\site-packages\tensorflow_core\python\keras\engine\base_layer.py", line 891, in __call__
    outputs = self.call(cast_inputs, *args, **kwargs)
  File "C:\Users\user\Anaconda3\envs\ac-tf2\lib\site-packages\tensorflow_core\python\keras\layers\convolutional.py", line 197, in call
    outputs = self._convolution_op(inputs, self.kernel)
  File "C:\Users\user\Anaconda3\envs\ac-tf2\lib\site-packages\tensorflow_core\python\ops\nn_ops.py", line 1134, in __call__
    return self.conv_op(inp, filter)
  File "C:\Users\user\Anaconda3\envs\ac-tf2\lib\site-packages\tensorflow_core\python\ops\nn_ops.py", line 639, in __call__
    return self.call(inp, filter)
  File "C:\Users\user\Anaconda3\envs\ac-tf2\lib\site-packages\tensorflow_core\python\ops\nn_ops.py", line 238, in __call__
    name=self.name)
  File "C:\Users\user\Anaconda3\envs\ac-tf2\lib\site-packages\tensorflow_core\python\ops\nn_ops.py", line 2010, in conv2d
    name=name)
  File "C:\Users\user\Anaconda3\envs\ac-tf2\lib\site-packages\tensorflow_core\python\ops\gen_nn_ops.py", line 1031, in conv2d
    data_format=data_format, dilations=dilations, name=name, ctx=_ctx)
  File "C:\Users\user\Anaconda3\envs\ac-tf2\lib\site-packages\tensorflow_core\python\ops\gen_nn_ops.py", line 1130, in conv2d_eager_fallback
    ctx=_ctx, name=name)
  File "C:\Users\user\Anaconda3\envs\ac-tf2\lib\site-packages\tensorflow_core\python\eager\execute.py", line 67, in quick_execute
    six.raise_from(core._status_to_exception(e.code, message), None)
  File "<string>", line 3, in raise_from
tensorflow.python.framework.errors_impl.UnknownError: Failed to get convolution algorithm. This is probably because cuDNN failed to initialize, so try looking to see if a warning log message was printed above. [Op:Conv2D]`

peteryuX commented 4 years ago

It looks like a cuDNN version incompatibility problem.

Take a look at this solution; I hope it solves your problem. https://github.com/tensorflow/tensorflow/issues/24828#issuecomment-457425190

My environment:

nvidia driver 436.48
CUDA 10.0
cudnn 7.6.3
Tensorflow-gpu 2.0.0
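
CUDNN_STATUS_ALLOC_FAILED is often reported when TensorFlow cannot reserve GPU memory for cuDNN. One commonly suggested workaround in TF 2.x is to enable GPU memory growth. That this is the cause here is an assumption, and it is not necessarily what the linked comment proposes. A minimal sketch, with a hypothetical helper name:

```python
def enable_gpu_memory_growth():
    """Ask TensorFlow to allocate GPU memory on demand instead of all at once.

    Must be called before any GPU has been initialized (that is, before
    building the model). Returns the number of GPUs configured.
    """
    # Imported lazily so the helper can be defined without TensorFlow installed.
    import tensorflow as tf

    gpus = tf.config.experimental.list_physical_devices("GPU")
    for gpu in gpus:
        tf.config.experimental.set_memory_growth(gpu, True)
    return len(gpus)
```

The call would go at the very start of the training script, before the model is created; calling it after the GPU is already in use raises an error in TensorFlow.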

peteryuX / arcface-tf2
