tensorflow / models

Models and examples built with TensorFlow

Why can't I do the extract feature (DELF)? #9698

Open intouch1233 opened 3 years ago

intouch1233 commented 3 years ago

I trained a model on my own data. These are the commands I used.

Training

python3 train.py --train_file_pattern=/home/sornnarong/workspace/share_drive_31/dataset/AiProducts-Challenge-master/tfrecord2/train --validation_file_pattern=/home/sornnarong/workspace/share_drive_31/dataset/AiProducts-Challenge-master/tfrecord2/validation --imagenet_checkpoint=resnet50_weights_tf_dim_ordering_tf_kernels_notop.h5 --dataset_version=ai_product --logdir=aiproducts_training3

Export model

python3 model/export_local_model.py --ckpt_path=aiproducts_training3/delf_weights --export_path=aiproducts_training3_model

2021-02-03 00:16:17.125601: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.11.0
2021-02-03 00:16:20.054145: I tensorflow/compiler/jit/xla_cpu_device.cc:41] Not creating XLA devices, tf_xla_enable_xla_devices not set
2021-02-03 00:16:20.056257: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcuda.so.1
2021-02-03 00:16:20.114068: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:941] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-02-03 00:16:20.114449: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1720] Found device 0 with properties:
pciBusID: 0000:01:00.0 name: GeForce RTX 2080 Ti computeCapability: 7.5
coreClock: 1.65GHz coreCount: 68 deviceMemorySize: 10.76GiB deviceMemoryBandwidth: 573.69GiB/s
2021-02-03 00:16:20.114467: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.11.0
2021-02-03 00:16:20.162971: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcublas.so.11
2021-02-03 00:16:20.163056: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcublasLt.so.11
2021-02-03 00:16:20.192167: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcufft.so.10
2021-02-03 00:16:20.213417: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcurand.so.10
2021-02-03 00:16:20.233074: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcusolver.so.10
2021-02-03 00:16:20.254751: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcusparse.so.11
2021-02-03 00:16:20.261000: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudnn.so.8
2021-02-03 00:16:20.261269: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:941] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-02-03 00:16:20.262565: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:941] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-02-03 00:16:20.263735: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1862] Adding visible gpu devices: 0
2021-02-03 00:16:20.265377: I tensorflow/compiler/jit/xla_gpu_device.cc:99] Not creating XLA devices, tf_xla_enable_xla_devices not set
2021-02-03 00:16:20.265754: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:941] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-02-03 00:16:20.267338: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1720] Found device 0 with properties:
pciBusID: 0000:01:00.0 name: GeForce RTX 2080 Ti computeCapability: 7.5
coreClock: 1.65GHz coreCount: 68 deviceMemorySize: 10.76GiB deviceMemoryBandwidth: 573.69GiB/s
2021-02-03 00:16:20.267415: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.11.0
2021-02-03 00:16:20.267505: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcublas.so.11
2021-02-03 00:16:20.267571: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcublasLt.so.11
2021-02-03 00:16:20.267631: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcufft.so.10
2021-02-03 00:16:20.267689: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcurand.so.10
2021-02-03 00:16:20.267752: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcusolver.so.10
2021-02-03 00:16:20.267809: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcusparse.so.11
2021-02-03 00:16:20.267870: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudnn.so.8
2021-02-03 00:16:20.268103: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:941] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-02-03 00:16:20.269825: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:941] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-02-03 00:16:20.271322: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1862] Adding visible gpu devices: 0
2021-02-03 00:16:20.271437: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.11.0
2021-02-03 00:16:20.705346: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1261] Device interconnect StreamExecutor with strength 1 edge matrix:
2021-02-03 00:16:20.705372: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1267] 0
2021-02-03 00:16:20.705378: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1280] 0: N
2021-02-03 00:16:20.705557: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:941] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-02-03 00:16:20.705967: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:941] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-02-03 00:16:20.706334: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:941] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-02-03 00:16:20.706678: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1406] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 9903 MB memory) -> physical GPU (device: 0, name: GeForce RTX 2080 Ti, pci bus id: 0000:01:00.0, compute capability: 7.5)
Checkpoint loaded from aiproducts_training3/delf_weights
2021-02-03 00:16:20.975485: W tensorflow/python/util/util.cc:348] Sets are not currently considered sequences, but this may change in the future, so consider avoiding using them.
WARNING:tensorflow:Skipping full serialization of Keras layer <delf.python.training.model.resnet50.ResNet50 object at 0x7f63fc5cbf10>, because it is not built.
W0203 00:16:24.967820 140069560780608 save_impl.py:78] Skipping full serialization of Keras layer <delf.python.training.model.resnet50.ResNet50 object at 0x7f63fc5cbf10>, because it is not built.
WARNING:tensorflow:Skipping full serialization of Keras layer <tensorflow.python.keras.layers.pooling.AveragePooling2D object at 0x7f63f0115d10>, because it is not built.
W0203 00:16:30.581585 140069560780608 save_impl.py:78] Skipping full serialization of Keras layer <tensorflow.python.keras.layers.pooling.AveragePooling2D object at 0x7f63f0115d10>, because it is not built.
W0203 00:16:38.962221 140069560780608 save.py:241] Found untraced functions such as conv1_layer_call_and_return_conditional_losses, conv1_layer_call_fn, conv1_layer_call_fn, conv1_layer_call_and_return_conditional_losses, conv1_layer_call_and_return_conditional_losses while saving (showing 5 of 5). These functions will not be directly callable after loading.
W0203 00:16:39.572054 140069560780608 save.py:241] Found untraced functions such as conv1_layer_call_and_return_conditional_losses, conv1_layer_call_fn, conv1_layer_call_fn, conv1_layer_call_and_return_conditional_losses, conv1_layer_call_and_return_conditional_losses while saving (showing 5 of 5). These functions will not be directly callable after loading.
INFO:tensorflow:Assets written to: aiproducts_training3_model/assets
I0203 00:16:42.081132 140069560780608 builder_impl.py:775] Assets written to: aiproducts_training3_model/assets
WARNING:tensorflow:Unresolved object in checkpoint: (root).desc_classification
W0203 00:16:42.494237 140069560780608 util.py:161] Unresolved object in checkpoint: (root).desc_classification
WARNING:tensorflow:Unresolved object in checkpoint: (root).attn_classification
W0203 00:16:42.494378 140069560780608 util.py:161] Unresolved object in checkpoint: (root).attn_classification
WARNING:tensorflow:Unresolved object in checkpoint: (root).desc_classification.kernel
W0203 00:16:42.494426 140069560780608 util.py:161] Unresolved object in checkpoint: (root).desc_classification.kernel
WARNING:tensorflow:Unresolved object in checkpoint: (root).desc_classification.bias
W0203 00:16:42.494526 140069560780608 util.py:161] Unresolved object in checkpoint: (root).desc_classification.bias
WARNING:tensorflow:Unresolved object in checkpoint: (root).attn_classification.kernel
W0203 00:16:42.494635 140069560780608 util.py:161] Unresolved object in checkpoint: (root).attn_classification.kernel
WARNING:tensorflow:Unresolved object in checkpoint: (root).attn_classification.bias
W0203 00:16:42.494672 140069560780608 util.py:161] Unresolved object in checkpoint: (root).attn_classification.bias
WARNING:tensorflow:A checkpoint was restored (e.g. tf.train.Checkpoint.restore or tf.keras.Model.load_weights) but not all checkpointed values were used. See above for specific issues. Use expect_partial() on the load status object, e.g. tf.train.Checkpoint.restore(...).expect_partial(), to silence these warnings, or use assert_consumed() to make the check explicit. See https://www.tensorflow.org/guide/checkpoint#loading_mechanics for details.
W0203 00:16:42.494736 140069560780608 util.py:169] A checkpoint was restored (e.g. tf.train.Checkpoint.restore or tf.keras.Model.load_weights) but not all checkpointed values were used. See above for specific issues. Use expect_partial() on the load status object, e.g. tf.train.Checkpoint.restore(...).expect_partial(), to silence these warnings, or use assert_consumed() to make the check explicit. See https://www.tensorflow.org/guide/checkpoint#loading_mechanics for details.

Extract Features

python3 ./extract_features.py --delf_config_path delf_config_example.pbtxt --list_images_path list_images.txt --output_dir ./aiproduct_features

Below are my feature-extraction error logs.

2021-02-03 00:26:41.075329: W tensorflow/stream_executor/platform/default/dso_loader.cc:60] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory
2021-02-03 00:26:41.075371: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.
Reading list of images...
done! Found 2 images
2021-02-03 00:26:42.994139: I tensorflow/compiler/jit/xla_cpu_device.cc:41] Not creating XLA devices, tf_xla_enable_xla_devices not set
2021-02-03 00:26:42.995276: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcuda.so.1
2021-02-03 00:26:43.000722: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:941] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-02-03 00:26:43.001348: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1720] Found device 0 with properties:
pciBusID: 0000:02:00.0 name: GRID V100D-16C computeCapability: 7.0
coreClock: 1.38GHz coreCount: 80 deviceMemorySize: 16.00GiB deviceMemoryBandwidth: 836.37GiB/s
2021-02-03 00:26:43.001508: W tensorflow/stream_executor/platform/default/dso_loader.cc:60] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory
2021-02-03 00:26:43.001614: W tensorflow/stream_executor/platform/default/dso_loader.cc:60] Could not load dynamic library 'libcublas.so.11'; dlerror: libcublas.so.11: cannot open shared object file: No such file or directory
2021-02-03 00:26:43.001705: W tensorflow/stream_executor/platform/default/dso_loader.cc:60] Could not load dynamic library 'libcublasLt.so.11'; dlerror: libcublasLt.so.11: cannot open shared object file: No such file or directory
2021-02-03 00:26:43.001798: W tensorflow/stream_executor/platform/default/dso_loader.cc:60] Could not load dynamic library 'libcufft.so.10'; dlerror: libcufft.so.10: cannot open shared object file: No such file or directory
2021-02-03 00:26:43.001895: W tensorflow/stream_executor/platform/default/dso_loader.cc:60] Could not load dynamic library 'libcurand.so.10'; dlerror: libcurand.so.10: cannot open shared object file: No such file or directory
2021-02-03 00:26:43.001979: W tensorflow/stream_executor/platform/default/dso_loader.cc:60] Could not load dynamic library 'libcusolver.so.10'; dlerror: libcusolver.so.10: cannot open shared object file: No such file or directory
2021-02-03 00:26:43.002063: W tensorflow/stream_executor/platform/default/dso_loader.cc:60] Could not load dynamic library 'libcusparse.so.11'; dlerror: libcusparse.so.11: cannot open shared object file: No such file or directory
2021-02-03 00:26:43.002264: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudnn.so.8
2021-02-03 00:26:43.002283: W tensorflow/core/common_runtime/gpu/gpu_device.cc:1757] Cannot dlopen some GPU libraries. Please make sure the missing libraries mentioned above are installed properly if you would like to use GPU. Follow the guide at https://www.tensorflow.org/install/gpu for how to download and setup the required libraries for your platform.
Skipping registering GPU devices...
2021-02-03 00:26:43.002536: I tensorflow/core/platform/cpu_feature_guard.cc:142] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX512F
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2021-02-03 00:26:43.002749: I tensorflow/compiler/jit/xla_gpu_device.cc:99] Not creating XLA devices, tf_xla_enable_xla_devices not set
2021-02-03 00:26:43.002789: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1261] Device interconnect StreamExecutor with strength 1 edge matrix:
2021-02-03 00:26:43.002836: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1267]
Starting to extract DELF features from images...
image shape -- (1000, 1000, 3)
2021-02-03 00:26:48.357299: I tensorflow/compiler/mlir/mlir_graph_optimization_pass.cc:116] None of the MLIR optimization passes are enabled (registered 2)
2021-02-03 00:26:48.434554: I tensorflow/core/platform/profile_utils/cpu_utils.cc:112] CPU Frequency: 2294605000 Hz
2021-02-03 00:26:49.147637: W tensorflow/core/framework/cpu_allocator_impl.cc:80] Allocation of 256000000 exceeds 10% of free system memory.
2021-02-03 00:26:49.350257: W tensorflow/core/framework/cpu_allocator_impl.cc:80] Allocation of 256000000 exceeds 10% of free system memory.
2021-02-03 00:26:49.798080: W tensorflow/core/framework/cpu_allocator_impl.cc:80] Allocation of 254977024 exceeds 10% of free system memory.
2021-02-03 00:26:49.890861: W tensorflow/core/framework/cpu_allocator_impl.cc:80] Allocation of 254977024 exceeds 10% of free system memory.
2021-02-03 00:26:49.990515: W tensorflow/core/framework/cpu_allocator_impl.cc:80] Allocation of 254977024 exceeds 10% of free system memory.
Traceback (most recent call last):
  File "./extract_features.py", line 146, in <module>
    app.run(main=main, argv=[sys.argv[0]] + unparsed)
  File "/home/mls/.local/lib/python3.6/site-packages/tensorflow/python/platform/app.py", line 40, in run
    _run(main=main, argv=argv, flags_parser=_parse_flags_tolerate_undef)
  File "/home/mls/.local/lib/python3.6/site-packages/absl/app.py", line 300, in run
    _run_main(main, args)
  File "/home/mls/.local/lib/python3.6/site-packages/absl/app.py", line 251, in _run_main
    sys.exit(main(argv))
  File "./extract_features.py", line 109, in main
    extracted_features = extractor_fn(im)
  File "/home/mls/workspace/API/models/research/delf/delf/python/examples/extractor.py", line 199, in ExtractorFn
    input_abs_thres=score_threshold_tensor)
  File "/home/mls/.local/lib/python3.6/site-packages/tensorflow/python/eager/function.py", line 1669, in __call__
    return self._call_impl(args, kwargs)
  File "/home/mls/.local/lib/python3.6/site-packages/tensorflow/python/eager/function.py", line 1679, in _call_impl
    cancellation_manager)
  File "/home/mls/.local/lib/python3.6/site-packages/tensorflow/python/eager/function.py", line 1762, in _call_with_structured_signature
    cancellation_manager=cancellation_manager)
  File "/home/mls/.local/lib/python3.6/site-packages/tensorflow/python/saved_model/load.py", line 116, in _call_flat
    cancellation_manager)
  File "/home/mls/.local/lib/python3.6/site-packages/tensorflow/python/eager/function.py", line 1919, in _call_flat
    ctx, args, cancellation_manager=cancellation_manager))
  File "/home/mls/.local/lib/python3.6/site-packages/tensorflow/python/eager/function.py", line 560, in call
    ctx=ctx)
  File "/home/mls/.local/lib/python3.6/site-packages/tensorflow/python/eager/execute.py", line 60, in quick_execute
    inputs, attrs, num_outputs)
tensorflow.python.framework.errors_impl.InvalidArgumentError: Input to reshape is a tensor with 15488 values, but the requested shape requires a multiple of 1024
  [[{{node StatefulPartitionedCall/while/body/_341/while/Reshape_2}}]] [Op:__inference_signature_wrapper_21391]

Function call stack: signature_wrapper
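A side note on the log above: the "Could not load dynamic library" warnings mean TensorFlow could not find the CUDA 11 libraries on the extraction machine, so it skipped GPU registration and ran on the CPU. That is a separate matter from the reshape error. A minimal sketch for checking which of those libraries the dynamic loader can find, assuming a Linux machine; the helper name can_load is hypothetical, and the library names are copied from the log:

```python
import ctypes

# Library names copied from the dso_loader lines in the log above.
CUDA_LIBS = [
    "libcudart.so.11.0",
    "libcublas.so.11",
    "libcublasLt.so.11",
    "libcufft.so.10",
    "libcurand.so.10",
    "libcusolver.so.10",
    "libcusparse.so.11",
    "libcudnn.so.8",
]


def can_load(name: str) -> bool:
    """Return True if the dynamic loader can open the named shared library."""
    try:
        ctypes.CDLL(name)
    except OSError:
        # Raised when the library is not on the loader's search path.
        return False
    return True


# Map each library name to whether it could be opened in this process.
status = {name: can_load(name) for name in CUDA_LIBS}
for name, ok in status.items():
    print(f"{name}: {'found' if ok else 'MISSING'}")
```

Libraries reported as missing usually need to be installed or added to LD_LIBRARY_PATH before TensorFlow can use the GPU.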

Below is my DELF config.

use_local_features: true
use_global_features: false
model_path: "parameters/aiproducts_training3_model/"
image_scales: .25
image_scales: .3536
image_scales: .5
image_scales: .7071
image_scales: 1.0
image_scales: 1.4142
image_scales: 2.0
is_tf2_exported: true
delf_local_config {
  use_pca: false
  max_feature_num: 1000
  score_threshold: 100.0
}
max_image_size: 1024

Thank you very much.

andrefaraujo commented 3 years ago

Thanks for the detailed notes, very helpful for debugging. I believe I found the issue.

Quick hack to make it work: edit the feature_depth here to 128 (or whatever dimension you used for the autoencoder) instead of 1024. Then re-export the model, and extraction should work.
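To see why 128 fits where 1024 does not, the number from the error message can be checked directly. A minimal sketch in plain Python; the helper name rows_for_depth is hypothetical and only illustrates the divisibility requirement of a reshape to (-1, feature_depth):

```python
# Number of values reported by the InvalidArgumentError in the traceback.
num_values = 15488


def rows_for_depth(num_values: int, feature_depth: int):
    """Return how many descriptors of size feature_depth fit exactly, or None."""
    rows, remainder = divmod(num_values, feature_depth)
    return rows if remainder == 0 else None


rows_1024 = rows_for_depth(num_values, 1024)  # None: 15488 is not a multiple of 1024
rows_128 = rows_for_depth(num_values, 128)    # 121: 15488 == 121 * 128

print("feature_depth=1024 ->", rows_1024)
print("feature_depth=128  ->", rows_128)
```

The tensor splits cleanly into 121 descriptors of 128 dimensions, which is consistent with a model trained with a 128-dimensional autoencoder and exported with a reshape that still expects 1024.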

A few options to fix this bug:

1) Make export_model_utils.ExtractLocalFeatures accept a new argument feature_depth=1024, which can be set here. This would replace the hard-coded feature_depth mentioned above.

2) Rewrite export_model_utils.ExtractLocalFeatures to make it more similar to ExtractLocalAndGlobalFeatures, which uses autograph and does not actually require the feature_depth to be set. Basically, this would correspond to this TODO.
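The two options can be illustrated in isolation. This is a sketch in plain Python, not the DELF code: reshape_descriptors and flatten_feature_map are hypothetical helpers, and nested lists stand in for tensors. The first takes the depth as an argument with a default of 1024 (option 1); the second reads the depth from the feature map's own shape, so nothing has to be configured (the idea behind option 2).

```python
from typing import List, Sequence


def reshape_descriptors(flat: Sequence[float],
                        feature_depth: int = 1024) -> List[List[float]]:
    """Option 1 (hypothetical helper): the depth is a parameter, not a constant."""
    if feature_depth <= 0:
        raise ValueError("feature_depth must be positive")
    if len(flat) % feature_depth != 0:
        raise ValueError(
            f"{len(flat)} values is not a multiple of feature_depth={feature_depth}")
    # Cut the flat sequence into consecutive rows of feature_depth values.
    return [list(flat[i:i + feature_depth])
            for i in range(0, len(flat), feature_depth)]


def flatten_feature_map(
        feature_map: Sequence[Sequence[Sequence[float]]]) -> List[List[float]]:
    """Option 2 (hypothetical helper): depth comes from the H x W x C map itself."""
    # Each spatial location contributes one descriptor of length C.
    return [list(channel_vector) for row in feature_map for channel_vector in row]


flat = [0.0] * 15488
descriptors = reshape_descriptors(flat, feature_depth=128)
print(len(descriptors), len(descriptors[0]))

feature_map = [[[0.0] * 4 for _ in range(3)] for _ in range(2)]  # toy 2 x 3 x 4 map
print(len(flatten_feature_map(feature_map)))
```

With the default depth, reshape_descriptors(flat) raises a ValueError, mirroring the failure in the traceback; passing 128 or inferring the depth from the shape avoids it.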

I think (2) is probably the best way to go in the long term. @dan-anghel, would you be interested in tackling (2)?

dan-anghel commented 3 years ago

Hi @andrefaraujo ! Sure, I can take a look at it and make the changes.

Richard-M-chen commented 3 years ago

@intouch1233 Could you share the specific code or script you used to train on your dataset?