Training loss does not change in the first 700 iterations

DanielNehemiah commented 4 years ago

Hey,

I have started training the network using the code in this repo. In the first 700 iterations, the training accuracy has not gone above 0.65 and mostly hovers around 0.45-0.52. Is this normal? The loss is also changing only minutely, hovering around 5.1.

Thanks for this code!

tensorfreitas commented 4 years ago

Hi!

Let it run more iterations.

In the original code, validation is run every 1000 iterations. See whether the loss is still constant after 3000 to 5000 iterations.

Let me know how it goes.
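As a rough sanity check on the numbers reported above, here is a minimal sketch of what a chance-level predictor looks like. It assumes the training batches are balanced same/different pairs and that the loss is binary cross-entropy; neither is confirmed in this thread. Under those assumptions, an accuracy near 0.5 is what an untrained model would give, and the cross-entropy part of the loss would sit near ln 2.

```python
import math


def binary_cross_entropy(label, predicted_probability):
    """Per-sample binary cross-entropy for a label in {0, 1}."""
    p = predicted_probability
    return -(label * math.log(p) + (1 - label) * math.log(1 - p))


# A model that always outputs 0.5 is at chance on a balanced binary task.
# Its loss is the same for either label: -ln(0.5) = ln(2).
chance_loss = binary_cross_entropy(1, 0.5)
print(round(chance_loss, 3))
```

If these assumptions hold, a reported loss near 5.1 would have to include terms beyond the cross-entropy itself (for example weight regularization penalties), which could be why the value moves so little early on.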

aplusc98 commented 4 years ago

@DanielNehemiah There's a stopping condition in the training function, so just let it run. Validation is evaluated every 1000 iterations, and training ends on its own once the validation error stops improving from one evaluation to the next.
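The kind of stopping condition described here can be sketched as follows. This is not the repo's actual code: the function name, the `train_step` and `evaluate` callbacks, and the `patience` parameter are all hypothetical, and the default of 10000 is only an assumption suggested by reports in this thread.

```python
def train_with_early_stopping(train_step, evaluate, max_iterations,
                              evaluate_each=1000, patience=10000):
    """Train until validation accuracy has not improved for `patience` iterations.

    train_step(iteration) runs one training batch (hypothetical callback).
    evaluate(iteration) returns the validation accuracy (hypothetical callback).
    """
    best_accuracy = 0.0
    best_iteration = 0
    iteration = 0
    for iteration in range(1, max_iterations + 1):
        train_step(iteration)

        # Validate only every `evaluate_each` iterations.
        if iteration % evaluate_each == 0:
            accuracy = evaluate(iteration)
            if accuracy > best_accuracy:
                best_accuracy = accuracy
                best_iteration = iteration

        # Stop once too many iterations have passed without improvement.
        if iteration - best_iteration >= patience:
            break
    return iteration, best_accuracy, best_iteration
```

With a rule like this, a run whose validation accuracy never improves stops by itself after roughly `patience` iterations, so an early stop signals a stalled model rather than a crash.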

JacksonLaw577 commented 3 years ago

I've tried several times, and it always stops after 10000 iterations, which means the model does not converge.

tensorfreitas commented 3 years ago

@JacksonLaw577 Could you give more details?

JacksonLaw577 commented 3 years ago

When I ran your original code, the error in the screenshot occurred, so I changed self.lr to self.lr0. I don't know whether this is the cause. [Screenshot: 微信图片_20201124142956]

tensorfreitas commented 3 years ago

@JacksonLaw577 Are you running the exact same dataset? I developed this on an older version of Keras and TensorFlow, and I haven't been able to retest it on the newer versions.

DAVID-Hown commented 3 years ago

/home/user/anaconda3/envs/tensorflow-gpu/bin/python3.6 "/home/user/ShadowCreative/Siamese network/Siamese-Networks-for-One-Shot-Learning/train_siamese_network.py"
2021-02-20 17:56:39.594927: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.11.0
Using TensorFlow backend.
2021-02-20 17:56:40.525094: I tensorflow/compiler/jit/xla_cpu_device.cc:41] Not creating XLA devices, tf_xla_enable_xla_devices not set
2021-02-20 17:56:40.525605: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcuda.so.1
2021-02-20 17:56:40.553530: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:941] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-02-20 17:56:40.554140: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1720] Found device 0 with properties:
pciBusID: 0000:01:00.0 name: GeForce GTX 1650 computeCapability: 7.5
coreClock: 1.71GHz coreCount: 14 deviceMemorySize: 3.82GiB deviceMemoryBandwidth: 178.84GiB/s
2021-02-20 17:56:40.554175: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.11.0
2021-02-20 17:56:40.558338: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcublas.so.11
2021-02-20 17:56:40.558412: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcublasLt.so.11
2021-02-20 17:56:40.559689: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcufft.so.10
2021-02-20 17:56:40.559968: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcurand.so.10
2021-02-20 17:56:40.560085: W tensorflow/stream_executor/platform/default/dso_loader.cc:60] Could not load dynamic library 'libcusolver.so.10'; dlerror: libcusolver.so.10: cannot open shared object file: No such file or directory
2021-02-20 17:56:40.560968: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcusparse.so.11
2021-02-20 17:56:40.561058: W tensorflow/stream_executor/platform/default/dso_loader.cc:60] Could not load dynamic library 'libcudnn.so.8'; dlerror: libcudnn.so.8: cannot open shared object file: No such file or directory
2021-02-20 17:56:40.561073: W tensorflow/core/common_runtime/gpu/gpu_device.cc:1757] Cannot dlopen some GPU libraries. Please make sure the missing libraries mentioned above are installed properly if you would like to use GPU. Follow the guide at https://www.tensorflow.org/install/gpu for how to download and setup the required libraries for your platform.
Skipping registering GPU devices...
2021-02-20 17:56:40.561376: I tensorflow/core/platform/cpu_feature_guard.cc:142] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2021-02-20 17:56:40.562933: I tensorflow/compiler/jit/xla_gpu_device.cc:99] Not creating XLA devices, tf_xla_enable_xla_devices not set
2021-02-20 17:56:40.562966: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1261] Device interconnect StreamExecutor with strength 1 edge matrix:
2021-02-20 17:56:40.562976: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1267]
2021-02-20 17:56:40.606270: W tensorflow/core/framework/cpu_allocator_impl.cc:80] Allocation of 150994944 exceeds 10% of free system memory.
2021-02-20 17:56:40.623839: W tensorflow/core/framework/cpu_allocator_impl.cc:80] Allocation of 150994944 exceeds 10% of free system memory.
2021-02-20 17:56:40.641658: W tensorflow/core/framework/cpu_allocator_impl.cc:80] Allocation of 150994944 exceeds 10% of free system memory.
['Alphabet_of_theMagi', 'Tifinagh', 'Gujarati', 'Syriac(Estrangelo)', 'Futurama', 'EarlyAramaic', 'Latin', 'Japanese(hiragana)', 'Grantha', 'Sanskrit', 'Greek', 'Burmese(Myanmar)', 'Mkhedruli(Georgian)', 'Asomtavruli_(Georgian)', 'Anglo-SaxonFuthorc', 'Arcadian', 'Balinese', 'Japanese(katakana)', 'Blackfoot_(Canadian_AboriginalSyllabics)', 'Tagalog', 'Armenian', 'Inuktitut(Canadian_AboriginalSyllabics)', 'Korean', 'Bengali', 'Ojibwe(Canadian_Aboriginal_Syllabics)', 'Cyrillic', 'Braille', 'NKo', 'Hebrew', 'Malay(Jawi_-_Arabic)']
30
Traceback (most recent call last):
  File "/home/user/ShadowCreative/Siamese network/Siamese-Networks-for-One-Shot-Learning/train_siamese_network.py", line 58, in <module>
    main()
  File "/home/user/ShadowCreative/Siamese network/Siamese-Networks-for-One-Shot-Learning/train_siamese_network.py", line 46, in main
    model_name='siamese_net_lr10e-4')
  File "/home/user/ShadowCreative/Siamese network/Siamese-Networks-for-One-Shot-Learning/siamese_network.py", line 235, in train_siamese_network
    train_loss, train_accuracy = self.model.train_on_batch(images, labels)
  File "/home/user/anaconda3/envs/tensorflow-gpu/lib/python3.6/site-packages/tensorflow/python/keras/engine/training.py", line 1727, in train_on_batch
    logs = self.train_function(iterator)
  File "/home/user/anaconda3/envs/tensorflow-gpu/lib/python3.6/site-packages/tensorflow/python/eager/def_function.py", line 828, in __call__
    result = self._call(*args, **kwds)
  File "/home/user/anaconda3/envs/tensorflow-gpu/lib/python3.6/site-packages/tensorflow/python/eager/def_function.py", line 871, in _call
    self._initialize(args, kwds, add_initializers_to=initializers)
  File "/home/user/anaconda3/envs/tensorflow-gpu/lib/python3.6/site-packages/tensorflow/python/eager/def_function.py", line 726, in _initialize
    *args, **kwds))
  File "/home/user/anaconda3/envs/tensorflow-gpu/lib/python3.6/site-packages/tensorflow/python/eager/function.py", line 2969, in _get_concrete_function_internal_garbage_collected
    graph_function, _ = self._maybe_define_function(args, kwargs)
  File "/home/user/anaconda3/envs/tensorflow-gpu/lib/python3.6/site-packages/tensorflow/python/eager/function.py", line 3361, in _maybe_define_function
    graph_function = self._create_graph_function(args, kwargs)
  File "/home/user/anaconda3/envs/tensorflow-gpu/lib/python3.6/site-packages/tensorflow/python/eager/function.py", line 3206, in _create_graph_function
    capture_by_value=self._capture_by_value),
  File "/home/user/anaconda3/envs/tensorflow-gpu/lib/python3.6/site-packages/tensorflow/python/framework/func_graph.py", line 990, in func_graph_from_py_func
    func_outputs = python_func(*func_args, **func_kwargs)
  File "/home/user/anaconda3/envs/tensorflow-gpu/lib/python3.6/site-packages/tensorflow/python/eager/def_function.py", line 634, in wrapped_fn
    out = weak_wrapped_fn().__wrapped__(*args, **kwds)
  File "/home/user/anaconda3/envs/tensorflow-gpu/lib/python3.6/site-packages/tensorflow/python/framework/func_graph.py", line 977, in wrapper
    raise e.ag_error_metadata.to_exception(e)
TypeError: in user code:

/home/user/anaconda3/envs/tensorflow-gpu/lib/python3.6/site-packages/tensorflow/python/keras/engine/training.py:805 train_function  *
    return step_function(self, iterator)
/home/user/anaconda3/envs/tensorflow-gpu/lib/python3.6/site-packages/tensorflow/python/keras/engine/training.py:795 step_function  **
    outputs = model.distribute_strategy.run(run_step, args=(data,))
/home/user/anaconda3/envs/tensorflow-gpu/lib/python3.6/site-packages/tensorflow/python/distribute/distribute_lib.py:1259 run
    return self._extended.call_for_each_replica(fn, args=args, kwargs=kwargs)
/home/user/anaconda3/envs/tensorflow-gpu/lib/python3.6/site-packages/tensorflow/python/distribute/distribute_lib.py:2730 call_for_each_replica
    return self._call_for_each_replica(fn, args, kwargs)
/home/user/anaconda3/envs/tensorflow-gpu/lib/python3.6/site-packages/tensorflow/python/distribute/distribute_lib.py:3417 _call_for_each_replica
    return fn(*args, **kwargs)
/home/user/anaconda3/envs/tensorflow-gpu/lib/python3.6/site-packages/tensorflow/python/keras/engine/training.py:788 run_step  **
    outputs = model.train_step(data)
/home/user/anaconda3/envs/tensorflow-gpu/lib/python3.6/site-packages/tensorflow/python/keras/engine/training.py:757 train_step
    self.optimizer.minimize(loss, self.trainable_variables, tape=tape)
/home/user/anaconda3/envs/tensorflow-gpu/lib/python3.6/site-packages/tensorflow/python/keras/optimizer_v2/optimizer_v2.py:497 minimize
    loss, var_list=var_list, grad_loss=grad_loss, tape=tape)
/home/user/anaconda3/envs/tensorflow-gpu/lib/python3.6/site-packages/tensorflow/python/keras/optimizer_v2/optimizer_v2.py:547 _compute_gradients
    with ops.name_scope_v2(self._name + "/gradients"):

TypeError: unsupported operand type(s) for +: 'Modified_SGD' and 'str'

Process finished with exit code 1

Hi, I ran into this problem. Could you help me take a look?
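Reading the last frame of the traceback: the failing expression is `self._name + "/gradients"`, and the error says the left operand is a `Modified_SGD` instance, so the optimizer's `_name` attribute appears to hold the optimizer object itself instead of a string. The minimal sketch below reproduces that kind of failure in plain Python. `BaseOptimizer`, `gradient_scope`, and the `pass_self_by_mistake` flag are hypothetical stand-ins, and passing `self` where the name belongs is only one assumed way the attribute could end up wrong; it is not taken from the repo or from TensorFlow.

```python
class BaseOptimizer:
    """Hypothetical stand-in for a framework optimizer base class."""

    def __init__(self, name):
        self._name = name  # expected to be a string

    def gradient_scope(self):
        # Same shape as the failing expression in the traceback.
        return self._name + "/gradients"


class Modified_SGD(BaseOptimizer):
    def __init__(self, name="Modified_SGD", pass_self_by_mistake=False):
        if pass_self_by_mistake:
            # Assumed bug for illustration: the instance lands in the name slot.
            super().__init__(self)
        else:
            super().__init__(name)


def describe_failure():
    """Return the error message raised by the misconfigured optimizer."""
    try:
        Modified_SGD(pass_self_by_mistake=True).gradient_scope()
    except TypeError as error:
        return str(error)
    return None


print(describe_failure())
```

If this reading is right, the fix lies in how the custom optimizer calls its base-class constructor under the newer Keras API, not in the training loop.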

tensorfreitas commented 3 years ago

Answered in #14

tensorfreitas / Siamese-Networks-for-One-Shot-Learning

Training loss does not change in the first 700 iterations #11