Closed DanielNehemiah closed 3 years ago
Hi!
Let it run more iterations.
In the original code, validation is evaluated every 1000 iterations. Check after 3000 to 5000 iterations whether the loss is still constant.
Let me know how it goes.
@DanielNehemiah There's a stopping condition in the training function, so just let it run. Validation is evaluated every 1000 iterations, and training ends on its own once the validation error stops improving between evaluations.
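For readers unfamiliar with this kind of stopping condition, here is a minimal sketch of the idea in plain Python. It is not the repo's actual training function: `train_with_early_stopping`, `train_step`, `evaluate`, `evaluate_each` and `patience` are hypothetical names, and the defaults are assumptions chosen only for illustration.

```python
def train_with_early_stopping(train_step, evaluate, max_iterations=1_000_000,
                              evaluate_each=1000, patience=10):
    """Train until validation accuracy stops improving.

    Hypothetical sketch: `train_step(i)` runs one training iteration and
    `evaluate(i)` returns a validation accuracy. Neither is the repo's API.
    """
    best_accuracy = 0.0
    best_iteration = 0
    iteration = 0
    for iteration in range(1, max_iterations + 1):
        train_step(iteration)

        # Only look at the validation set every `evaluate_each` iterations.
        if iteration % evaluate_each != 0:
            continue

        accuracy = evaluate(iteration)
        if accuracy > best_accuracy:
            # Validation improved: remember where it happened.
            best_accuracy = accuracy
            best_iteration = iteration
        elif iteration - best_iteration >= patience * evaluate_each:
            # No improvement for `patience` evaluations in a row: stop.
            break

    return best_accuracy, best_iteration, iteration
```

With a rule like this, a run that stops early on its own means the validation score stopped improving within the patience window. That is worth distinguishing from a crash or a run that simply needs more iterations.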
I've tried several times, and it always stops after 10000 iterations, which suggests that the model is not converging.
@JacksonLaw577 Could you give more details?
When I ran your original code, this error occurred, so I changed self.lr to self.lr0. I don't know whether that is the cause.
@JacksonLaw577 Are you running the exact same dataset? I developed this on an older version of Keras and Tensorflow. Haven't been able to retry on the newer versions.
/home/user/anaconda3/envs/tensorflow-gpu/bin/python3.6 "/home/user/ShadowCreative/Siamese network/Siamese-Networks-for-One-Shot-Learning/train_siamese_network.py"
2021-02-20 17:56:39.594927: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.11.0
Using TensorFlow backend.
2021-02-20 17:56:40.525094: I tensorflow/compiler/jit/xla_cpu_device.cc:41] Not creating XLA devices, tf_xla_enable_xla_devices not set
2021-02-20 17:56:40.525605: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcuda.so.1
2021-02-20 17:56:40.553530: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:941] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-02-20 17:56:40.554140: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1720] Found device 0 with properties:
pciBusID: 0000:01:00.0 name: GeForce GTX 1650 computeCapability: 7.5
coreClock: 1.71GHz coreCount: 14 deviceMemorySize: 3.82GiB deviceMemoryBandwidth: 178.84GiB/s
2021-02-20 17:56:40.554175: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.11.0
2021-02-20 17:56:40.558338: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcublas.so.11
2021-02-20 17:56:40.558412: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcublasLt.so.11
2021-02-20 17:56:40.559689: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcufft.so.10
2021-02-20 17:56:40.559968: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcurand.so.10
2021-02-20 17:56:40.560085: W tensorflow/stream_executor/platform/default/dso_loader.cc:60] Could not load dynamic library 'libcusolver.so.10'; dlerror: libcusolver.so.10: cannot open shared object file: No such file or directory
2021-02-20 17:56:40.560968: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcusparse.so.11
2021-02-20 17:56:40.561058: W tensorflow/stream_executor/platform/default/dso_loader.cc:60] Could not load dynamic library 'libcudnn.so.8'; dlerror: libcudnn.so.8: cannot open shared object file: No such file or directory
2021-02-20 17:56:40.561073: W tensorflow/core/common_runtime/gpu/gpu_device.cc:1757] Cannot dlopen some GPU libraries. Please make sure the missing libraries mentioned above are installed properly if you would like to use GPU. Follow the guide at https://www.tensorflow.org/install/gpu for how to download and setup the required libraries for your platform.
Skipping registering GPU devices...
2021-02-20 17:56:40.561376: I tensorflow/core/platform/cpu_feature_guard.cc:142] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2021-02-20 17:56:40.562933: I tensorflow/compiler/jit/xla_gpu_device.cc:99] Not creating XLA devices, tf_xla_enable_xla_devices not set
2021-02-20 17:56:40.562966: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1261] Device interconnect StreamExecutor with strength 1 edge matrix:
2021-02-20 17:56:40.562976: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1267]
2021-02-20 17:56:40.606270: W tensorflow/core/framework/cpu_allocator_impl.cc:80] Allocation of 150994944 exceeds 10% of free system memory.
2021-02-20 17:56:40.623839: W tensorflow/core/framework/cpu_allocator_impl.cc:80] Allocation of 150994944 exceeds 10% of free system memory.
2021-02-20 17:56:40.641658: W tensorflow/core/framework/cpu_allocator_impl.cc:80] Allocation of 150994944 exceeds 10% of free system memory.
['Alphabet_of_theMagi', 'Tifinagh', 'Gujarati', 'Syriac(Estrangelo)', 'Futurama', 'EarlyAramaic', 'Latin', 'Japanese(hiragana)', 'Grantha', 'Sanskrit', 'Greek', 'Burmese(Myanmar)', 'Mkhedruli(Georgian)', 'Asomtavruli_(Georgian)', 'Anglo-SaxonFuthorc', 'Arcadian', 'Balinese', 'Japanese(katakana)', 'Blackfoot_(Canadian_AboriginalSyllabics)', 'Tagalog', 'Armenian', 'Inuktitut(Canadian_AboriginalSyllabics)', 'Korean', 'Bengali', 'Ojibwe(Canadian_Aboriginal_Syllabics)', 'Cyrillic', 'Braille', 'NKo', 'Hebrew', 'Malay(Jawi_-_Arabic)']
30
Traceback (most recent call last):
File "/home/user/ShadowCreative/Siamese network/Siamese-Networks-for-One-Shot-Learning/train_siamese_network.py", line 58, in
/home/user/anaconda3/envs/tensorflow-gpu/lib/python3.6/site-packages/tensorflow/python/keras/engine/training.py:805 train_function *
return step_function(self, iterator)
/home/user/anaconda3/envs/tensorflow-gpu/lib/python3.6/site-packages/tensorflow/python/keras/engine/training.py:795 step_function **
outputs = model.distribute_strategy.run(run_step, args=(data,))
/home/user/anaconda3/envs/tensorflow-gpu/lib/python3.6/site-packages/tensorflow/python/distribute/distribute_lib.py:1259 run
return self._extended.call_for_each_replica(fn, args=args, kwargs=kwargs)
/home/user/anaconda3/envs/tensorflow-gpu/lib/python3.6/site-packages/tensorflow/python/distribute/distribute_lib.py:2730 call_for_each_replica
return self._call_for_each_replica(fn, args, kwargs)
/home/user/anaconda3/envs/tensorflow-gpu/lib/python3.6/site-packages/tensorflow/python/distribute/distribute_lib.py:3417 _call_for_each_replica
return fn(*args, **kwargs)
/home/user/anaconda3/envs/tensorflow-gpu/lib/python3.6/site-packages/tensorflow/python/keras/engine/training.py:788 run_step **
outputs = model.train_step(data)
/home/user/anaconda3/envs/tensorflow-gpu/lib/python3.6/site-packages/tensorflow/python/keras/engine/training.py:757 train_step
self.optimizer.minimize(loss, self.trainable_variables, tape=tape)
/home/user/anaconda3/envs/tensorflow-gpu/lib/python3.6/site-packages/tensorflow/python/keras/optimizer_v2/optimizer_v2.py:497 minimize
loss, var_list=var_list, grad_loss=grad_loss, tape=tape)
/home/user/anaconda3/envs/tensorflow-gpu/lib/python3.6/site-packages/tensorflow/python/keras/optimizer_v2/optimizer_v2.py:547 _compute_gradients
with ops.name_scope_v2(self._name + "/gradients"):
TypeError: unsupported operand type(s) for +: 'Modified_SGD' and 'str'
Process finished with exit code 1
Hi, I ran into this problem. Could you help me take a look?
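One thing the traceback does establish: the failing expression is `self._name + "/gradients"`, and the left operand is a `Modified_SGD` object, so the optimizer's `_name` attribute holds the optimizer itself rather than a string. How it got there is not visible in the log. The sketch below reproduces the same kind of error without TensorFlow, under the assumption that the subclass passes the wrong value into the base constructor's `name` argument. `BaseOptimizer`, `BrokenSGD` and `FixedSGD` are hypothetical stand-ins, not the repo's or Keras's classes.

```python
class BaseOptimizer:
    """Stand-in for a base optimizer that takes `name` as an argument."""

    def __init__(self, name, **kwargs):
        self._name = name

    def gradients_scope(self):
        # Same shape as the failing line in the traceback.
        return self._name + "/gradients"


class BrokenSGD(BaseOptimizer):
    def __init__(self, lr=0.01, **kwargs):
        # Assumed bug: `self` is passed explicitly and lands in `name`.
        super().__init__(self, **kwargs)
        self.lr = lr


class FixedSGD(BaseOptimizer):
    def __init__(self, lr=0.01, name="Modified_SGD", **kwargs):
        # Pass a string name through to the base class.
        super().__init__(name, **kwargs)
        self.lr = lr


if __name__ == "__main__":
    print(FixedSGD().gradients_scope())
    try:
        BrokenSGD().gradients_scope()
    except TypeError as err:
        print("TypeError:", err)
```

If the custom optimizer was written for an older Keras, checking what its constructor passes to the base class as `name` is a reasonable first step.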
Answered in #14
Hey,
I have started training the network using the code in this repo. The training accuracy has not gone above 0.65 and mostly hovers around 0.45-0.52 in the first 700 iterations. Is this normal? The loss is also changing very little, staying around 5.1.
Thanks for this code!