shensq04 / EKLAVYA


When I ran RNN/train.py, the program was interrupted midway without giving an error message #4

Closed: qingbol closed this issue 4 years ago

qingbol commented 4 years ago

When I use the default number of processes (40) to train the RNN model, the error below occurs.

/home/qingbol/.conda/envs/tf110cpu_py27/lib/python2.7/site-packages/tensorflow/python/ops/gradients_impl.py:93: UserWarning: Converting sparse IndexedSlices to a dense Tensor of unknown shape. This may consume a large amount of memory.
  "Converting sparse IndexedSlices to a dense Tensor of unknown shape. "
Created the model!
[Batch 0][Epoch 0] cost: 2.773; accuracy: 0.027
Traceback (most recent call last):
  File "train.py", line 283, in <module>
    main()
  File "train.py", line 279, in main
    training(config_info)
  File "train.py", line 218, in training
    model.train()
  File "train.py", line 153, in train
    self._data, self._label, self._length, self._keep_prob)
  File "train.py", line 33, in fill_feed_dict
    data_batch = data_set.get_batch(batch_size=batch_size)
  File "/scratch2/qingbol/EKLAVYA2/code/RNN/train/dataset.py", line 240, in get_batch
    train_batch = self.get_batch_data(func_list_batch)
  File "/scratch2/qingbol/EKLAVYA2/code/RNN/train/dataset.py", line 141, in get_batch_data
    pool = Pool(self.thread_num)
  File "/home/qingbol/.conda/envs/tf110cpu_py27/lib/python2.7/multiprocessing/__init__.py", line 232, in Pool
    return Pool(processes, initializer, initargs, maxtasksperchild)
  File "/home/qingbol/.conda/envs/tf110cpu_py27/lib/python2.7/multiprocessing/pool.py", line 161, in __init__
    self._repopulate_pool()
  File "/home/qingbol/.conda/envs/tf110cpu_py27/lib/python2.7/multiprocessing/pool.py", line 225, in _repopulate_pool
    w.start()
  File "/home/qingbol/.conda/envs/tf110cpu_py27/lib/python2.7/multiprocessing/process.py", line 130, in start
    self._popen = Popen(self)
  File "/home/qingbol/.conda/envs/tf110cpu_py27/lib/python2.7/multiprocessing/forking.py", line 121, in __init__
    self.pid = os.fork()
OSError: [Errno 12] Cannot allocate memory
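A minimal sketch of one possible mitigation, assuming (as the traceback suggests) that `get_batch_data` builds a fresh `Pool(self.thread_num)` for every batch. `os.fork()` can fail with `ENOMEM` when the parent process is already large, so the sketch caps the worker count at the CPU count and the batch size, and closes and joins each pool before the next batch creates another one. `choose_worker_count`, `process_one`, and `get_batch_data_safe` are hypothetical names, not EKLAVYA's actual code.

```python
import multiprocessing


def choose_worker_count(requested, n_items, n_cpus=None):
    """Clamp the requested number of worker processes (hypothetical helper)."""
    if n_cpus is None:
        n_cpus = multiprocessing.cpu_count()
    # At least one worker, never more than the CPUs or the items to process.
    return max(1, min(requested, n_cpus, n_items))


def process_one(item):
    # Hypothetical per-item work; stands in for whatever dataset.py does per function.
    return len(item)


def get_batch_data_safe(func_list_batch, thread_num=40):
    """Build one batch with a bounded pool that is always cleaned up."""
    workers = choose_worker_count(thread_num, len(func_list_batch))
    pool = multiprocessing.Pool(workers)
    try:
        # map() returns results in the same order as the input list.
        return pool.map(process_one, func_list_batch)
    finally:
        # Release every worker process before the next batch builds a new pool.
        pool.close()
        pool.join()


if __name__ == "__main__":
    print(get_batch_data_safe(["abc", "de", "f"]))  # [3, 2, 1]
```

Creating the pool once and reusing it across batches would avoid repeated forks altogether; the per-batch version above just keeps the shape of the traceback.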

So I decreased the number of processes to 16, and the above error disappeared. But another issue came up: the program stopped midway without printing any error message.

/home/qingbol/.conda/envs/tf110cpu_py27/lib/python2.7/site-packages/tensorflow/python/ops/gradients_impl.py:93: UserWarning: Converting sparse IndexedSlices to a dense Tensor of unknown shape. This may consume a large amount of memory.
  "Converting sparse IndexedSlices to a dense Tensor of unknown shape. "
Created the model!
[Batch 0][Epoch 0] cost: 2.775; accuracy: 0.062
d[Batch 20][Epoch 0] cost: 1.737; accuracy: 0.297
d[Batch 40][Epoch 0] cost: 1.712; accuracy: 0.355
[Batch 60][Epoch 1] cost: 1.760; accuracy: 0.328
[Batch 80][Epoch 1] cost: 1.668; accuracy: 0.344
Saved the model ... 100
[Batch 100][Epoch 2] cost: 1.505; accuracy: 0.383
[Batch 120][Epoch 2] cost: 1.365; accuracy: 0.555
[Batch 140][Epoch 3] cost: 1.277; accuracy: 0.562
[Batch 160][Epoch 3] cost: 1.084; accuracy: 0.641
[Batch 180][Epoch 3] cost: 0.876; accuracy: 0.730
Saved the model ... 200
[Batch 200][Epoch 4] cost: 0.934; accuracy: 0.695
[Batch 220][Epoch 4] cost: 0.770; accuracy: 0.750
[Batch 240][Epoch 5] cost: 0.734; accuracy: 0.762
[Batch 260][Epoch 5] cost: 0.696; accuracy: 0.734
[Batch 280][Epoch 6] cost: 0.567; accuracy: 0.801
Saved the model ... 300
[Batch 300][Epoch 6] cost: 0.630; accuracy: 0.773
[Batch 320][Epoch 6] cost: 0.746; accuracy: 0.746
[Batch 340][Epoch 7] cost: 0.627; accuracy: 0.789
[Batch 360][Epoch 7] cost: 0.716; accuracy: 0.762
[Batch 380][Epoch 8] cost: 0.455; accuracy: 0.871
Saved the model ... 400
[Batch 400][Epoch 8] cost: 0.477; accuracy: 0.863
[Batch 420][Epoch 9] cost: 0.435; accuracy: 0.844
[Batch 440][Epoch 9] cost: 0.618; accuracy: 0.793
[Batch 460][Epoch 9] cost: 0.462; accuracy: 0.832
[Batch 480][Epoch 10] cost: 0.416; accuracy: 0.840
Saved the model ... 500
[Batch 500][Epoch 10] cost: 0.548; accuracy: 0.801
[Batch 520][Epoch 11] cost: 0.434; accuracy: 0.848

In both cases, it gives the same warning message:

UserWarning: Converting sparse IndexedSlices to a dense Tensor of unknown shape. This may consume a large amount of memory.
  "Converting sparse IndexedSlices to a dense Tensor of unknown shape. "

Any clue or solution?
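When a process stops with no traceback at all, one possibility (a guess, not something the log above confirms) is that it was killed by a signal, for example SIGKILL from the Linux OOM killer, which a Python program cannot catch or report. A minimal sketch, assuming a POSIX system, that runs the training script as a child process and reports how it ended; `run_and_report` is a hypothetical helper.

```python
import signal
import subprocess
import sys


def run_and_report(cmd):
    """Run cmd and describe how it ended (hypothetical helper).

    On POSIX, a negative return code -N means the child was terminated by
    signal N. SIGKILL (9) is what the Linux OOM killer sends, and it leaves
    no Python traceback behind.
    """
    returncode = subprocess.call(cmd)
    if returncode < 0:
        sig = -returncode
        if sig == signal.SIGKILL:
            return "killed by SIGKILL (possibly the OOM killer; check dmesg)"
        return "killed by signal %d" % sig
    return "exited with status %d" % returncode


if __name__ == "__main__":
    print(run_and_report([sys.executable, "train.py"]))
```

On Linux, an OOM kill is usually also recorded in the kernel log, which `dmesg` displays.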