Closed ljhzxc closed 5 years ago
Check the duration of the audio files in the large test CSV. There's probably a few files that are very long and cause the OOM.
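One quick way to follow this suggestion is to scan the CSV and list the longest referenced WAV files. This is only a sketch; it assumes a DeepSpeech-style CSV with a `wav_filename` column (as produced by the stock importers) and 16-bit PCM WAV files readable by the standard `wave` module.

```python
import csv
import wave

def longest_wavs(csv_path, top_n=10):
    """Return (duration_seconds, path) pairs for the longest WAV files in a CSV,
    sorted longest first. Assumes a 'wav_filename' column (DeepSpeech convention)."""
    durations = []
    with open(csv_path, newline='') as f:
        for row in csv.DictReader(f):
            path = row['wav_filename']  # column name assumed from DeepSpeech importers
            with wave.open(path, 'rb') as w:
                durations.append((w.getnframes() / w.getframerate(), path))
    return sorted(durations, reverse=True)[:top_n]
```

Running `longest_wavs('test.csv')` and looking at the top entries would show whether a handful of very long clips dominate the test set.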
The biggest audio file in my train/dev/test CSVs is no more than 0.6MB, so that may not be the reason.
Can you provide a log file?
python3 process is killed because of OOM.
Kernel OOM ? GPU OOM ? There's no mention of the GPU nor its memory, nor the system's memory. I don't see how we can help more, not to mention the lack of any log as requested by @kdavis-mozilla
Sorry for my late reply. I can't find any other logs. The following content is from /var/log/kern.log; Apr 14 is the day I tried to export the model.
Apr 14 09:19:17 gpu kernel: [1942044.048852] [ 4727] 0 4727 16688962 7485317 19123 46 949207 0 python3
Apr 14 09:19:17 gpu kernel: [1942044.048854] [ 4789] 115 4789 4284 300 14 3 45 0 nvidia-persiste
Apr 14 09:19:17 gpu kernel: [1942044.048856] [ 5059] 120 5059 77313 19 60 3 475 -900 postgres
Apr 14 09:19:17 gpu kernel: [1942044.048857] Out of memory: Kill process 4727 (python3) score 494 or sacrifice child
Apr 14 09:19:17 gpu kernel: [1942044.048983] Killed process 4727 (python3) total-vm:66755848kB, anon-rss:0kB, file-rss:29941268kB
Apr 14 09:19:17 gpu kernel: [1942072.385910] irqbalance invoked oom-killer: gfp_mask=0x24200ca, order=0, oom_score_adj=0
Apr 14 09:19:17 gpu kernel: [1942072.385914] irqbalance cpuset=/ mems_allowed=0
Apr 14 09:19:17 gpu kernel: [1942072.385918] CPU: 10 PID: 1584 Comm: irqbalance Tainted: P OE 4.4.0-62-generic #83-Ubuntu
Apr 14 09:19:17 gpu kernel: [1942072.385920] Hardware name: ASUS All Series/X99-E WS/USB 3.1, BIOS 3402 11/14/2016
Apr 14 09:19:17 gpu kernel: [1942072.385921] 0000000000000286 0000000092a35c8f ffff8808786b79b0 ffffffff813f7c63
Apr 14 09:19:17 gpu kernel: [1942072.385924] ffff8808786b7b88 ffff880868460000 ffff8808786b7a20 ffffffff8120ad4e
Apr 14 09:19:17 gpu kernel: [1942072.385926] ffffffff8113f5ca ffff8808786b7a50 ffffffff811a6b6d ffff8802300a5460
Apr 14 09:19:17 gpu kernel: [1942072.385928] Call Trace:
Apr 14 09:19:17 gpu kernel: [1942072.385933] [<ffffffff813f7c63>] dump_stack+0x63/0x90
Apr 14 09:19:17 gpu kernel: [1942072.385937] [<ffffffff8120ad4e>] dump_header+0x5a/0x1c5
Apr 14 09:19:17 gpu kernel: [1942072.385941] [<ffffffff8113f5ca>] ? __delayacct_freepages_end+0x2a/0x30
Apr 14 09:19:17 gpu kernel: [1942072.385944] [<ffffffff811a6b6d>] ? do_try_to_free_pages+0x2ed/0x410
Apr 14 09:19:17 gpu kernel: [1942072.385947] [<ffffffff811926c2>] oom_kill_process+0x202/0x3c0
Apr 14 09:19:17 gpu kernel: [1942072.385949] [<ffffffff81192ae9>] out_of_memory+0x219/0x460
Apr 14 09:19:17 gpu kernel: [1942072.385952] [<ffffffff81198a5d>] __alloc_pages_slowpath.constprop.88+0x8fd/0xa70
Apr 14 09:19:17 gpu kernel: [1942072.385954] [<ffffffff81198e56>] __alloc_pages_nodemask+0x286/0x2a0
Apr 14 09:19:17 gpu kernel: [1942072.385956] [<ffffffff8113f276>] ? delayacct_end+0x56/0x60
Apr 14 09:19:17 gpu kernel: [1942072.385960] [<ffffffff811e3f5e>] alloc_pages_vma+0xbe/0x240
Apr 14 09:19:17 gpu kernel: [1942072.385962] [<ffffffff811d49de>] __read_swap_cache_async+0xee/0x140
Apr 14 09:19:17 gpu kernel: [1942072.385964] [<ffffffff811d4a56>] read_swap_cache_async+0x26/0x60
Apr 14 09:19:17 gpu kernel: [1942072.385965] [<ffffffff811d4b95>] swapin_readahead+0x105/0x1b0
Apr 14 09:19:17 gpu kernel: [1942072.385969] [<ffffffff811c1b16>] handle_mm_fault+0x1196/0x1820
Apr 14 09:19:17 gpu kernel: [1942072.385972] [<ffffffff8106b4f7>] __do_page_fault+0x197/0x400
Apr 14 09:19:17 gpu kernel: [1942072.385974] [<ffffffff8106b782>] do_page_fault+0x22/0x30
Apr 14 09:19:17 gpu kernel: [1942072.385978] [<ffffffff8183a778>] page_fault+0x28/0x30
Apr 14 09:19:17 gpu kernel: [1942072.385979] Mem-Info:
Apr 14 09:19:17 gpu kernel: [1942072.385985] active_anon:7465474 inactive_anon:503708 isolated_anon:1024
Apr 14 09:19:17 gpu kernel: [1942072.385985] active_file:206 inactive_file:198 isolated_file:0
Apr 14 09:19:17 gpu kernel: [1942072.385985] unevictable:914 dirty:0 writeback:13 unstable:0
Apr 14 09:19:17 gpu kernel: [1942072.385985] slab_reclaimable:30486 slab_unreclaimable:47402
Apr 14 09:19:17 gpu kernel: [1942072.385985] mapped:7485714 shmem:7970208 pagetables:21503 bounce:0
Apr 14 09:19:17 gpu kernel: [1942072.385985] free:52080 free_pcp:0 free_cma:0
Apr 14 09:19:17 gpu kernel: [1942072.385988] Node 0 DMA free:15892kB min:32kB low:40kB high:48kB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:15996kB managed:15900kB mlocked:0kB dirty:0kB writeback:0kB mapped:0kB shmem:0kB slab_reclaimable:0kB slab_unreclaimable:8kB kernel_stack:0kB pagetables:0kB unstable:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? yes
Apr 14 09:19:17 gpu kernel: [1942072.385992] lowmem_reserve[]: 0 1303 32020 32020 32020
Apr 14 09:19:17 gpu kernel: [1942072.385995] Node 0 DMA32 free:125616kB min:2748kB low:3432kB high:4120kB active_anon:922088kB inactive_anon:308776kB active_file:12kB inactive_file:4kB unevictable:472kB isolated(anon):0kB isolated(file):0kB present:1465680kB managed:1385060kB mlocked:472kB dirty:0kB writeback:4kB mapped:925884kB shmem:1230904kB slab_reclaimable:5404kB slab_unreclaimable:5664kB kernel_stack:240kB pagetables:3652kB unstable:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:7511248 all_unreclaimable? yes
Apr 14 09:19:17 gpu kernel: [1942072.385999] lowmem_reserve[]: 0 0 30717 30717 30717
Apr 14 09:19:17 gpu kernel: [1942072.386001] Node 0 Normal free:66812kB min:64800kB low:81000kB high:97200kB active_anon:28939808kB inactive_anon:1706056kB active_file:812kB inactive_file:788kB unevictable:3184kB isolated(anon):4096kB isolated(file):0kB present:31981568kB managed:31454928kB mlocked:3184kB dirty:0kB writeback:48kB mapped:29016972kB shmem:30649928kB slab_reclaimable:116540kB slab_unreclaimable:183936kB kernel_stack:5712kB pagetables:82360kB unstable:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:184394568 all_unreclaimable? yes
Apr 14 09:19:17 gpu kernel: [1942072.386005] lowmem_reserve[]: 0 0 0 0 0
Apr 14 09:19:17 gpu kernel: [1942072.386008] Node 0 DMA: 1*4kB (U) 0*8kB 1*16kB (U) 2*32kB (U) 3*64kB (U) 0*128kB 1*256kB (U) 0*512kB 1*1024kB (U) 1*2048kB (M) 3*4096kB (M) = 15892kB
Apr 14 09:19:17 gpu kernel: [1942072.386016] Node 0 DMA32: 242*4kB (UME) 87*8kB (ME) 67*16kB (ME) 42*32kB (ME) 27*64kB (UME) 14*128kB (E) 25*256kB (UME) 16*512kB (UE) 7*1024kB (UE) 1*2048kB (M) 23*4096kB (M) = 125616kB
Apr 14 09:19:17 gpu kernel: [1942072.386025] Node 0 Normal: 815*4kB (MEH) 250*8kB (MEH) 265*16kB (UMEH) 249*32kB (UMEH) 89*64kB (MEH) 63*128kB (UEH) 39*256kB (UEH) 26*512kB (UMEH) 10*1024kB (UEH) 1*2048kB (U) 0*4096kB = 66812kB
Apr 14 09:19:17 gpu kernel: [1942072.386035] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=1048576kB
Apr 14 09:19:17 gpu kernel: [1942072.386036] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
Apr 14 09:19:17 gpu kernel: [1942072.386037] 7971282 total pagecache pages
Apr 14 09:19:17 gpu kernel: [1942072.386039] 32 pages in swap cache
Apr 14 09:19:17 gpu kernel: [1942072.386040] Swap cache stats: add 493338025, delete 493337993, find 204125073/346446887
Apr 14 09:19:17 gpu kernel: [1942072.386041] Free swap = 28418196kB
Apr 14 09:19:17 gpu kernel: [1942072.386042] Total swap = 33463292kB
Apr 14 09:19:17 gpu kernel: [1942072.386043] 8365811 pages RAM
Apr 14 09:19:17 gpu kernel: [1942072.386043] 0 pages HighMem/MovableOnly
Apr 14 09:19:17 gpu kernel: [1942072.386044] 151839 pages reserved
Apr 14 09:19:17 gpu kernel: [1942072.386045] 0 pages cma reserved
Apr 14 09:19:17 gpu kernel: [1942072.386046] 0 pages hwpoisoned
It should be GPU OOM. The GPU usage at the time the model export failed that day is shown in the picture.
Please avoid images for console output
This is not GPU going OOM, this is your system.
Hi, I got the same issue with the Chinese model. My English dataset (~100GB) is almost twice the size of the Chinese one, yet the English model never crashes during training.
This is what I have observed below:
**** LOG1 - transfer learning ****
[Running Command] <<<< ['/root/miniconda3/envs/py36/bin/python', '-u', '/DeepSpeech/DeepSpeech.py', '--alphabet_config_path', '/config/data/alphabet.txt', '--checkpoint_dir', '/log/checkpoint', '--checkpoint_secs', '7200', '--dev_batch_size', '8', '--dev_files', '/config/data/dev.csv', '--dropout_rate', '0.2', '--epoch', '-30', '--estop_mean_thresh', '0.3', '--estop_std_thresh', '0.3', '--export_dir', '/export', '--learning_rate', '0.0001', '--limit_test', '64', '--lm_alpha', '0.8006789096436977', '--lm_beta', '1.85', '--lm_binary_path', '/export/lm.binary', '--lm_trie_path', '/export/trie', '--max_to_keep', '1', '--n_hidden', '1280', '--n_steps', '128', '--report_count', '64', '--test_batch_size', '8', '--test_files', '/config/data/dev.csv', '--train_batch_size', '8', '--train_files', '/config/data/train.csv', '--validation_step', '1', '--summary_dir', '/log/summary']
logging at <_io.TextIOWrapper name='/log/deepspeech.txt' mode='a' encoding='UTF-8'>, <_io.TextIOWrapper name='/log/deepspeech-err.txt' mode='a' encoding='UTF-8'>
Preprocessing ['/config/data/train.csv']
Preprocessing done
Preprocessing ['/config/data/dev.csv']
Preprocessing done
I STARTING Optimization
I Training epoch 55...
I Training of Epoch 55 - loss: 23.294167
I Validating epoch 55...
I Validation of Epoch 55 - loss: 18.385086
I Training epoch 56...
.
.
.
.
I Validating epoch 69...
I Validation of Epoch 69 - loss: 10.784154
I Early stop triggered as (for last 4 steps) validation loss: 10.784154 with standard deviation: 0.303692 and mean: 10.175885
I FINISHED Optimization - training time: 10:17:31
Preprocessing ['/config/data/dev.csv']
Preprocessing done
Computing acoustic model predictions...
**** CRASHED HERE even when testing the same csv as dev.csv ****
**** LOG2 - new model ****
[Running Command] <<<< ['/root/miniconda3/envs/py36/bin/python', '-u', '/DeepSpeech/DeepSpeech.py', '--alphabet_config_path', '/config/data/alphabet.txt', '--checkpoint_dir', '/log/checkpoint', '--checkpoint_secs', '7200', '--dev_batch_size', '24', '--dev_cached_features_path', '/log/dev.cached', '--dev_files', '/speech-data/zh-cn/dev_all.csv,/config/data/dev.csv', '--dropout_rate', '0.2', '--epoch', '-30', '--estop_mean_thresh', '0.3', '--estop_std_thresh', '0.3', '--export_dir', '/export', '--learning_rate', '0.0001', '--limit_test', '64', '--lm_alpha', '1.0', '--lm_beta', '1.85', '--lm_binary_path', '/export/lm.binary', '--lm_trie_path', '/export/trie', '--max_to_keep', '1', '--n_hidden', '1536', '--n_steps', '128', '--report_count', '64', '--show_progressbar', 'False', '--test_batch_size', '24', '--test_cached_features_path', '/log/test.cached', '--test_files', '/speech-data/zh-cn/test_all.csv', '--train_batch_size', '24', '--train_cached_features_path', '/log/train.cached', '--train_files', '/config/data/train.csv', '--validation_step', '2']
logging at <_io.TextIOWrapper name='/log/deepspeech.txt' mode='a' encoding='UTF-8'>, <_io.TextIOWrapper name='/log/deepspeech-err.txt' mode='a' encoding='UTF-8'>
Preprocessing ['/config/data/train.csv']
Saving to /log/train.cached
Preprocessing done
Preprocessing ['/speech-data/zh-cn/dev_all.csv', '/config/data/dev.csv']
Saving to /log/dev.cached
Preprocessing done
I STARTING Optimization
I Training epoch 0...
I Training of Epoch 0 - loss: 114.681121
I Training epoch 1...
I Training of Epoch 1 - loss: 112.573020
I Training epoch 2...
I Training of Epoch 2 - loss: 111.198199
I Validating epoch 2...
I Validation of Epoch 2 - loss: 132.356261
I Training epoch 3...
I Training of Epoch 3 - loss: 111.352473
I Training epoch 4...
I Training of Epoch 4 - loss: 110.923097
I Validating epoch 4...
I Validation of Epoch 4 - loss: 134.511714
I Training epoch 5...
I Training of Epoch 5 - loss: 110.544768
I Training epoch 6...
**** CRASHED HERE ****
Please document what happens precisely. Not just "crashed here", actual error ...
Sorry, "crashed here" means nothing more was printed, so I couldn't find any more logs to show. The pid seems to have been killed by the system and the subprocess exit code is -9; Google says that indicates an OOM error. I didn't keep track of RAM usage, but I remember that at epoch 2 or 3 the free CPU RAM (including swap) was still more than 50%. I will try to record a RAM vs. time table and report here.
My training environment was nvidia-docker with CUDA 9.0, cuDNN 7 and Ubuntu 16, and I executed DeepSpeech.py via Python's subprocess.run.
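A subprocess exit code of -9 corresponds to SIGKILL, which is the signal the kernel OOM killer sends. A minimal sketch of detecting this from the `subprocess.run` wrapper (the real DeepSpeech command line is elided; the stand-in child simply SIGKILLs itself to mimic the OOM killer):

```python
import signal
import subprocess
import sys

# Stand-in for the real training command: a child that kills itself with
# SIGKILL, which is what the kernel OOM killer does to the real process.
proc = subprocess.run(
    [sys.executable, '-c',
     'import os, signal; os.kill(os.getpid(), signal.SIGKILL)'])

# Negative returncodes mean "terminated by signal -returncode".
if proc.returncode == -signal.SIGKILL:  # i.e. -9
    print('child was killed by SIGKILL - consistent with the kernel OOM killer')
```

Checking `returncode == -signal.SIGKILL` in the wrapper would let the retry loop log "probable OOM kill" instead of silently restarting.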
Hi everyone, here is my training log from today. I inserted a memory log every 5 minutes during training and removed some redundant info.
Here are a few summaries:
Output of grep -i commit /proc/meminfo:
CommitLimit: 157183852 kB (~149.9G)
Committed_AS: 11557960 kB (~11.0G)
I ran sudo sysctl -w vm.overcommit_memory=1, but at 04/25 06:20 (UTC+0) the memory usage increased to ~60G and the system killed my process again. Should I run more commands to refresh vm.overcommit_memory?
Output of sysctl -a:
vm.nr_hugepages = 0
vm.nr_hugepages_mempolicy = 0
vm.nr_overcommit_hugepages = 0
vm.numa_stat = 1
vm.numa_zonelist_order = Node
vm.oom_dump_tasks = 1
vm.oom_kill_allocating_task = 0
vm.overcommit_kbytes = 0
vm.overcommit_memory = 1
vm.overcommit_ratio = 70
vm.page-cluster = 3
vm.panic_on_oom = 0
vm.percpu_pagelist_fraction = 0
vm.stat_interval = 1
sysctl: permission denied on key 'vm.stat_refresh'
vm.swappiness = 60
vm.user_reserve_kbytes = 131072
vm.vfs_cache_pressure = 100
vm.watermark_scale_factor = 10
vm.zone_reclaim_mode = 0
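The CommitLimit / Committed_AS pair quoted above can be tracked programmatically alongside the training log. A sketch that parses the two fields out of /proc/meminfo content (field names and kB units are as documented in proc(5)):

```python
def commit_stats(meminfo_text):
    """Return (CommitLimit_kB, Committed_AS_kB) parsed from /proc/meminfo text."""
    fields = {}
    for line in meminfo_text.splitlines():
        key, _, rest = line.partition(':')
        if key in ('CommitLimit', 'Committed_AS'):
            fields[key] = int(rest.split()[0])  # first token after ':' is the kB value
    return fields['CommitLimit'], fields['Committed_AS']

# On a Linux box:
#   with open('/proc/meminfo') as f:
#       limit_kb, committed_kb = commit_stats(f.read())
```

Sampling this every few minutes (the way the memory table below was produced) would show whether Committed_AS approaches CommitLimit before the kill.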
* RAM: 32G
* SWAP: 128G
>>>>>>>> Log Start <<<<<<<<
**** [Running Command] ****
['/root/miniconda3/envs/py36/bin/python', '-u', '/DeepSpeech/DeepSpeech.py', '--alphabet_config_path', '/config/data/alphabet.txt', '--checkpoint_dir', '/log/checkpoint', '--checkpoint_secs', '7200', '--dev_batch_size', '24', '--dev_files', '/speech-data/zh-cn/dev_all.csv,/config/data/dev.csv', '--dropout_rate', '0.2', '--epoch', '-30', '--estop_mean_thresh', '0.3', '--estop_std_thresh', '0.3', '--export_dir', '/export', '--learning_rate', '0.0001', '--limit_test', '64', '--lm_alpha', '1.0', '--lm_beta', '1.85', '--lm_binary_path', '/export/lm.binary', '--lm_trie_path', '/export/trie', '--max_to_keep', '1', '--n_hidden', '1536', '--n_steps', '128', '--report_count', '64', '--test_batch_size', '24', '--test_files', '/speech-data/zh-cn/test_all.csv', '--train_batch_size', '24', '--train_files', '/config/data/train.csv', '--validation_step', '2']
logging at <_io.TextIOWrapper name='/log/deepspeech.txt' mode='a' encoding='UTF-8'>, <_io.TextIOWrapper name='/log/deepspeech-err.txt' mode='a' encoding='UTF-8'>
[Memory] 04/24 06:43
memtype total used free buffers cached
Virtual 31.3G 1.2G 3.5G 0.2G 26.4G
Swap 128.0G 1.7G 126.3G
Preprocessing ['/config/data/train.csv']
[Memory] 04/24 07:03
memtype total used free buffers cached
Virtual 31.3G 27.7G 0.3G 0.2G 3.1G
Swap 128.0G 1.8G 126.2G
Preprocessing done
Preprocessing ['/speech-data/zh-cn/dev_all.csv', '/config/data/dev.csv']
Preprocessing done
I STARTING Optimization
I Training epoch 5...
[Memory] 04/24 07:13
memtype total used free buffers cached
Virtual 31.3G 29.7G 0.4G 0.0G 1.2G
Swap 128.0G 6.5G 121.5G
[Memory] 04/24 07:18
memtype total used free buffers cached
Virtual 31.3G 29.7G 0.4G 0.0G 1.2G
Swap 128.0G 6.6G 121.4G
.
.
.
[Memory] 04/24 09:53
memtype total used free buffers cached
Virtual 31.3G 26.7G 0.4G 0.0G 4.2G
Swap 128.0G 17.3G 110.7G
I Training of Epoch 5 - loss: 100.668000
I Training epoch 6...
.
.
.
[Memory] 04/24 12:28
memtype total used free buffers cached
Virtual 31.3G 26.7G 0.4G 0.0G 4.2G
Swap 128.0G 29.5G 98.5G
[Memory] 04/24 12:43
memtype total used free buffers cached
Virtual 31.3G 26.6G 0.4G 0.0G 4.2G
Swap 128.0G 30.8G 97.2G
I Training of Epoch 6 - loss: 75.468284
I Validating epoch 6...
I Validation of Epoch 6 - loss: 54.855226
I Training epoch 7...
[Memory] 04/24 12:48
memtype total used free buffers cached
Virtual 31.3G 26.2G 0.8G 0.0G 4.2G
Swap 128.0G 31.4G 96.6G
[Memory] 04/24 12:53
memtype total used free buffers cached
Virtual 31.3G 26.5G 0.4G 0.0G 4.3G
Swap 128.0G 31.4G 96.6G
.
.
.
[Memory] 04/24 14:28
memtype total used free buffers cached
Virtual 31.3G 26.8G 0.3G 0.0G 4.2G
Swap 128.0G 33.0G 95.0G
>>>>> Now the training process has been killed, so my program reran the training <<<<<<
* I lost the syslog here
Preprocessing ['/config/data/train.csv']
[Memory] 04/24 14:43
memtype total used free buffers cached
Virtual 31.3G 8.0G 6.8G 0.2G 16.2G
Swap 128.0G 0.2G 127.8G
[Memory] 04/24 14:58
memtype total used free buffers cached
Virtual 31.3G 27.6G 0.3G 0.2G 3.2G
Swap 128.0G 0.2G 127.8G
Preprocessing done
Preprocessing ['/speech-data/zh-cn/dev_all.csv', '/config/data/dev.csv']
Preprocessing done
I STARTING Optimization
I Training epoch 7...
[Memory] 04/24 15:13
memtype total used free buffers cached
Virtual 31.3G 29.8G 0.4G 0.0G 1.1G
Swap 128.0G 4.5G 123.5G
.
.
.
[Memory] 04/24 17:48
memtype total used free buffers cached
Virtual 31.3G 26.7G 0.4G 0.0G 4.2G
Swap 128.0G 14.6G 113.4G
I Training of Epoch 7 - loss: 59.782993
I Training epoch 8...
[Memory] 04/24 20:33
memtype total used free buffers cached
Virtual 31.3G 26.7G 0.4G 0.0G 4.2G
Swap 128.0G 27.2G 100.8G
[Memory] 04/24 20:38
memtype total used free buffers cached
Virtual 31.3G 26.7G 0.4G 0.0G 4.2G
Swap 128.0G 27.7G 100.3G
I Training of Epoch 8 - loss: 54.081356
I Validating epoch 8...
I Validation of Epoch 8 - loss: 39.263215
I Training epoch 9...
.
.
.
[Memory] 04/24 22:30
memtype total used free buffers cached
Virtual 31.3G 26.9G 0.2G 0.0G 4.2G
Swap 128.0G 30.7G 97.3G
>>>>> Now the training process has been killed again. The syslog lines matching 'oom' are below <<<<<<
* my txt log time is UTC+0, and the syslog is UTC+8
Apr 25 06:31:56 yihuachiu-GTX1080 kernel: [594646.252242] gsd-color invoked oom-killer: gfp_mask=0x14200ca(GFP_HIGHUSER_MOVABLE), nodemask=(null), order=0, oom_score_adj=0
Apr 25 06:31:56 yihuachiu-GTX1080 kernel: [594646.252256] oom_kill_process+0x220/0x440
Apr 25 06:31:56 yihuachiu-GTX1080 kernel: [594646.252346] [ pid ] uid tgid total_vm rss pgtables_bytes swapents oom_score_adj name
Apr 25 06:31:56 yihuachiu-GTX1080 kernel: [594675.027262] python invoked oom-killer: gfp_mask=0x14200ca(GFP_HIGHUSER_MOVABLE), nodemask=(null), order=0, oom_score_adj=0
Apr 25 06:31:56 yihuachiu-GTX1080 kernel: [594675.027276] oom_kill_process+0x220/0x440
Apr 25 06:31:56 yihuachiu-GTX1080 kernel: [594675.027361] [ pid ] uid tgid total_vm rss pgtables_bytes swapents oom_score_adj name
Apr 25 06:31:57 yihuachiu-GTX1080 kernel: [594677.436088] oom_reaper: reaped process 29918 (python), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
==========================sys log end ================
>>>>> So my program reran the training again <<<<<<
Preprocessing ['/config/data/train.csv']
[Memory] 04/24 22:35
memtype total used free buffers cached
Virtual 31.3G 5.3G 16.8G 0.2G 9.1G
Swap 128.0G 0.5G 127.5G
[Memory] 04/24 22:55
memtype total used free buffers cached
Virtual 31.3G 30.9G 0.2G 0.0G 0.1G
Swap 128.0G 0.9G 127.1G
Preprocessing done
Preprocessing ['/speech-data/zh-cn/dev_all.csv', '/config/data/dev.csv']
Preprocessing done
I STARTING Optimization
I Training epoch 9...
[Memory] 04/25 01:15
memtype total used free buffers cached
Virtual 31.3G 28.7G 0.2G 0.0G 2.4G
Swap 128.0G 12.3G 115.7G
.
.
.
[Memory] 04/25 04:30
memtype total used free buffers cached
Virtual 31.3G 26.7G 0.4G 0.0G 4.2G
Swap 128.0G 29.7G 98.3G
I Training of Epoch 10 - loss: 47.904078
I Validating epoch 10...
[Memory] 04/25 04:35
memtype total used free buffers cached
Virtual 31.3G 26.5G 0.5G 0.0G 4.2G
Swap 128.0G 30.3G 97.7G
I Validation of Epoch 10 - loss: 33.457280
I Training epoch 11...
.
.
.
[Memory] 04/25 06:20
memtype total used free buffers cached
Virtual 31.3G 26.7G 0.3G 0.0G 4.2G
Swap 128.0G 32.1G 95.9G
>>>>> Now the training process has been killed again. The syslog lines matching 'oom' are below <<<<<<
* my txt log time is UTC+0, and the syslog is UTC+8
Apr 25 14:22:31 yihuachiu-GTX1080 kernel: [622905.852000] python invoked oom-killer: gfp_mask=0x14200ca(GFP_HIGHUSER_MOVABLE), nodemask=(null), order=0, oom_score_adj=0
Apr 25 14:22:31 yihuachiu-GTX1080 kernel: [622905.852016] oom_kill_process+0x220/0x440
Apr 25 14:22:31 yihuachiu-GTX1080 kernel: [622905.852101] [ pid ] uid tgid total_vm rss pgtables_bytes swapents oom_score_adj name
Apr 25 14:22:31 yihuachiu-GTX1080 kernel: [622910.187847] python invoked oom-killer: gfp_mask=0x14200ca(GFP_HIGHUSER_MOVABLE), nodemask=(null), order=0, oom_score_adj=0
Apr 25 14:22:31 yihuachiu-GTX1080 kernel: [622910.187864] oom_kill_process+0x220/0x440
Apr 25 14:22:31 yihuachiu-GTX1080 kernel: [622910.187951] [ pid ] uid tgid total_vm rss pgtables_bytes swapents oom_score_adj name
Apr 25 14:22:33 yihuachiu-GTX1080 kernel: [622912.355795] oom_reaper: reaped process 31842 (python), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
==========================sys log end ================
Preprocessing ['/config/data/train.csv']
[Memory] 04/25 06:45
memtype total used free buffers cached
Virtual 31.3G 30.6G 0.3G 0.1G 0.3G
Swap 128.0G 1.4G 126.6G
Preprocessing done
Preprocessing ['/speech-data/zh-cn/dev_all.csv', '/config/data/dev.csv']
Preprocessing done
I STARTING Optimization
I Training epoch 12...
[Memory] 04/25 07:35
memtype total used free buffers cached
Virtual 31.3G 29.7G 0.4G 0.0G 1.2G
Swap 128.0G 6.4G 121.6G
[Memory] 04/25 08:00
memtype total used free buffers cached
Virtual 31.3G 29.6G 0.4G 0.0G 1.3G
Swap 128.0G 7.6G 120.4G
the training process would not be killed until memory usage increased to ~60G
You said your system had 128GB swap? You have not documented how much RAM you have, nor which processes are using it ...
@mychiux413 Also please use code formatting for console output, it's very hard to read and follow otherwise.
I've edited the console output you mentioned.
Here are some updates:
sysctl didn't help, but I tried running docker with the parameters below, which made the memory usage more stable:
docker run \
--memory-swappiness 100 \
--memory <YOUR RAMSIZE> \
--memory-swap <YOUR RAM + SWAP SIZE> \
--oom-kill-disable \
--runtime=nvidia -it <YOUR DeepSpeech IMAGE> bash
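The <YOUR RAMSIZE> / <YOUR RAM + SWAP SIZE> placeholders can be filled in from /proc/meminfo. A sketch (assumes a Linux host; MemTotal and SwapTotal are in kB per proc(5), and docker accepts a `k` suffix on memory values):

```python
def docker_memory_flags(meminfo_text):
    """Build --memory / --memory-swap values from MemTotal and SwapTotal (kB)."""
    kb = {}
    for line in meminfo_text.splitlines():
        key, _, rest = line.partition(':')
        if key in ('MemTotal', 'SwapTotal'):
            kb[key] = int(rest.split()[0])
    ram, swap = kb['MemTotal'], kb['SwapTotal']
    # --memory caps RAM; --memory-swap is the RAM+swap total, per docker run docs.
    return ['--memory', f'{ram}k', '--memory-swap', f'{ram + swap}k']

# On the host:
#   with open('/proc/meminfo') as f:
#       print(' '.join(docker_memory_flags(f.read())))
```

For the 32G RAM / 128G swap machine described below, this would emit a ~32G `--memory` cap and a ~160G `--memory-swap` limit.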
--train_cached_features_path didn't help either, so the pod kept getting removed when the RAM exploded; but the English model is still trainable in Kubernetes (the English model's RAM usage was stable even with a larger dataset).
Now I'm trying version v0.5.0-alpha.7. The memory usage seems to be more stable than with v0.4.1, the RAM doesn't increase over time, and --feature_cache works too. I'm keeping track of it now...
Does it hold? Can we close the issue?
Training the "Chinese" dataset (~50GB) with v0.5.0-alpha.7 still failed at the testing phase.
Even when I disable the oom-killer and enable --feature_cache, the docker runtime still gets stuck in the testing phase (without any OOM message); only docker stop <container-id> can shut it down.
The final solution for me is to skip testing after early stop is triggered; everything works fine then.
@mychiux413 @reuben Added a fix for this here. So I'd try with 0.5.0-alpha.9 or later.
I've tried 0.5.0-alpha.11 with a ~120GB dataset and testing did work, so this issue can be closed. Thanks for your help.
This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.
I use DeepSpeech to train a Chinese model. The preprocessing and training are OK. Then it goes to export the .pb model, but while computing acoustic model predictions the python3 process is killed because of OOM.
The parameters are below. There are more than 2400 lines in the test CSV, and the audio files it references total about 1GB. When I only use the first 100 lines of the test CSV and retry exporting the .pb model, everything is OK.
I want to know the reason for this situation. And will the small test CSV (first 100 lines) influence the accuracy of the model? Please tell me the reason if you know it. Thanks!!
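The "first 100 lines" workaround amounts to copying the header plus the first N data rows into a smaller CSV. A minimal sketch (the column layout doesn't matter; only a header row is assumed):

```python
import csv

def head_csv(src_path, dst_path, n_rows=100):
    """Copy the header plus the first n_rows data rows of a CSV to dst_path."""
    with open(src_path, newline='') as src, open(dst_path, 'w', newline='') as dst:
        reader = csv.reader(src)
        writer = csv.writer(dst)
        writer.writerow(next(reader))  # keep the header row
        for i, row in enumerate(reader):
            if i >= n_rows:
                break
            writer.writerow(row)
```

Note that testing on a 100-line subset only changes the reported WER/loss numbers, not the trained model: the test set is never used for weight updates, so the exported model itself is unaffected.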
parameter
CPU and GPU memory use: when exporting the .pb model, 6GB of CPU memory is used (maximum is 32GB).
Almost all GPU memory is used. But when I successfully export the model with the small test CSV, GPU memory usage is the same as in the failing case (the big 2400-line test CSV).