mozilla / DeepSpeech

DeepSpeech is an open source embedded (offline, on-device) speech-to-text engine which can run in real time on devices ranging from a Raspberry Pi 4 to high power GPU servers.
Mozilla Public License 2.0

Computing acoustics model predictions error: OOM #2055

Closed ljhzxc closed 5 years ago

ljhzxc commented 5 years ago

I use DeepSpeech to train a Chinese model. The preprocessing and training are OK, and then it proceeds to export the .pb model. But when computing acoustic model predictions, the python3 process is killed because of OOM.

The parameters are as follows. There are more than 2400 lines in the test CSV, and the audio files it references total about 1 GB. When I only use the first 100 lines of the test CSV and retry exporting the .pb model, everything is OK.

I want to know the reason for this situation. Also, will a small test CSV (only the first 100 lines) affect the accuracy of the model? Please tell me the reason if you know it. Thanks!!

reuben commented 5 years ago

Check the duration of the audio files in the large test CSV. There are probably a few files that are very long and cause the OOM.
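
A minimal sketch (not part of the original comment) of one way to do this check with the Python standard library, assuming the usual DeepSpeech CSV columns (wav_filename, wav_filesize, transcript), PCM WAV files, and a hypothetical test.csv path:

    # Minimal sketch: list the longest clips referenced in a DeepSpeech-style CSV.
    import csv
    import wave

    def wav_duration_seconds(path):
        # Duration of a PCM WAV file, derived from frame count and sample rate.
        with wave.open(path, "rb") as w:
            return w.getnframes() / w.getframerate()

    with open("test.csv", newline="", encoding="utf-8") as f:   # hypothetical path
        rows = list(csv.DictReader(f))

    longest = sorted(
        ((wav_duration_seconds(r["wav_filename"]), r["wav_filename"]) for r in rows),
        reverse=True,
    )
    for seconds, name in longest[:20]:   # print the 20 longest clips
        print("{:8.2f}s  {}".format(seconds, name))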

ljhzxc commented 5 years ago

Check the duration of the audio files in the large test CSV. There are probably a few files that are very long and cause the OOM.

The biggest audio file in my train/dev/test CSVs is no more than 0.6 MB, so that may not be the reason.
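
(For reference: assuming 16 kHz, 16-bit mono WAV, which DeepSpeech typically expects, a 0.6 MB file is roughly 600,000 bytes / 32,000 bytes per second ≈ 19 seconds of audio, so no individual clip would be extreme under that assumption.)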

kdavis-mozilla commented 5 years ago

Can you provide a log file?

lissyx commented 5 years ago

python3 process is killed because of OOM.

Kernel OOM? GPU OOM? There's no mention of the GPU or its memory, nor of the system's memory. I don't see how we can help without more details, not to mention the lack of any log as requested by @kdavis-mozilla

ljhzxc commented 5 years ago

Can you provide a log file?

Sorry for my late reply. I couldn't find any other logs. The following content is from /var/log/kern.log; Apr 14 is the day I tried to export the model.

Apr 14 09:19:17 gpu kernel: [1942044.048852] [ 4727]     0  4727 16688962  7485317   19123      46   949207             0 python3
Apr 14 09:19:17 gpu kernel: [1942044.048854] [ 4789]   115  4789     4284      300      14       3       45             0 nvidia-persiste
Apr 14 09:19:17 gpu kernel: [1942044.048856] [ 5059]   120  5059    77313       19      60       3      475          -900 postgres
Apr 14 09:19:17 gpu kernel: [1942044.048857] Out of memory: Kill process 4727 (python3) score 494 or sacrifice child
Apr 14 09:19:17 gpu kernel: [1942044.048983] Killed process 4727 (python3) total-vm:66755848kB, anon-rss:0kB, file-rss:29941268kB
Apr 14 09:19:17 gpu kernel: [1942072.385910] irqbalance invoked oom-killer: gfp_mask=0x24200ca, order=0, oom_score_adj=0
Apr 14 09:19:17 gpu kernel: [1942072.385914] irqbalance cpuset=/ mems_allowed=0
Apr 14 09:19:17 gpu kernel: [1942072.385918] CPU: 10 PID: 1584 Comm: irqbalance Tainted: P           OE   4.4.0-62-generic #83-Ubuntu
Apr 14 09:19:17 gpu kernel: [1942072.385920] Hardware name: ASUS All Series/X99-E WS/USB 3.1, BIOS 3402 11/14/2016
Apr 14 09:19:17 gpu kernel: [1942072.385921]  0000000000000286 0000000092a35c8f ffff8808786b79b0 ffffffff813f7c63
Apr 14 09:19:17 gpu kernel: [1942072.385924]  ffff8808786b7b88 ffff880868460000 ffff8808786b7a20 ffffffff8120ad4e
Apr 14 09:19:17 gpu kernel: [1942072.385926]  ffffffff8113f5ca ffff8808786b7a50 ffffffff811a6b6d ffff8802300a5460
Apr 14 09:19:17 gpu kernel: [1942072.385928] Call Trace:
Apr 14 09:19:17 gpu kernel: [1942072.385933]  [<ffffffff813f7c63>] dump_stack+0x63/0x90
Apr 14 09:19:17 gpu kernel: [1942072.385937]  [<ffffffff8120ad4e>] dump_header+0x5a/0x1c5
Apr 14 09:19:17 gpu kernel: [1942072.385941]  [<ffffffff8113f5ca>] ? __delayacct_freepages_end+0x2a/0x30
Apr 14 09:19:17 gpu kernel: [1942072.385944]  [<ffffffff811a6b6d>] ? do_try_to_free_pages+0x2ed/0x410
Apr 14 09:19:17 gpu kernel: [1942072.385947]  [<ffffffff811926c2>] oom_kill_process+0x202/0x3c0
Apr 14 09:19:17 gpu kernel: [1942072.385949]  [<ffffffff81192ae9>] out_of_memory+0x219/0x460
Apr 14 09:19:17 gpu kernel: [1942072.385952]  [<ffffffff81198a5d>] __alloc_pages_slowpath.constprop.88+0x8fd/0xa70
Apr 14 09:19:17 gpu kernel: [1942072.385954]  [<ffffffff81198e56>] __alloc_pages_nodemask+0x286/0x2a0
Apr 14 09:19:17 gpu kernel: [1942072.385956]  [<ffffffff8113f276>] ? delayacct_end+0x56/0x60
Apr 14 09:19:17 gpu kernel: [1942072.385960]  [<ffffffff811e3f5e>] alloc_pages_vma+0xbe/0x240
Apr 14 09:19:17 gpu kernel: [1942072.385962]  [<ffffffff811d49de>] __read_swap_cache_async+0xee/0x140
Apr 14 09:19:17 gpu kernel: [1942072.385964]  [<ffffffff811d4a56>] read_swap_cache_async+0x26/0x60
Apr 14 09:19:17 gpu kernel: [1942072.385965]  [<ffffffff811d4b95>] swapin_readahead+0x105/0x1b0
Apr 14 09:19:17 gpu kernel: [1942072.385969]  [<ffffffff811c1b16>] handle_mm_fault+0x1196/0x1820
Apr 14 09:19:17 gpu kernel: [1942072.385972]  [<ffffffff8106b4f7>] __do_page_fault+0x197/0x400
Apr 14 09:19:17 gpu kernel: [1942072.385974]  [<ffffffff8106b782>] do_page_fault+0x22/0x30
Apr 14 09:19:17 gpu kernel: [1942072.385978]  [<ffffffff8183a778>] page_fault+0x28/0x30
Apr 14 09:19:17 gpu kernel: [1942072.385979] Mem-Info:
Apr 14 09:19:17 gpu kernel: [1942072.385985] active_anon:7465474 inactive_anon:503708 isolated_anon:1024
Apr 14 09:19:17 gpu kernel: [1942072.385985]  active_file:206 inactive_file:198 isolated_file:0
Apr 14 09:19:17 gpu kernel: [1942072.385985]  unevictable:914 dirty:0 writeback:13 unstable:0
Apr 14 09:19:17 gpu kernel: [1942072.385985]  slab_reclaimable:30486 slab_unreclaimable:47402
Apr 14 09:19:17 gpu kernel: [1942072.385985]  mapped:7485714 shmem:7970208 pagetables:21503 bounce:0
Apr 14 09:19:17 gpu kernel: [1942072.385985]  free:52080 free_pcp:0 free_cma:0
Apr 14 09:19:17 gpu kernel: [1942072.385988] Node 0 DMA free:15892kB min:32kB low:40kB high:48kB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:15996kB managed:15900kB mlocked:0kB dirty:0kB writeback:0kB mapped:0kB shmem:0kB slab_reclaimable:0kB slab_unreclaimable:8kB kernel_stack:0kB pagetables:0kB unstable:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? yes
Apr 14 09:19:17 gpu kernel: [1942072.385992] lowmem_reserve[]: 0 1303 32020 32020 32020
Apr 14 09:19:17 gpu kernel: [1942072.385995] Node 0 DMA32 free:125616kB min:2748kB low:3432kB high:4120kB active_anon:922088kB inactive_anon:308776kB active_file:12kB inactive_file:4kB unevictable:472kB isolated(anon):0kB isolated(file):0kB present:1465680kB managed:1385060kB mlocked:472kB dirty:0kB writeback:4kB mapped:925884kB shmem:1230904kB slab_reclaimable:5404kB slab_unreclaimable:5664kB kernel_stack:240kB pagetables:3652kB unstable:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:7511248 all_unreclaimable? yes
Apr 14 09:19:17 gpu kernel: [1942072.385999] lowmem_reserve[]: 0 0 30717 30717 30717
Apr 14 09:19:17 gpu kernel: [1942072.386001] Node 0 Normal free:66812kB min:64800kB low:81000kB high:97200kB active_anon:28939808kB inactive_anon:1706056kB active_file:812kB inactive_file:788kB unevictable:3184kB isolated(anon):4096kB isolated(file):0kB present:31981568kB managed:31454928kB mlocked:3184kB dirty:0kB writeback:48kB mapped:29016972kB shmem:30649928kB slab_reclaimable:116540kB slab_unreclaimable:183936kB kernel_stack:5712kB pagetables:82360kB unstable:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:184394568 all_unreclaimable? yes
Apr 14 09:19:17 gpu kernel: [1942072.386005] lowmem_reserve[]: 0 0 0 0 0
Apr 14 09:19:17 gpu kernel: [1942072.386008] Node 0 DMA: 1*4kB (U) 0*8kB 1*16kB (U) 2*32kB (U) 3*64kB (U) 0*128kB 1*256kB (U) 0*512kB 1*1024kB (U) 1*2048kB (M) 3*4096kB (M) = 15892kB
Apr 14 09:19:17 gpu kernel: [1942072.386016] Node 0 DMA32: 242*4kB (UME) 87*8kB (ME) 67*16kB (ME) 42*32kB (ME) 27*64kB (UME) 14*128kB (E) 25*256kB (UME) 16*512kB (UE) 7*1024kB (UE) 1*2048kB (M) 23*4096kB (M) = 125616kB
Apr 14 09:19:17 gpu kernel: [1942072.386025] Node 0 Normal: 815*4kB (MEH) 250*8kB (MEH) 265*16kB (UMEH) 249*32kB (UMEH) 89*64kB (MEH) 63*128kB (UEH) 39*256kB (UEH) 26*512kB (UMEH) 10*1024kB (UEH) 1*2048kB (U) 0*4096kB = 66812kB
Apr 14 09:19:17 gpu kernel: [1942072.386035] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=1048576kB
Apr 14 09:19:17 gpu kernel: [1942072.386036] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
Apr 14 09:19:17 gpu kernel: [1942072.386037] 7971282 total pagecache pages
Apr 14 09:19:17 gpu kernel: [1942072.386039] 32 pages in swap cache
Apr 14 09:19:17 gpu kernel: [1942072.386040] Swap cache stats: add 493338025, delete 493337993, find 204125073/346446887
Apr 14 09:19:17 gpu kernel: [1942072.386041] Free swap  = 28418196kB
Apr 14 09:19:17 gpu kernel: [1942072.386042] Total swap = 33463292kB
Apr 14 09:19:17 gpu kernel: [1942072.386043] 8365811 pages RAM
Apr 14 09:19:17 gpu kernel: [1942072.386043] 0 pages HighMem/MovableOnly
Apr 14 09:19:17 gpu kernel: [1942072.386044] 151839 pages reserved
Apr 14 09:19:17 gpu kernel: [1942072.386045] 0 pages cma reserved
Apr 14 09:19:17 gpu kernel: [1942072.386046] 0 pages hwpoisoned
ljhzxc commented 5 years ago

python3 process is killed because of OOM.

Kernel OOM? GPU OOM? There's no mention of the GPU or its memory, nor of the system's memory. I don't see how we can help without more details, not to mention the lack of any log as requested by @kdavis-mozilla

It should be GPU OOM. The GPU usage at the time the model export failed that day is shown in the attached image.

lissyx commented 5 years ago

Please avoid images for console output

lissyx commented 5 years ago

This is not GPU going OOM, this is your system.

mychiux413 commented 5 years ago

Hi, I got the same issue with a Chinese model, and my English dataset (~100GB) is almost twice the size of the Chinese one, yet the English model never crashes during training.

Here is what I have observed:

  1. DeepSpeech version: v0.4.1
  2. I constrain the max audio length to under 20 seconds
  3. the size of my Chinese alphabet is ~7000, which includes English characters & Chinese characters (UTF-8)
  4. my UTF-8 alphabet has rare characters (4-byte characters)
  5. in the English model, CPU RAM usage stays stable after preprocessing train/dev (cached features), but I noticed that in the Chinese version the CPU RAM usage keeps growing after a few train/dev epochs, and I can't find any pattern
  6. it didn't only crash on testing. I have tried different amounts of CPU RAM & batch sizes: a larger batch size survives a few train/dev epochs but dies sooner than a small batch size with the same CPU RAM; more CPU RAM survives more train/dev epochs but also dies in the end with the same batch size. The longer I train, the more likely it fails.
  7. I use 128G swap, which works well with the large English dataset.

*** LOG1 - transfer learning ***

[Running Command] <<<< ['/root/miniconda3/envs/py36/bin/python', '-u', '/DeepSpeech/DeepSpeech.py', '--alphabet_config_path', '/config/data/alphabet.txt', '--checkpoint_dir', '/log/checkpoint', '--checkpoint_secs', '7200', '--dev_batch_size', '8', '--dev_files', '/config/data/dev.csv', '--dropout_rate', '0.2', '--epoch', '-30', '--estop_mean_thresh', '0.3', '--estop_std_thresh', '0.3', '--export_dir', '/export', '--learning_rate', '0.0001', '--limit_test', '64', '--lm_alpha', '0.8006789096436977', '--lm_beta', '1.85', '--lm_binary_path', '/export/lm.binary', '--lm_trie_path', '/export/trie', '--max_to_keep', '1', '--n_hidden', '1280', '--n_steps', '128', '--report_count', '64', '--test_batch_size', '8', '--test_files', '/config/data/dev.csv', '--train_batch_size', '8', '--train_files', '/config/data/train.csv', '--validation_step', '1', '--summary_dir', '/log/summary']

logging at <_io.TextIOWrapper name='/log/deepspeech.txt' mode='a' encoding='UTF-8'>, <_io.TextIOWrapper name='/log/deepspeech-err.txt' mode='a' encoding='UTF-8'>
Preprocessing ['/config/data/train.csv']
Preprocessing done
Preprocessing ['/config/data/dev.csv']
Preprocessing done
I STARTING Optimization
I Training epoch 55...
I Training of Epoch 55 - loss: 23.294167
I Validating epoch 55...
I Validation of Epoch 55 - loss: 18.385086
I Training epoch 56...
.
.
.
I Validating epoch 69...
I Validation of Epoch 69 - loss: 10.784154
I Early stop triggered as (for last 4 steps) validation loss: 10.784154 with standard deviation: 0.303692 and mean: 10.175885
I FINISHED Optimization - training time: 10:17:31
Preprocessing ['/config/data/dev.csv']
Preprocessing done
Computing acoustic model predictions...
** CRASHED HERE even when testing the same csv as dev.csv ***

*** LOG2 - new model ***

[Running Command] <<<< ['/root/miniconda3/envs/py36/bin/python', '-u', '/DeepSpeech/DeepSpeech.py', '--alphabet_config_path', '/config/data/alphabet.txt', '--checkpoint_dir', '/log/checkpoint', '--checkpoint_secs', '7200', '--dev_batch_size', '24', '--dev_cached_features_path', '/log/dev.cached', '--dev_files', '/speech-data/zh-cn/dev_all.csv,/config/data/dev.csv', '--dropout_rate', '0.2', '--epoch', '-30', '--estop_mean_thresh', '0.3', '--estop_std_thresh', '0.3', '--export_dir', '/export', '--learning_rate', '0.0001', '--limit_test', '64', '--lm_alpha', '1.0', '--lm_beta', '1.85', '--lm_binary_path', '/export/lm.binary', '--lm_trie_path', '/export/trie', '--max_to_keep', '1', '--n_hidden', '1536', '--n_steps', '128', '--report_count', '64', '--show_progressbar', 'False', '--test_batch_size', '24', '--test_cached_features_path', '/log/test.cached', '--test_files', '/speech-data/zh-cn/test_all.csv', '--train_batch_size', '24', '--train_cached_features_path', '/log/train.cached', '--train_files', '/config/data/train.csv', '--validation_step', '2']

logging at <_io.TextIOWrapper name='/log/deepspeech.txt' mode='a' encoding='UTF-8'>, <_io.TextIOWrapper name='/log/deepspeech-err.txt' mode='a' encoding='UTF-8'>
Preprocessing ['/config/data/train.csv']
Saving to /log/train.cached
Preprocessing done
Preprocessing ['/speech-data/zh-cn/dev_all.csv', '/config/data/dev.csv']
Saving to /log/dev.cached
Preprocessing done
I STARTING Optimization
I Training epoch 0...
I Training of Epoch 0 - loss: 114.681121
I Training epoch 1...
I Training of Epoch 1 - loss: 112.573020
I Training epoch 2...
I Training of Epoch 2 - loss: 111.198199
I Validating epoch 2...
I Validation of Epoch 2 - loss: 132.356261
I Training epoch 3...
I Training of Epoch 3 - loss: 111.352473
I Training epoch 4...
I Training of Epoch 4 - loss: 110.923097
I Validating epoch 4...
I Validation of Epoch 4 - loss: 134.511714
I Training epoch 5...
I Training of Epoch 5 - loss: 110.544768
I Training epoch 6...
**** CRASHED HERE

lissyx commented 5 years ago

I got the same issue with a Chinese model, and my English dataset (~100GB) is almost twice the size of the Chinese one, yet the English model never crashes during training.

Please document what happens precisely. Not just "crashed here", actual error ...

mychiux413 commented 5 years ago

I got the same issue with a Chinese model, and my English dataset (~100GB) is almost twice the size of the Chinese one, yet the English model never crashes during training.

Please document what happens precisely. Not just "crashed here", actual error ...

Sorry, the "crashed here" means there is nothing more to print, so I couldn't find more logs to show, the pid seems to be killed by sys and subprocess exit code is -9. google says it OOM error, I didn't keep tracking the RAM Usage, But I remember when epoch at 2 or 3, the free space of CPU RAM (include swap) still have 50% more. I will try to record the RAM v.s. time table and report here.

My training environment is nvidia-docker with CUDA 9.0, cuDNN 7 and Ubuntu 16.04, and I execute DeepSpeech.py via Python's subprocess.run.
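
For reference, a hypothetical periodic memory logger (not the script actually used in this thread), compatible with Python 3.6, that samples free -h every 5 minutes:

    # Hypothetical memory logger: append a timestamped snapshot of free -h
    # to a log file every 5 minutes while training runs.
    import subprocess
    import time

    LOG_PATH = "/log/memory.txt"   # assumed location

    while True:
        snapshot = subprocess.check_output(["free", "-h"], universal_newlines=True)
        with open(LOG_PATH, "a", encoding="utf-8") as f:
            f.write("[Memory] " + time.strftime("%m/%d %H:%M") + "\n" + snapshot + "\n")
        time.sleep(300)   # 5-minute interval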

mychiux413 commented 5 years ago

Hi everyone, here is my training log from today. I inserted a memory log every 5 minutes during training and removed some redundant info.

Here are a few summaries:

  1. env: DeepSpeech v0.4.1 / Chinese model with an alphabet of ~7000 characters / nvidia-docker 16.04-cuda9.0-cudnn7
  2. memory usage increased by ~0.1G per 5 minutes during training
  3. the process would be killed once memory usage increased to ~60G
  4. my commit limit (from grep -i commit /proc/meminfo) was: CommitLimit: 157183852 kB (~149.9G), Committed_AS: 11557960 kB (~11.0G)
  5. At 04/25 04:30 (UTC+0), I set sysctl to try to avoid the OOM killer with sudo sysctl -w vm.overcommit_memory=1, but at 04/25 06:20 (UTC+0), when memory usage had increased to ~60G, the system killed my process again. Should I run more commands to refresh vm.overcommit_memory?
  6. This is my sysctl after 04/25 04:30 (UTC+0), from sysctl -a:
    vm.nr_hugepages = 0
    vm.nr_hugepages_mempolicy = 0
    vm.nr_overcommit_hugepages = 0
    vm.numa_stat = 1
    vm.numa_zonelist_order = Node
    vm.oom_dump_tasks = 1
    vm.oom_kill_allocating_task = 0
    vm.overcommit_kbytes = 0
    vm.overcommit_memory = 1
    vm.overcommit_ratio = 70
    vm.page-cluster = 3
    vm.panic_on_oom = 0
    vm.percpu_pagelist_fraction = 0
    vm.stat_interval = 1
    sysctl: permission denied on key 'vm.stat_refresh'
    vm.swappiness = 60
    vm.user_reserve_kbytes = 131072
    vm.vfs_cache_pressure = 100
    vm.watermark_scale_factor = 10
    vm.zone_reclaim_mode = 0
  7. My memory spec is:
    * RAM: 32G
    * SWAP: 128G

>>>>>>>> Log Start <<<<<<<<

**** [Running Command] ****
['/root/miniconda3/envs/py36/bin/python', '-u', '/DeepSpeech/DeepSpeech.py', '--alphabet_config_path', '/config/data/alphabet.txt', '--checkpoint_dir', '/log/checkpoint', '--checkpoint_secs', '7200', '--dev_batch_size', '24', '--dev_files', '/speech-data/zh-cn/dev_all.csv,/config/data/dev.csv', '--dropout_rate', '0.2', '--epoch', '-30', '--estop_mean_thresh', '0.3', '--estop_std_thresh', '0.3', '--export_dir', '/export', '--learning_rate', '0.0001', '--limit_test', '64', '--lm_alpha', '1.0', '--lm_beta', '1.85', '--lm_binary_path', '/export/lm.binary', '--lm_trie_path', '/export/trie', '--max_to_keep', '1', '--n_hidden', '1536', '--n_steps', '128', '--report_count', '64', '--test_batch_size', '24', '--test_files', '/speech-data/zh-cn/test_all.csv', '--train_batch_size', '24', '--train_files', '/config/data/train.csv', '--validation_step', '2']

logging at <_io.TextIOWrapper name='/log/deepspeech.txt' mode='a' encoding='UTF-8'>, <_io.TextIOWrapper name='/log/deepspeech-err.txt' mode='a' encoding='UTF-8'>

[Memory] 04/24 06:43
memtype total   used    free    buffers cached
Virtual 31.3G   1.2G    3.5G    0.2G    26.4G
Swap    128.0G  1.7G    126.3G               
Preprocessing ['/config/data/train.csv']

[Memory] 04/24 07:03
memtype total   used    free    buffers cached
Virtual 31.3G   27.7G   0.3G    0.2G    3.1G
Swap    128.0G  1.8G    126.2G
Preprocessing done
Preprocessing ['/speech-data/zh-cn/dev_all.csv', '/config/data/dev.csv']
Preprocessing done
I STARTING Optimization
I Training epoch 5...

[Memory] 04/24 07:13
memtype total   used    free    buffers cached
Virtual 31.3G   29.7G   0.4G    0.0G    1.2G
Swap    128.0G  6.5G    121.5G               

[Memory] 04/24 07:18
memtype total   used    free    buffers cached
Virtual 31.3G   29.7G   0.4G    0.0G    1.2G
Swap    128.0G  6.6G    121.4G               
.
.
.

[Memory] 04/24 09:53
memtype total   used    free    buffers cached
Virtual 31.3G   26.7G   0.4G    0.0G    4.2G
Swap    128.0G  17.3G   110.7G               
I Training of Epoch 5 - loss: 100.668000
I Training epoch 6...            
.
.
.            

[Memory] 04/24 12:28
memtype total   used    free    buffers cached
Virtual 31.3G   26.7G   0.4G    0.0G    4.2G
Swap    128.0G  29.5G   98.5G                

[Memory] 04/24 12:43
memtype total   used    free    buffers cached
Virtual 31.3G   26.6G   0.4G    0.0G    4.2G
Swap    128.0G  30.8G   97.2G                
I Training of Epoch 6 - loss: 75.468284
I Validating epoch 6...
I Validation of Epoch 6 - loss: 54.855226
I Training epoch 7...

[Memory] 04/24 12:48
memtype total   used    free    buffers cached
Virtual 31.3G   26.2G   0.8G    0.0G    4.2G
Swap    128.0G  31.4G   96.6G                

[Memory] 04/24 12:53
memtype total   used    free    buffers cached
Virtual 31.3G   26.5G   0.4G    0.0G    4.3G
Swap    128.0G  31.4G   96.6G                

.
.
.

[Memory] 04/24 14:28
memtype total   used    free    buffers cached
Virtual 31.3G   26.8G   0.3G    0.0G    4.2G
Swap    128.0G  33.0G   95.0G                

>>>>>   Now the process has been killed, so my program re-ran the training <<<<<<
* I lost the syslog here

Preprocessing ['/config/data/train.csv']

[Memory] 04/24 14:43
memtype total   used    free    buffers cached
Virtual 31.3G   8.0G    6.8G    0.2G    16.2G
Swap    128.0G  0.2G    127.8G               

[Memory] 04/24 14:58
memtype total   used    free    buffers cached
Virtual 31.3G   27.6G   0.3G    0.2G    3.2G
Swap    128.0G  0.2G    127.8G               
Preprocessing done
Preprocessing ['/speech-data/zh-cn/dev_all.csv', '/config/data/dev.csv']
Preprocessing done
I STARTING Optimization
I Training epoch 7...

[Memory] 04/24 15:13
memtype total   used    free    buffers cached
Virtual 31.3G   29.8G   0.4G    0.0G    1.1G
Swap    128.0G  4.5G    123.5G   

.
.
.

[Memory] 04/24 17:48
memtype total   used    free    buffers cached
Virtual 31.3G   26.7G   0.4G    0.0G    4.2G
Swap    128.0G  14.6G   113.4G               
I Training of Epoch 7 - loss: 59.782993
I Training epoch 8...

[Memory] 04/24 20:33
memtype total   used    free    buffers cached
Virtual 31.3G   26.7G   0.4G    0.0G    4.2G
Swap    128.0G  27.2G   100.8G  

[Memory] 04/24 20:38
memtype total   used    free    buffers cached
Virtual 31.3G   26.7G   0.4G    0.0G    4.2G
Swap    128.0G  27.7G   100.3G               
I Training of Epoch 8 - loss: 54.081356
I Validating epoch 8...
I Validation of Epoch 8 - loss: 39.263215
I Training epoch 9...

.
.
.

[Memory] 04/24 22:30
memtype total   used    free    buffers cached
Virtual 31.3G   26.9G   0.2G    0.0G    4.2G
Swap    128.0G  30.7G   97.3G    

>>>>>   Now the process has been killed. Below is the syslog filtered on the 'oom' keyword <<<<<<
* my txt log times are UTC+0, and the syslog is UTC+8

Apr 25 06:31:56 yihuachiu-GTX1080 kernel: [594646.252242] gsd-color invoked oom-killer: gfp_mask=0x14200ca(GFP_HIGHUSER_MOVABLE), nodemask=(null), order=0, oom_score_adj=0
Apr 25 06:31:56 yihuachiu-GTX1080 kernel: [594646.252256]  oom_kill_process+0x220/0x440
Apr 25 06:31:56 yihuachiu-GTX1080 kernel: [594646.252346] [ pid ]   uid  tgid total_vm      rss pgtables_bytes swapents oom_score_adj name
Apr 25 06:31:56 yihuachiu-GTX1080 kernel: [594675.027262] python invoked oom-killer: gfp_mask=0x14200ca(GFP_HIGHUSER_MOVABLE), nodemask=(null), order=0, oom_score_adj=0
Apr 25 06:31:56 yihuachiu-GTX1080 kernel: [594675.027276]  oom_kill_process+0x220/0x440
Apr 25 06:31:56 yihuachiu-GTX1080 kernel: [594675.027361] [ pid ]   uid  tgid total_vm      rss pgtables_bytes swapents oom_score_adj name
Apr 25 06:31:57 yihuachiu-GTX1080 kernel: [594677.436088] oom_reaper: reaped process 29918 (python), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
==========================sys log end ================

>>>>>   So my program re-ran the training again <<<<<<

Preprocessing ['/config/data/train.csv']

[Memory] 04/24 22:35
memtype total   used    free    buffers cached
Virtual 31.3G   5.3G    16.8G   0.2G    9.1G
Swap    128.0G  0.5G    127.5G

[Memory] 04/24 22:55
memtype total   used    free    buffers cached
Virtual 31.3G   30.9G   0.2G    0.0G    0.1G
Swap    128.0G  0.9G    127.1G               
Preprocessing done
Preprocessing ['/speech-data/zh-cn/dev_all.csv', '/config/data/dev.csv']
Preprocessing done
I STARTING Optimization
I Training epoch 9...

[Memory] 04/25 01:15
memtype total   used    free    buffers cached
Virtual 31.3G   28.7G   0.2G    0.0G    2.4G
Swap    128.0G  12.3G   115.7G  

.
.
.

[Memory] 04/25 04:30
memtype total   used    free    buffers cached
Virtual 31.3G   26.7G   0.4G    0.0G    4.2G
Swap    128.0G  29.7G   98.3G                
I Training of Epoch 10 - loss: 47.904078
I Validating epoch 10...

[Memory] 04/25 04:35
memtype total   used    free    buffers cached
Virtual 31.3G   26.5G   0.5G    0.0G    4.2G
Swap    128.0G  30.3G   97.7G                
I Validation of Epoch 10 - loss: 33.457280
I Training epoch 11...

.
.
.

[Memory] 04/25 06:20
memtype total   used    free    buffers cached
Virtual 31.3G   26.7G   0.3G    0.0G    4.2G
Swap    128.0G  32.1G   95.9G                

>>>>>   Now the process has been killed. Below is the syslog filtered on the 'oom' keyword <<<<<<
* my txt log times are UTC+0, and the syslog is UTC+8

Apr 25 14:22:31 yihuachiu-GTX1080 kernel: [622905.852000] python invoked oom-killer: gfp_mask=0x14200ca(GFP_HIGHUSER_MOVABLE), nodemask=(null), order=0, oom_score_adj=0
Apr 25 14:22:31 yihuachiu-GTX1080 kernel: [622905.852016]  oom_kill_process+0x220/0x440
Apr 25 14:22:31 yihuachiu-GTX1080 kernel: [622905.852101] [ pid ]   uid  tgid total_vm      rss pgtables_bytes swapents oom_score_adj name
Apr 25 14:22:31 yihuachiu-GTX1080 kernel: [622910.187847] python invoked oom-killer: gfp_mask=0x14200ca(GFP_HIGHUSER_MOVABLE), nodemask=(null), order=0, oom_score_adj=0
Apr 25 14:22:31 yihuachiu-GTX1080 kernel: [622910.187864]  oom_kill_process+0x220/0x440
Apr 25 14:22:31 yihuachiu-GTX1080 kernel: [622910.187951] [ pid ]   uid  tgid total_vm      rss pgtables_bytes swapents oom_score_adj name
Apr 25 14:22:33 yihuachiu-GTX1080 kernel: [622912.355795] oom_reaper: reaped process 31842 (python), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
==========================sys log end ================

Preprocessing ['/config/data/train.csv']

[Memory] 04/25 06:45
memtype total   used    free    buffers cached
Virtual 31.3G   30.6G   0.3G    0.1G    0.3G
Swap    128.0G  1.4G    126.6G               
Preprocessing done
Preprocessing ['/speech-data/zh-cn/dev_all.csv', '/config/data/dev.csv']
Preprocessing done
I STARTING Optimization
I Training epoch 12...

[Memory] 04/25 07:35
memtype total   used    free    buffers cached
Virtual 31.3G   29.7G   0.4G    0.0G    1.2G
Swap    128.0G  6.4G    121.6G               

[Memory] 04/25 08:00
memtype total   used    free    buffers cached
Virtual 31.3G   29.6G   0.4G    0.0G    1.3G
Swap    128.0G  7.6G    120.4G               
lissyx commented 5 years ago

the process would be killed once memory usage increased to ~60G

You said your system had 128GB swap? You have not documented how much RAM you have, nor which processes are using it ...

lissyx commented 5 years ago

@mychiux413 Also please use code formatting for console output, it's very hard to read and follow otherwise.

mychiux413 commented 5 years ago

I've edited the console output you mentioned.

Here are some updates:

  1. Because my training environment is on nvidia-docker, the pid was killed by Docker, not by the system, so sysctl didn't help. I tried the command with the params below, which made the memory usage more stable:
    docker run \
    --memory-swappiness 100 \
    --memory <YOUR RAMSIZE> \
    --memory-swap <YOUR RAM + SWAP SIZE> \
    --oom-kill-disable \
    --runtime=nvidia -it <YOUR DeepSpeech IMAGE> bash
  2. I've also trained in the cloud with Kubernetes, which has no swap, so everything must stay in RAM. The RAM usage still increases over time while training the Chinese model, and even --train_cached_features_path didn't help, so the pod gets removed once the RAM explodes. The English model is still trainable in Kubernetes (its RAM usage was stable even with a larger dataset).
  3. Now I'm trying version v0.5.0-alpha.7. The memory usage seems more stable than v0.4.1: the RAM doesn't increase over time, and --feature_cache works too (see the sketch below). I'm keeping track now...
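
For illustration, a hypothetical launcher in the style described earlier (DeepSpeech.py driven through subprocess.run); the paths and most flags are copied from the commands above, and the assumption that --feature_cache takes a cache path is mine, so verify it against the flag definitions of your DeepSpeech version:

    # Hypothetical sketch: launch DeepSpeech training via subprocess.run, as
    # described earlier in this thread. The --feature_cache value being a cache
    # path is an assumption; check the flag definition for your version.
    import subprocess

    cmd = [
        "python", "-u", "/DeepSpeech/DeepSpeech.py",
        "--alphabet_config_path", "/config/data/alphabet.txt",
        "--checkpoint_dir", "/log/checkpoint",
        "--train_files", "/config/data/train.csv",
        "--dev_files", "/config/data/dev.csv",
        "--test_files", "/speech-data/zh-cn/test_all.csv",
        "--train_batch_size", "24",
        "--dev_batch_size", "24",
        "--test_batch_size", "24",
        "--n_hidden", "1536",
        "--export_dir", "/export",
        "--feature_cache", "/log/feature_cache",   # assumed: path where extracted features are cached
    ]
    result = subprocess.run(cmd)
    # An exit code of -9 means the process received SIGKILL (e.g. from the OOM killer).
    print("DeepSpeech exited with code", result.returncode)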
lissyx commented 5 years ago

Now I'm trying version v0.5.0-alpha.7. The memory usage seems more stable than v0.4.1: the RAM doesn't increase over time, and --feature_cache works too. I'm keeping track now...

Does it hold? Can we close the issue?

mychiux413 commented 5 years ago

Now I'm trying version v0.5.0-alpha.7. The memory usage seems more stable than v0.4.1: the RAM doesn't increase over time, and --feature_cache works too. I'm keeping track now...

Does it hold? Can we close the issue?

Training "Chinese" dataset (~50GB) with v0.5.0-alpha.7 still failed at testing phase. Even I disable the oom-killer and enable the --feature_cache, docker runtime would still stuck in testing phase (w/o OOM Message), only docker stop <container-id> could shut it down. The final solution to me is just skip the testing after early stop triggered, and everything will work fine then.

kdavis-mozilla commented 5 years ago

@mychiux413 @reuben Added a fix for this here. So I'd try with 0.5.0-alpha.9 or later.

mychiux413 commented 5 years ago

@mychiux413 @reuben Added a fix for this here. So I'd try with 0.5.0-alpha.9 or later.

I've tried 0.5.0-alpha.11 with a ~120GB dataset and the testing did work, so this issue can be closed. Thanks for your help.

lock[bot] commented 5 years ago

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.