komuw / wiji

Wiji is an asyncio distributed task processor/queue.
MIT License
4 stars 1 forks source link

container exit with code 137(oom): memory leak #71

Closed komuw closed 5 years ago

komuw commented 5 years ago

journalctl -k | grep -i -e memory -e oom

May 18 15:20:25 ubuntu kernel: Freeing unused kernel image memory: 1868K
May 18 21:29:35 wijisqs kernel: apport invoked oom-killer: gfp_mask=0x14200ca(GFP_HIGHUSER_MOVABLE), nodemask=(null), order=0, oom_score_adj=0
May 18 21:29:35 wijisqs kernel:  oom_kill_process+0x220/0x440
May 18 21:29:35 wijisqs kernel:  out_of_memory+0x2d1/0x4f0
May 18 21:29:36 wijisqs kernel: [ pid ]   uid  tgid total_vm      rss pgtables_bytes swapents oom_score_adj name
May 18 21:29:36 wijisqs kernel: Out of memory: Kill process 6223 (wiji-cli) score 552 or sacrifice child
May 18 21:29:36 wijisqs kernel: oom_reaper: reaped process 6223 (wiji-cli), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB

https://success.docker.com/article/what-causes-a-container-to-exit-with-code-137

komuw commented 5 years ago

this occured on one occasion while running the benchmarks: https://github.com/komuw/wiji-benchmarks

komuw commented 5 years ago

see; https://github.com/komuw/wijisqs/issues/47

komuw commented 5 years ago

We need to run wiji for a long period (say 24 or 48 hrs) under load and see if we can replicate this.

We also need to do memory profiling to make sure we are not retaining of large objects and thus preventing them been garbage collected.

komuw commented 5 years ago

I just tried to do some memory profiling. It is complicated to do this in Python-land, I just wish I was using Golang's pprof tool

komuw commented 5 years ago

the oom occured again this time using the redis broker; https://github.com/komuw/wiji-benchmarks/blob/master/benchmarks/redis_broker.py

komuw commented 5 years ago

I've been running wiji-cli with 6 different task types(IO-bound, mem-bound, cpu-bound) under load for the past 1hr and wiji seems to be running consistently at around 69MB memory use. The memory used doesn't seem to be 'running away'.

wiji-cli-mem

I'll try and run wiji for longer(24hrs/48hrs) and see how the memory profile looks like.

komuw commented 5 years ago

here's the profile after 15hrs of running.

wiji-cli-mem2

6hrs after starting to run, memory grows continuously . It started off at 40MB and then at the top end it is at 500MB

komuw commented 5 years ago

using https://github.com/komuw/wiji/compare/issues/71?expand=1 for memory profiling

Top 15 lines
wiji_cli           | total_allocated_size: 3373.6KiB
wiji_cli           | index:#1: file:/usr/local/lib/python3.7/site-packages/cryptography/hazmat/bindings/openssl/binding.py:92 stat_size:407.8KiB stat_count:2563
wiji_cli           |     offending_line: setattr(conditional_lib, attr, getattr(lib, attr))
wiji_cli           | index:#2: file:/usr/local/lib/python3.7/site-packages/redis/_compat.py:58 stat_size:384.2KiB stat_count:6
wiji_cli           |     offending_line: return sock.recv(*args, **kwargs)
wiji_cli           | index:#3: file:/usr/local/lib/python3.7/site-packages/wiji/broker.py:111 stat_size:177.3KiB stat_count:1
wiji_cli           |     offending_line: self.store[queue_name].append(item)
wiji_cli           | index:#4: file:/usr/local/lib/python3.7/site-packages/idna/uts46data.py:8204 stat_size:152.2KiB stat_count:80
wiji_cli           |     offending_line: + _seg_78()
wiji_cli           | index:#5: file:/usr/local/lib/python3.7/inspect.py:2217 stat_size:96.2KiB stat_count:162
wiji_cli           |     offending_line: sigcls=sigcls)
wiji_cli           | index:#6: file:/usr/local/lib/python3.7/site-packages/cryptography/hazmat/bindings/openssl/binding.py:90 stat_size:94.7KiB stat_count:1394
wiji_cli           |     offending_line: for attr in dir(lib):
wiji_cli           | index:#7: file:/usr/local/lib/python3.7/abc.py:126 stat_size:75.2KiB stat_count:388
wiji_cli           |     offending_line: cls = super().__new__(mcls, name, bases, namespace, **kwargs)
wiji_cli           | index:#8: file:/usr/local/lib/python3.7/inspect.py:2800 stat_size:44.6KiB stat_count:272
wiji_cli           |     offending_line: for param in parameters))
wiji_cli           | index:#9: file:/usr/local/lib/python3.7/site-packages/redis/client.py:1567 stat_size:40.6KiB stat_count:94
wiji_cli           |     offending_line: return self.execute_command('BRPOP', *keys)
wiji_cli           | index:#10: file:/usr/local/lib/python3.7/concurrent/futures/thread.py:57 stat_size:36.9KiB stat_count:206
wiji_cli           |     offending_line: result = self.fn(*self.args, **self.kwargs)
wiji_cli           | index:#11: file:/usr/local/lib/python3.7/abc.py:127 stat_size:35.9KiB stat_count:257
wiji_cli           |     offending_line: _abc_init(cls)
wiji_cli           | index:#12: file:/usr/local/lib/python3.7/abc.py:143 stat_size:31.9KiB stat_count:303
wiji_cli           |     offending_line: return _abc_subclasscheck(cls, subclass)
wiji_cli           | index:#13: file:/usr/local/lib/python3.7/abc.py:135 stat_size:29.6KiB stat_count:245
wiji_cli           |     offending_line: return _abc_register(cls, subclass)
wiji_cli           | index:#14: file:/usr/local/lib/python3.7/site-packages/cryptography/utils.py:41 stat_size:27.4KiB stat_count:309
wiji_cli           |     offending_line: return property(lambda self: getattr(self, name))
wiji_cli           | index:#15: file:/usr/local/lib/python3.7/inspect.py:2859 stat_size:22.1KiB stat_count:346
wiji_cli           |     offending_line: params = tuple(param for param in self.parameters.values()
wiji_cli           | num_other_offenders:3107 combined_stat_size:1716.9KiB

I'll let the profile run overnight

komuw commented 5 years ago

I re-ran wiji overnight using the patch in https://github.com/komuw/wiji/compare/issues/71?expand=1;

mem3

and tracemalloc:

Top 15 lines
total_allocated_size: 2705.1KiB
index:#1: file:/usr/local/lib/python3.7/site-packages/cryptography/hazmat/bindings/openssl/binding.py:92 stat_size:407.8KiB stat_count:2563
     offending_line: setattr(conditional_lib, attr, getattr(lib, attr))
index:#2: file:/usr/local/lib/python3.7/site-packages/idna/uts46data.py:8204 stat_size:152.2KiB stat_count:80
     offending_line: + _seg_78()
index:#3: file:/usr/local/lib/python3.7/site-packages/wiji/broker.py:111 stat_size:140.0KiB stat_count:1
     offending_line: self.store[queue_name].append(item)
index:#4: file:/usr/local/lib/python3.7/site-packages/cryptography/hazmat/bindings/openssl/binding.py:90 stat_size:94.7KiB stat_count:1394
     offending_line: for attr in dir(lib):
index:#5: file:/usr/local/lib/python3.7/abc.py:126 stat_size:75.2KiB stat_count:388
     offending_line: cls = super().__new__(mcls, name, bases, namespace, **kwargs)
index:#6: file:/usr/local/lib/python3.7/inspect.py:2800 stat_size:44.0KiB stat_count:268
     offending_line: for param in parameters))
index:#7: file:/usr/local/lib/python3.7/abc.py:127 stat_size:36.1KiB stat_count:261
     offending_line: _abc_init(cls)
index:#8: file:/usr/local/lib/python3.7/abc.py:143 stat_size:32.0KiB stat_count:304
     offending_line: return _abc_subclasscheck(cls, subclass)
index:#9: file:/usr/local/lib/python3.7/abc.py:135 stat_size:29.6KiB stat_count:245
     offending_line: return _abc_register(cls, subclass)
index:#10: file:/usr/local/lib/python3.7/site-packages/prometheus_client/metrics.py:183 stat_size:29.0KiB stat_count:421
     offending_line: yield (suffix, dict(series_labels + list(sample_labels.items())), value)
index:#11: file:/usr/local/lib/python3.7/site-packages/cryptography/utils.py:41 stat_size:27.4KiB stat_count:309
     offending_line: return property(lambda self: getattr(self, name))
index:#12: file:/usr/local/lib/python3.7/inspect.py:2859 stat_size:21.5KiB stat_count:339
     offending_line: params = tuple(param for param in self.parameters.values()
index:#13: file:/usr/local/lib/python3.7/enum.py:170 stat_size:20.0KiB stat_count:105
     offending_line: enum_class = super().__new__(metacls, cls, bases, classdict)
index:#14: file:/usr/local/lib/python3.7/inspect.py:2862 stat_size:18.4KiB stat_count:202
     offending_line: kwo_params = {param.name: param for param in self.parameters.values()
index:#15: file:/usr/local/lib/python3.7/copy.py:76 stat_size:18.4KiB stat_count:112
     offending_line: return copier(x)
num_other_offenders:3192 combined_stat_size:1558.8KiB

after running for 16hrs, there doesn't appear to be a leak. Maybe the act of measuring(using tracemalloc) made bug dissapear.

komuw commented 5 years ago

20hours later and we have tell-tale signs of mem-leak;

mem4

and tracemalloc:

Top 15 lines    
total_allocated_size: 3309.4KiB    
index:#1: file:/usr/local/lib/python3.7/site-packages/cryptography/hazmat/bindings/openssl/binding.py:92 stat_size:407.8KiB stat_count:2563    
    offending_line: setattr(conditional_lib, attr, getattr(lib, attr))    
index:#2: file:/usr/local/lib/python3.7/site-packages/wiji/broker.py:111 stat_size:404.9KiB stat_count:1    
    offending_line: self.store[queue_name].append(item)    
index:#3: file:/usr/local/lib/python3.7/site-packages/redis/_compat.py:58 stat_size:256.1KiB stat_count:4    
    offending_line: return sock.recv(*args, **kwargs)    
index:#4: file:/usr/local/lib/python3.7/site-packages/idna/uts46data.py:8204 stat_size:152.2KiB stat_count:80    
    offending_line: + _seg_78()    
index:#5: file:/usr/local/lib/python3.7/site-packages/cryptography/hazmat/bindings/openssl/binding.py:90 stat_size:94.7KiB stat_count:1394    
    offending_line: for attr in dir(lib):    
index:#6: file:/usr/local/lib/python3.7/abc.py:126 stat_size:75.2KiB stat_count:388    
    offending_line: cls = super().__new__(mcls, name, bases, namespace, **kwargs)    
index:#7: file:/usr/local/lib/python3.7/inspect.py:2800 stat_size:44.0KiB stat_count:268    
    offending_line: for param in parameters))    
index:#8: file:/usr/local/lib/python3.7/abc.py:127 stat_size:36.1KiB stat_count:261    
    offending_line: _abc_init(cls)    
index:#9: file:/usr/local/lib/python3.7/site-packages/prometheus_client/metrics.py:183 stat_size:33.3KiB stat_count:484    
    offending_line: yield (suffix, dict(series_labels + list(sample_labels.items())), value)    
index:#10: file:/usr/local/lib/python3.7/site-packages/cli/utils/_producer.py:11 stat_size:32.1KiB stat_count:52    
    offending_line: await task.delay(*args, **kwargs)    
index:#11: file:/usr/local/lib/python3.7/abc.py:143 stat_size:32.0KiB stat_count:304    
    offending_line: return _abc_subclasscheck(cls, subclass)    
index:#12: file:/usr/local/lib/python3.7/site-packages/wiji/worker.py:354 stat_size:31.0KiB stat_count:51    
    offending_line: await self.run_task(*task_args, **task_kwargs)    

from that tracemalloc; https://github.com/komuw/wiji/blob/dbece4bfaf8fb2903674c3801c1837d1bdc6aec1/wiji/broker.py#L109-L111
looks like we are appending to a dict and maybe never popping stuff out of it.

komuw commented 5 years ago

22hrs later and https://github.com/komuw/wiji/blob/dbece4bfaf8fb2903674c3801c1837d1bdc6aec1/wiji/broker.py#L109-L111 has jumped to the top with the most allocations at a size of 455.6KiB

Top 15 lines   
total_allocated_size: 3341.7KiB    
index:#1: file:/usr/local/lib/python3.7/site-packages/wiji/broker.py:111 stat_size:455.6KiB stat_count:1    
    offending_line: self.store[queue_name].append(item)    
index:#2: file:/usr/local/lib/python3.7/site-packages/cryptography/hazmat/bindings/openssl/binding.py:92 stat_size:407.8KiB stat_count:2563    
    offending_line: setattr(conditional_lib, attr, getattr(lib, attr))    
index:#3: file:/usr/local/lib/python3.7/site-packages/redis/_compat.py:58 stat_size:256.1KiB stat_count:4    
    offending_line: return sock.recv(*args, **kwargs)    
index:#4: file:/usr/local/lib/python3.7/site-packages/idna/uts46data.py:8204 stat_size:152.2KiB stat_count:80    
    offending_line: + _seg_78()    
index:#5: file:/usr/local/lib/python3.7/site-packages/cryptography/hazmat/bindings/openssl/binding.py:90 stat_size:94.7KiB stat_count:1394    
    offending_line: for attr in dir(lib):    
index:#6: file:/usr/local/lib/python3.7/abc.py:126 stat_size:75.2KiB stat_count:388    
    offending_line: cls = super().__new__(mcls, name, bases, namespace, **kwargs)    
index:#7: file:/usr/local/lib/python3.7/inspect.py:2800 stat_size:44.0KiB stat_count:268    
    offending_line: for param in parameters))    
index:#8: file:/usr/local/lib/python3.7/abc.py:127 stat_size:36.1KiB stat_count:261    
    offending_line: _abc_init(cls)    
index:#9: file:/usr/local/lib/python3.7/site-packages/wiji/worker.py:354 stat_size:34.1KiB stat_count:56    
    offending_line: await self.run_task(*task_args, **task_kwargs)    
index:#10: file:/usr/local/lib/python3.7/abc.py:143 stat_size:32.0KiB stat_count:304    
    offending_line: return _abc_subclasscheck(cls, subclass)    
index:#11: file:/usr/local/lib/python3.7/site-packages/prometheus_client/metrics.py:143 stat_size:29.9KiB stat_count:473    
    offending_line: labelvalues = tuple(unicode(labelkwargs[l]) for l in self._labelnames)    
index:#12: file:/usr/local/lib/python3.7/abc.py:135 stat_size:29.6KiB stat_count:245    
    offending_line: return _abc_register(cls, subclass)    
index:#13: file:/usr/local/lib/python3.7/site-packages/cli/utils/_producer.py:11 stat_size:27.8KiB stat_count:45    
    offending_line: await task.delay(*args, **kwargs)    
index:#14: file:/usr/local/lib/python3.7/site-packages/cryptography/utils.py:41 stat_size:27.4KiB stat_count:309    
    offending_line: return property(lambda self: getattr(self, name))    
komuw commented 5 years ago

27hrs

mem6
total_allocated_size: 4644.3KiB
index:#1: file:/usr/local/lib/python3.7/site-packages/wiji/broker.py:111 stat_size:1665.5KiB stat_count:1
    offending_line: self.store[queue_name].append(item)
index:#2: file:/usr/local/lib/python3.7/site-packages/cryptography/hazmat/bindings/openssl/binding.py:92 stat_size:407.8KiB stat_count:2563
    offending_line: setattr(conditional_lib, attr, getattr(lib, attr))
index:#3: file:/usr/local/lib/python3.7/site-packages/redis/_compat.py:58 stat_size:320.2KiB stat_count:5
    offending_line: return sock.recv(*args, **kwargs)
index:#4: file:/usr/local/lib/python3.7/site-packages/idna/uts46data.py:8204 stat_size:152.2KiB stat_count:80
    offending_line: + _seg_78()
index:#5: file:/usr/local/lib/python3.7/site-packages/cryptography/hazmat/bindings/openssl/binding.py:90 stat_size:94.7KiB stat_count:1394
    offending_line: for attr in dir(lib):
index:#6: file:/usr/local/lib/python3.7/abc.py:126 stat_size:75.2KiB stat_count:388
    offending_line: cls = super().__new__(mcls, name, bases, namespace, **kwargs)
index:#7: file:/usr/local/lib/python3.7/inspect.py:2800 stat_size:44.0KiB stat_count:268
    offending_line: for param in parameters))
index:#8: file:/usr/local/lib/python3.7/site-packages/prometheus_client/metrics.py:143 stat_size:43.9KiB stat_count:697
    offending_line: labelvalues = tuple(unicode(labelkwargs[l]) for l in self._labelnames)
index:#9: file:/usr/local/lib/python3.7/abc.py:127 stat_size:36.1KiB stat_count:261
    offending_line: _abc_init(cls)
index:#10: file:/usr/local/lib/python3.7/inspect.py:2217 stat_size:35.6KiB stat_count:60
    offending_line: sigcls=sigcls)
index:#11: file:/usr/local/lib/python3.7/site-packages/wiji/worker.py:354 stat_size:32.2KiB stat_count:53
    offending_line: await self.run_task(*task_args, **task_kwargs)
index:#12: file:/usr/local/lib/python3.7/abc.py:143 stat_size:32.0KiB stat_count:304
    offending_line: return _abc_subclasscheck(cls, subclass)
index:#13: file:/usr/local/lib/python3.7/abc.py:135 stat_size:29.6KiB stat_count:245
    offending_line: return _abc_register(cls, subclass)
index:#14: file:/usr/local/lib/python3.7/site-packages/cryptography/utils.py:41 stat_size:27.4KiB stat_count:309
    offending_line: return property(lambda self: getattr(self, name))
index:#15: file:/usr/local/lib/python3.7/site-packages/prometheus_client/metrics.py:183 stat_size:21.9KiB stat_count:319
    offending_line: yield (suffix, dict(series_labels + list(sample_labels.items())), value)
num_other_offenders:3161 combined_stat_size:1626.0KiB
komuw commented 5 years ago

after 36hrs, the container got killed(OOM)

mem7
Top 15 lines
total_allocated_size: 6330.0KiB
index:#1: file:/usr/local/lib/python3.7/site-packages/wiji/broker.py:111 stat_size:3376.8KiB stat_count:1
    offending_line: self.store[queue_name].append(item)
index:#2: file:/usr/local/lib/python3.7/site-packages/cryptography/hazmat/bindings/openssl/binding.py:92 stat_size:407.8KiB stat_count:2563
    offending_line: setattr(conditional_lib, attr, getattr(lib, attr))
index:#3: file:/usr/local/lib/python3.7/site-packages/redis/_compat.py:58 stat_size:320.2KiB stat_count:5
    offending_line: return sock.recv(*args, **kwargs)
index:#4: file:/usr/local/lib/python3.7/site-packages/idna/uts46data.py:8204 stat_size:152.2KiB stat_count:80
    offending_line: + _seg_78()
index:#5: file:/usr/local/lib/python3.7/site-packages/cryptography/hazmat/bindings/openssl/binding.py:90 stat_size:94.7KiB stat_count:1394
    offending_line: for attr in dir(lib):
index:#6: file:/usr/local/lib/python3.7/abc.py:126 stat_size:75.2KiB stat_count:388
    offending_line: cls = super().__new__(mcls, name, bases, namespace, **kwargs)
index:#7: file:/usr/local/lib/python3.7/inspect.py:2800 stat_size:44.0KiB stat_count:268
    offending_line: for param in parameters))
index:#8: file:/usr/local/lib/python3.7/site-packages/prometheus_client/metrics.py:143 stat_size:40.9KiB stat_count:649
    offending_line: labelvalues = tuple(unicode(labelkwargs[l]) for l in self._labelnames)
index:#9: file:/usr/local/lib/python3.7/site-packages/wiji/worker.py:354 stat_size:37.8KiB stat_count:62
    offending_line: await self.run_task(*task_args, **task_kwargs)
index:#10: file:/usr/local/lib/python3.7/abc.py:127 stat_size:36.1KiB stat_count:261
    offending_line: _abc_init(cls)
index:#11: file:/usr/local/lib/python3.7/abc.py:143 stat_size:32.0KiB stat_count:304
    offending_line: return _abc_subclasscheck(cls, subclass)
index:#12: file:/usr/local/lib/python3.7/abc.py:135 stat_size:29.6KiB stat_count:245
    offending_line: return _abc_register(cls, subclass)
index:#13: file:/usr/local/lib/python3.7/inspect.py:2217 stat_size:28.5KiB stat_count:48
    offending_line: sigcls=sigcls)
index:#14: file:/usr/local/lib/python3.7/site-packages/cryptography/utils.py:41 stat_size:27.4KiB stat_count:309
    offending_line: return property(lambda self: getattr(self, name))
index:#15: file:/usr/local/lib/python3.7/site-packages/cli/utils/_producer.py:11 stat_size:21.6KiB stat_count:35
    offending_line: await task.delay(*args, **kwargs)
num_other_offenders:3194 combined_stat_size:1605.3KiB
komuw commented 5 years ago

I applied the patch: https://github.com/komuw/wiji/blob/d078763f8a96df14457453aa082d9790af6f67b4/wiji/broker.py#L121 https://github.com/komuw/wiji/blob/d078763f8a96df14457453aa082d9790af6f67b4/wiji/broker.py#L153-L169

found in the PR: https://github.com/komuw/wiji/pull/74

and after running benchmarks for 12hrs;

patch1

wiji is humming along nicely at a near constant 42MB

komuw commented 5 years ago

20hrs later and wiji is still at 42MB

patch2

I think we can consider https://github.com/komuw/wiji/pull/74 to be a good enough fix.