Hi,
Is the problem still present when using await asyncio.sleep(1) (which is async) instead of time.sleep(1) (which is sync)?
Also, can you provide any memory usage information to get an idea of which objects are present?
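For clarity, a minimal sketch of the sync/async distinction asked about above, inside a hypothetical agent body (the names here are assumptions, not code from the report): time.sleep blocks the whole event loop, while await asyncio.sleep only suspends the current coroutine.

import asyncio
import time

async def handle(stream):
    async for event in stream:
        time.sleep(1)           # sync: blocks the entire event loop for one second
        await asyncio.sleep(1)  # async: suspends only this coroutine; the loop keeps running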
Hi,
The issue persists when using await asyncio.sleep.
I am not sure how to give useful information about the application's memory usage. What information would be useful for you? Can you point me in the direction of a specific library that I should use?
@surculus12 you can try Pympler.
You could also use the tracemalloc module from the standard library. One way would be to create a timed task that displays the top 10 allocations every 10 s or so, for instance.
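A minimal sketch of that idea, assuming a Faust app object (the app id, broker address, and interval below are placeholders, not from this project):

import logging
import tracemalloc

import faust

logger = logging.getLogger(__name__)
app = faust.App('memory-debug', broker='kafka://localhost:9092')  # assumed app definition

tracemalloc.start()

@app.timer(interval=10.0)
async def dump_top_allocations():
    # Log the ten allocation sites currently holding the most memory.
    snapshot = tracemalloc.take_snapshot()
    for stat in snapshot.statistics('lineno')[:10]:
        logger.warning(stat)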
Pympler summary at roughly 0.5 GB of memory usage:
[2019-10-01 19:21:08,862: WARNING]: types | # objects | total size
[2019-10-01 19:21:08,863: WARNING]: ============================================ | =========== | ============
[2019-10-01 19:21:08,863: WARNING]: <class 'bytes | 3188 | 128.67 MB
[2019-10-01 19:21:08,864: WARNING]: <class 'dict | 13518 | 22.86 MB
[2019-10-01 19:21:08,864: WARNING]: <class 'list | 3512 | 17.07 MB
[2019-10-01 19:21:08,864: WARNING]: <class 'str | 43406 | 4.78 MB
[2019-10-01 19:21:08,865: WARNING]: <class 'type | 1977 | 1.99 MB
[2019-10-01 19:21:08,865: WARNING]: <class 'code | 13024 | 1.79 MB
[2019-10-01 19:21:08,866: WARNING]: <class 'bytearray | 22 | 948.88 KB
[2019-10-01 19:21:08,866: WARNING]: <class 'tuple | 9242 | 611.65 KB
[2019-10-01 19:21:08,867: WARNING]: <class 'set | 908 | 480.88 KB
[2019-10-01 19:21:08,867: WARNING]: <class 'abc.ABCMeta | 428 | 457.30 KB
[2019-10-01 19:21:08,868: WARNING]: <class 'weakref | 4903 | 383.05 KB
[2019-10-01 19:21:08,868: WARNING]: <class 'faust.types.tuples.ConsumerMessage | 1253 | 254.52 KB
[2019-10-01 19:21:08,868: WARNING]: <class 'int | 8437 | 250.09 KB
[2019-10-01 19:21:08,869: WARNING]: <class 'frozenset | 535 | 213.53 KB
[2019-10-01 19:21:08,869: WARNING]: <class 'builtin_function_or_method | 2706 | 190.27 KB
tracemalloc output at one point:
[2019-10-01 19:29:23,587: WARNING]: /usr/local/lib/python3.7/json/decoder.py:353: size=42.4 MiB, count=475367, average=93 B
[2019-10-01 19:29:23,593: WARNING]: /usr/local/lib/python3.7/site-packages/aiokafka/record/default_records.py:229: size=12.8 MiB, count=230, average=57.0 KiB
[2019-10-01 19:29:23,599: WARNING]: /usr/local/lib/python3.7/asyncio/streams.py:682: size=4096 KiB, count=3, average=1365 KiB
[2019-10-01 19:29:23,605: WARNING]: /usr/local/lib/python3.7/site-packages/kafka/protocol/types.py:113: size=4096 KiB, count=1, average=4096 KiB
[2019-10-01 19:29:23,610: WARNING]: <frozen importlib._bootstrap>:219: size=1760 KiB, count=27049, average=67 B
[2019-10-01 19:29:23,616: WARNING]: <frozen importlib._bootstrap_external>:525: size=855 KiB, count=9513, average=92 B
[2019-10-01 19:29:23,622: WARNING]: /usr/local/lib/python3.7/abc.py:143: size=304 KiB, count=2997, average=104 B
[2019-10-01 19:29:23,627: WARNING]: /usr/local/lib/python3.7/abc.py:126: size=259 KiB, count=1071, average=247 B
[2019-10-01 19:29:23,628: WARNING]: /usr/local/lib/python3.7/linecache.py:137: size=166 KiB, count=1680, average=101 B
[2019-10-01 19:29:23,637: WARNING]: /usr/local/lib/python3.7/site-packages/idna-2.8-py3.7.egg/idna/uts46data.py:8204: size=152 KiB, count=80, average=1948 B
Then a bit later
[2019-10-01 19:30:47,241: WARNING]: /usr/local/lib/python3.7/json/decoder.py:353: size=112 MiB, count=1261402, average=93 B
[2019-10-01 19:30:47,242: WARNING]: /usr/local/lib/python3.7/site-packages/aiokafka/record/default_records.py:229: size=30.7 MiB, count=552, average=57.0 KiB
[2019-10-01 19:30:47,242: WARNING]: /usr/local/lib/python3.7/asyncio/streams.py:682: size=4096 KiB, count=3, average=1365 KiB
[2019-10-01 19:30:47,242: WARNING]: /usr/local/lib/python3.7/site-packages/kafka/protocol/types.py:113: size=4096 KiB, count=1, average=4096 KiB
[2019-10-01 19:30:47,243: WARNING]: <frozen importlib._bootstrap>:219: size=1760 KiB, count=27049, average=67 B
[2019-10-01 19:30:47,243: WARNING]: <frozen importlib._bootstrap_external>:525: size=855 KiB, count=9511, average=92 B
[2019-10-01 19:30:47,243: WARNING]: /usr/local/lib/python3.7/abc.py:143: size=304 KiB, count=2995, average=104 B
[2019-10-01 19:30:47,244: WARNING]: /usr/local/lib/python3.7/abc.py:126: size=259 KiB, count=1071, average=247 B
[2019-10-01 19:30:47,244: WARNING]: /usr/local/lib/python3.7/linecache.py:137: size=166 KiB, count=1680, average=101 B
[2019-10-01 19:30:47,244: WARNING]: /usr/local/lib/python3.7/site-packages/idna-2.8-py3.7.egg/idna/uts46data.py:8204: size=152 KiB, count=80, average=1948 B
The problem seems to be the messages from Kafka. Any reason they wouldn't be caught by garbage collection?
I tried changing the serializer to raw and the memory leak is much less noticeable. However, the memory used by that one line in aiokafka is still growing at about 1 MB/10 s, which should line up with the raw message sizes. I let it sit for a while and it's up to 200 MB (for that file-line alone).
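For reference, a sketch of the kind of change meant here, assuming the topic is declared with Faust's value_serializer option (the topic name is a placeholder):

# Previously value_serializer='json'; 'raw' hands the agent the undecoded
# bytes instead of having Faust run the json codec on every message.
topic = app.topic('some-topic', value_serializer='raw')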
I would think that DefaultRecord is never being cleared, as it is the only thing I can see that directly references the value variable here, which would prevent the gc from freeing it. That, or Faust is fetching the value and keeping the reference somewhere. If anyone knows how to track an object in memory (what is referencing it / why the gc isn't clearing it) or can point me in the right direction, that would be great! Calling it a night.
By the way, I've also observed this issue on my local machine (also running master-branch Faust), though via the PyCharm Docker interpreter. I'll try running without Docker tomorrow in case that's somehow the issue.
Given that no one else seems to be hitting this, I will also check whether our Kafka is up to date in case that is the cause.
Issue observed outside of Docker on a Debian Buster VM.
Upgraded Kafka to 2.3.0; still seeing the issue.
gc does not list the objects as unreachable-but-not-freeable (gc.garbage).
Trying to find a way to get more information about the objects and why they aren't being freed. Again, if anyone knows more about that, I'd appreciate being pointed in the right direction.
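One stdlib-only way to dig into that, sketched here as an assumption rather than anything Faust provides: walk the tracked objects with the gc module and print what still refers to an instance of the suspect type (the example type name is taken from the Pympler output above).

import gc

def show_referrers(type_name, limit=10):
    # Find one live instance of the given type and print the objects that
    # still reference it, i.e. why the gc considers it reachable.
    # (The temporary list from gc.get_objects() and this function's frame
    # will also appear among the referrers; ignore those.)
    for obj in gc.get_objects():
        if type(obj).__name__ == type_name:
            for referrer in gc.get_referrers(obj)[:limit]:
                print(type(referrer), repr(referrer)[:120])
            return
    print('no live instance of', type_name, 'found')

# Example: show_referrers('ConsumerMessage')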
Just wondering, do you encounter the same issue with Faust version 1.7.4?
I swear I saw this initially well before the 1.8 release, but no: I've rolled back to 1.7.4 and the bytes objects are getting collected as you'd expect.
I'll set up the actual project with 1.7.4 to verify, but the stripped-down project from above is leak-free...
Everything runs fine as long as I do not use the json serializer, which seems to run rampant with its memory usage.
I am in the middle of rewriting part of the project I initially thought was the source of the memory leak (but was just an inefficient library). After that I will run the project overnight to see if everything is stable with 1.7.4, and then double-check that it's not stable with 1.8.0.
Although I could consistently reproduce the issue with the stripped down code above, and the (rather bloated) tasks I was running previously caused memory leaks, my optimized code seems to be running fine on 1.8.
This seems like an insane fluke I was encountering. Unless it affects someone else I don't see a reason to investigate further.
I'll continue to monitor the memory usage but things are just... fine now...
Thanks for the help!
Same issue here:
The blue line is Faust 1.7.4, the purple one is 1.8.0 (updated from 1.7.4, green line). We are using the json serializer.
I cannot test 1.9 and above since SASL authentication is broken (we use Kafka as a service with Confluent Cloud); Faust breaks at startup.
@gvdmarck Does the issue still persist if you set the topic-wide type to str and then run it through json.loads yourself? It's been a while, but I do not recall being able to consistently reproduce the issue on 1.8.0.
I have been running the serializer myself since closing the issue and have yet to observe another leak. Either the issue still persists in some circumstances with the json serializer, or it was fixed soon after.
You may want to create a separate issue for your auth problems, so that you can upgrade in case switching the serialization doesn't help.
@surculus12 I don't have the hand on the kafka topic, it has to be json. Leak is even worse on 1.10, not sure why this was closed, it's still there.
@gvdmarck What I meant was the serializer for the Topic object that consumes the Kafka topic. I observed a memory leak when I had it as json, but not when it was str:

import json

# app is an existing faust.App instance
topic = app.topic('topic-name', value_serializer='str')

@app.agent(topic)
async def consumer(stream):
    async for result_raw in stream:
        result = json.loads(result_raw)  # Now equivalent to value_serializer='json'
This is no longer a problem for me, and I can use the json serializer as before. I never figured out why it was happening.
I'd recommend attempting to locate what exactly is using the memory. For me it was the json decoder, which is why I tried different ways of using it; for some reason its output was not being collected when Faust was doing the deserialization. You can dump the tracemalloc output on an @app.timer as was suggested above, and you should get more insight into what is going on. In particular, you might see whether it is the exact same thing that was causing my issue; this may be a completely unrelated memory leak.
Did someone find a solution to this issue, other than sending stringified JSON and doing a json.loads again when we receive it? @surculus12, did you do anything special to prevent the memory increase?
I don't remember doing anything special, no.
Hello @popadi, the only working solution we found was to switch from the faust library to faust-streaming. Faust is dead / no longer maintained; faust-streaming is an actively maintained fork. The library change did not impact anything in our applications, just the import, and there have been no more leaks since then.
Hey. I've been using faust-streaming too for some time; I asked here just in case there's a bug that exists in both places. Earlier this morning I started an agent that does nothing besides iterating and printing the event, and somehow the memory kept increasing, even if just by a bit. After thousands of events, it was pretty visible.
When the agent processes an event, the processing function calls run_in_executor at some point, and I think the memory leak can also come from there. I found this open issue: https://bugs.python.org/issue41699
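To make the pattern concrete, here is a minimal sketch of what such a call usually looks like; the helper names are assumptions, not code from this project. One thing worth trying while chasing the leak is passing one long-lived explicit executor instead of relying on the loop's default:

import asyncio
from concurrent.futures import ThreadPoolExecutor

def blocking_work(payload):
    # Hypothetical stand-in for the real blocking processing step.
    return len(payload)

# Reused across calls instead of the loop's default executor.
executor = ThreadPoolExecutor(max_workers=4)

async def process(payload):
    loop = asyncio.get_running_loop()
    return await loop.run_in_executor(executor, blocking_work, payload)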
I'll try to debug some more, although I feel like I'm losing my mind.
Checklist
I have verified that the issue persists when using the master branch of Faust.
Steps to reproduce
Dockerfile with:
A directory faust_bug containing only the file faust_app.py with:
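The original Dockerfile and faust_app.py contents are not preserved in this copy of the issue; the sketch below is only a hypothetical reconstruction based on details mentioned in the thread (json value serializer, an agent that iterates the stream and sleeps), not the reporter's actual code.

# faust_app.py -- hypothetical minimal reproduction, not the original file
import asyncio
import faust

app = faust.App('faust-bug', broker='kafka://kafka:9092')
topic = app.topic('some-topic', value_serializer='json')

@app.agent(topic)
async def consume(stream):
    async for event in stream:
        # The thread discusses both time.sleep(1) and await asyncio.sleep(1) here.
        await asyncio.sleep(1)

if __name__ == '__main__':
    app.main()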
The image is then deployed as four containers (one for each of the four Kafka partitions) inside a Kubernetes cluster via boilerplate garden.io configurations.
Expected behavior
The container consumes from the topic without leaking memory.
Actual behavior
The container leaks memory, crashing after 2-3 minutes when it hits the 1 GB limit.
Full traceback
No error, but here are debug logs for a container at ~600 MB: https://gist.github.com/surculus12/a8386cfeac54596ce3622a5a898f6e96
Versions