newrelic / newrelic-python-agent

New Relic Python Agent
https://docs.newrelic.com/docs/agents/python-agent
Apache License 2.0

Significant CPU consumption when agent is on #1177

Open · maybephilipp opened this issue 1 month ago

maybephilipp commented 1 month ago

The Python agent makes the Python process consume a lot of CPU while idle.

Description When the Python agent is enabled, after the server starts and the first request and a few queries have run, the Python process starts to eat 40-100% of the CPU. If I disable the agent and restart the server, CPU usage stays at 0-3%. With the agent enabled, the high CPU usage persists while the server is idle (100% sure no requests are coming in).

Stack: Python, Flask, gevent (tried both the plain pywsgi server and gunicorn; the situation is exactly the same), SocketIO (with simple-websocket), Nginx.
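
For reference, a minimal sketch of this kind of setup (module layout and port are illustrative assumptions, not the actual application code; the config path is taken from the nrdiag output below):

from gevent import monkey

monkey.patch_all()  # gevent patching first, as usual for a gevent app

import newrelic.agent

newrelic.agent.initialize("/app/newrelic.ini")  # config path from the nrdiag output below

from flask import Flask
from flask_socketio import SocketIO

app = Flask(__name__)
socketio = SocketIO(app, async_mode="gevent")  # the report mentions simple-websocket for the WebSocket transport

@app.route("/")
def index():
    return "ok"

if __name__ == "__main__":
    # gevent's pywsgi server; per the report, gunicorn with the gevent worker class behaves the same
    socketio.run(app, host="0.0.0.0", port=8000)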

Debugging the app showed that most of the CPU time is spent looping green threads. strace showed a large number of epoll_wait calls when newrelic is enabled.

Expected Behavior No impact or low impact.

Troubleshooting or NR Diag results

NR Diag (nrdiag/linux/nrdiag_arm64 --suites python):

Executing following diagnostic task suites: Python Agent

Check Results
-------------------------------------------------
Info     Base/Env/CollectEnvVars [Gathered Environment variables of current shell.]
Success  Base/Config/Collect
Success  Base/Config/Validate
Success  Base/Log/Copy
Success  Base/Config/LicenseKey
Success  Base/Config/ValidateLicenseKey
Info     Base/Config/RegionDetect [1 unique New Relic region(s) detected from config.]
Info     Base/Env/HostInfo [Collected host information]
Success  Base/Env/DetectAWS
Success  Base/Env/RootUser
Success  Base/Log/ReportingTo
Success  Python/Config/Agent
Info     Python/Agent/Version [9.12.0]
Success  Base/Agent/EOL
Success  Base/Config/AppName
Success  Base/Collector/ConnectEU
Info     Base/Config/ValidateHSM [Local High Security Mode setting (false) for confi...]
Success  Python/Env/Dependencies
Success  Python/Env/Version
Success  Python/Requirements/PythonVersion
8 results not shown: 8 None
Local High Security Mode setting (false) for configuration:

    /app/newrelic.ini

To upload results and validate High Security Mode settings, run the Diagnostics CLI with the -a or -api-key flag.

See nrdiag-output.json for full results.

No Issues Found

Creating nrdiag-output.zip

Let me know how I can share the nrdiag-output.zip, as I assume there is some sensitive info in it.

py-spy

(the hub.py entry just keeps growing over time)

  %Own   %Total  OwnTime  TotalTime  Function (filename:line)
100.00% 100.00%   77.81s    77.85s   run (gevent/hub.py:647)
  0.00%   0.00%   0.780s    0.780s   acquire_with_timeout (gevent/_threading.py:36)
  0.00%   0.00%   0.160s    0.160s   load_default_certs (ssl.py:597)
  0.00%   0.00%   0.040s    0.040s   _on_expiration (gevent/timeout.py:242)
  0.00%   0.00%   0.010s    0.010s   send (gevent/_socketcommon.py:725)
  0.00%   0.00%   0.010s    0.010s   __run_task (gevent/threadpool.py:173)
  0.00%   0.00%   0.000s    0.160s   request_encode_url (newrelic/packages/urllib3/request.py:99)
  0.00%   0.00%   0.000s    0.160s   harvest (newrelic/core/application.py:1514)
  0.00%   0.00%   0.000s    0.010s   send (engineio/async_drivers/_websocket_wsgi.py:26)
  0.00%   0.00%   0.000s    0.160s   run (sched.py:151)
  0.00%   0.00%   0.000s    0.160s   _harvest_default (newrelic/core/agent.py:648)
  0.00%   0.00%   0.000s    0.780s   get (gevent/_threading.py:207)
  0.00%   0.00%   0.000s    0.160s   _bootstrap_inner (threading.py:1038)
  0.00%   0.00%   0.000s    0.160s   send_request (newrelic/common/agent_http.py:446)
  0.00%   0.00%   0.000s    0.160s   urlopen (newrelic/packages/urllib3/connectionpool.py:715)
  0.00%   0.00%   0.000s    0.010s   run (gevent/threadpool.py:206)
  0.00%   0.00%   0.000s    0.160s   _validate_conn (newrelic/packages/urllib3/connectionpool.py:1058)

austin-tui

main process:

  OWN    TOTAL    %OWN   %TOTAL  FUNCTION
  35"     35"    100.0%  100.0%  Hub.run (/root/.cache/pypoetry/virtualenvs/fin-back-dr-9TtSrW0h-py3.11/lib/python3.11/site-packages/gevent/hub.py:647)

any other thread:

  OWN    TOTAL    %OWN   %TOTAL  FUNCTION
  00"     05"      0.0%    6.2%  _WorkerGreenlet.run (/root/.cache/pypoetry/virtualenvs/fin-back-dr-9TtSrW0h-py3.11/lib/python3.11/site-packages/gevent/threadpool.py:195)                                                                        │
  00"     05"      0.0%    6.2%  Queue.get (/root/.cache/pypoetry/virtualenvs/fin-back-dr-9TtSrW0h-py3.11/lib/python3.11/site-packages/gevent/_threading.py:207)                                                                                  │
  00"     05"      0.0%    6.2%  _Condition.wait (/root/.cache/pypoetry/virtualenvs/fin-back-dr-9TtSrW0h-py3.11/lib/python3.11/site-packages/gevent/_threading.py:86)                                                                             │
  05"     05"      6.2%    6.2%  acquire_with_timeout (/root/.cache/pypoetry/virtualenvs/fin-back-dr-9TtSrW0h-py3.11/lib/python3.11/site-packages/gevent/_threading.py:36)

cProfile snakeviz

(screenshot of the cProfile results in snakeviz)

The most interesting part: strace

After a lot of investigation I found a Stack Overflow question (https://stackoverflow.com/questions/58973698/uwsgi-with-gevent-takes-up-100-cpu) which led me to the idea of monitoring syscalls. It turned out that something in my Python app is hammering the server with epoll_pwait and getpid calls:

% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 93.62  117.284295      132975       882       582 futex
  2.98    3.730401           3   1183701           epoll_pwait
  2.87    3.601322           3   1183853           getpid
  0.51    0.644368      107394         6         4 restart_syscall
  0.00    0.002443          29        84           sendto
  0.00    0.002401           9       260        15 recvfrom
  0.00    0.001816           7       255        11 newfstatat
  0.00    0.001388          10       133           write
  0.00    0.000871           2       318        16 read
  0.00    0.000658           5       130       108 epoll_ctl
  0.00    0.000632           2       211           lseek
  0.00    0.000591          19        30           mmap
  0.00    0.000463          17        26           munmap
  0.00    0.000427          10        42           close
  0.00    0.000408          18        22         2 openat
  0.00    0.000346          31        11           ppoll
  0.00    0.000307          18        17         5 connect
  0.00    0.000293          29        10           getdents64
  0.00    0.000240          20        12           socket
  0.00    0.000211           5        37           getsockopt
  0.00    0.000205          17        12         6 accept4
  0.00    0.000175          43         4           clone
  0.00    0.000139           5        26         4 ioctl
  0.00    0.000127           5        22           times
  0.00    0.000119           7        15           setsockopt
  0.00    0.000069           4        16           getsockname
  0.00    0.000063           4        15           rt_sigprocmask
  0.00    0.000058           5        10           getrandom
  0.00    0.000046           5         8           gettid
  0.00    0.000020           6         3           madvise
  0.00    0.000012           6         2           fcntl
  0.00    0.000012           3         4           rseq
  0.00    0.000007           3         2           epoll_create1
  0.00    0.000004           1         4           set_robust_list
  0.00    0.000003           0         4           getpeername
  0.00    0.000000           0         2           getuid
  0.00    0.000000           0         2           geteuid
  0.00    0.000000           0         1           bind
  0.00    0.000000           0         2           recvmsg
  0.00    0.000000           0         1           sendmmsg
------ ----------- ----------- --------- --------- ----------------
100.00  125.274940          52   2370195       753 total

Green threads dump:

https://pastebin.com/mVrqs4VY

Done with:

import signal
import gevent.util

def report_greens(*args, **kwargs):
    # dump the stacks of all running threads and greenlets
    gevent.util.print_run_info()

signal.signal(signal.SIGUSR1, report_greens)
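
Sending SIGUSR1 to the worker process (e.g. kill -USR1 <pid>) triggers the dump; that is how the output linked above was produced.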

Steps to Reproduce Will try to create one.

Your Environment

Additional context None.

workato-integration[bot] commented 1 month ago

https://new-relic.atlassian.net/browse/NR-291333

maybephilipp commented 1 month ago

I tried to create a repro repository mimicking what is happening in my actual code and... maybe I'm missing something, but the repro code doesn't trigger the same issue. I'm kinda tired and bleary-eyed already, as I've been investigating this for 3 days...

Could someone from New Relic point me to other ways of debugging the agent? For example, some extra debug logs that could reveal very frequent execution of some agent functionality? My guess is that newrelic's green thread is ticking very frequently... Disabling newrelic on my prod for now... no idea how to solve this.
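
For reference, a sketch of one way to surface more detailed agent activity: raise the agent's own log level before initialize() runs, e.g. via the documented NEW_RELIC_LOG / NEW_RELIC_LOG_LEVEL environment variables (the log path below is just an example; the same settings can also live in newrelic.ini as log_file / log_level):

import os

# bump agent logging before the agent is initialized; log path is an example
os.environ.setdefault("NEW_RELIC_LOG", "/tmp/newrelic-agent.log")  # or "stderr"
os.environ.setdefault("NEW_RELIC_LOG_LEVEL", "debug")

import newrelic.agent

newrelic.agent.initialize("/app/newrelic.ini")  # config path from the nrdiag output above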

ianfitzpatrick commented 1 month ago

Let me know if you'd find a repro useful and I can try to produce one, but I'm also seeing this. It is most pronounced in CPU-constrained environments: I didn't notice it locally, but once I deployed to Google Cloud Run it quickly maxed out the CPU.

Invoking Gunicorn like:

gunicorn --pythonpath /my_django_app --chdir /my_django_app --worker-tmp-dir /dev/shm --worker-class gevent --workers=5 --max-requests=1000 --access-logfile '-' --error-logfile '-' --bind 0.0.0.0:8000 wsgi:application

Starting New Relic in my wsgi.py file by simply doing:

import newrelic.agent
newrelic.agent.initialize()

Hope that helps. A single request will trigger it in the case of Cloud Run.
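
For context, a hypothetical wsgi.py along those lines (the Django settings module name is an assumption, and the agent's auto-instrumentation is expected to pick up the Django WSGI handler once initialize() has run):

import os

os.environ.setdefault("DJANGO_SETTINGS_MODULE", "my_django_app.settings")  # assumed name

import newrelic.agent

newrelic.agent.initialize()  # reads NEW_RELIC_CONFIG_FILE / NEW_RELIC_* env vars

from django.core.wsgi import get_wsgi_application

application = get_wsgi_application()
# or wrap explicitly: application = newrelic.agent.WSGIApplicationWrapper(application)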

I have New Relic turned off for the service that uses the gevent worker model until this is resolved, so it would be great if it gets figured out!

lrafeei commented 1 month ago

Thanks for the information to help with reproducing this! Looking into it.

ianfitzpatrick commented 3 days ago

Hi there -- any luck looking into this? Would love to get back to instrumenting our gevent-based gunicorn processes again.

hmstepanek commented 7 hours ago

Hey, I'm picking this ticket up from Lalleh. That getpid and epoll info is intriguing. I see there is this patch https://github.com/newrelic/newrelic-python-agent/blob/main/newrelic/hooks/coroutines_gevent.py#L17-L29 and I wonder if it could be causing the issue. Could you try running with this branch and see if it fixes the problem?
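
For anyone testing an agent build from a branch, a quick sanity check (a sketch) that the intended build is actually the one being imported at runtime:

from importlib.metadata import version

import newrelic

print(version("newrelic"))  # installed agent version
print(newrelic.__file__)    # where the package was loaded from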