Just retested with 1.1.0-rc.0 and it seems like this is fixed.
Reopening again: I didn't realize that after the installation of 1.1.0-rc.0 pilot didn't start due to the resource requests. Memory consumption seems to rise more slowly now, but I easily got to >100MB, and that with a single ServiceEntry!
What's even worse: if I keep deleting/adding the ServiceEntry, the envoy process within the istio-proxy container ultimately gets restarted.
This can be verified by running the following script:
for i in $(seq 1 100); do
  echo "round $i: add serviceentry"
  kubectl apply -f se_test.yml
  sleep 10
  echo "delete serviceentry"
  kubectl -n dep-se-test delete serviceentries.networking.istio.io wmt-subnets
  kubectl -n dep-se-test exec -it testcurl-85c4c6bd47-xcw97 -c istio-proxy -- ps -e -o pid,ppid,rss,comm --sort -rss
  echo ""
done
This happened in 1.1.0-rc.0.
@silentdai could you take a look at this? I'm not sure if it's in the mixer filter or envoy itself. I think it's in envoy.
CC @costinm @PiotrSikora
Could you check and/or upload a config dump from the proxy, before and after? I'm wondering if the add/remove dance results in new entries only being added to the proxy, but never removed:
kubectl exec -it pod -c istio-proxy -- curl -s 127.0.0.1:15000/config_dump > /tmp/config_dump
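For example, a minimal sketch of that comparison (same placeholder pod name, se_test.yml as in the reproduction script above):
kubectl exec -it pod -c istio-proxy -- curl -s 127.0.0.1:15000/config_dump > /tmp/config_dump_before
kubectl apply -f se_test.yml && sleep 10
kubectl -n dep-se-test delete serviceentries.networking.istio.io wmt-subnets && sleep 10
kubectl exec -it pod -c istio-proxy -- curl -s 127.0.0.1:15000/config_dump > /tmp/config_dump_after
diff /tmp/config_dump_before /tmp/config_dump_after | head -50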
Also, Envoy uses tcmalloc, so maybe some of this memory is freed, but not released back to the operating system:
kubectl exec -it pod -c istio-proxy -- curl -s 127.0.0.1:15000/stats | grep server.memory
Thanks!
@robertpanzer Thank you for the detailed report! I will take a look today. @PiotrSikora Thanks for the comment!
My experiment shows that RSS and server.memory usage were increasing but somehow ceased. https://gist.github.com/silentdai/0e887c3c0452a848cc0e788d24bbab73 I don't have a theory so far. Will continue the investigation tomorrow.
This is what I collected 10 hours after the previous comment. https://gist.github.com/silentdai/97f80b3ae47543b668b2c4ea2e0993f7#file-gistfile1-txt-L35-L36
server.memory_allocated: 7097904
server.memory_heap_size: 22020096
Envoy returned the allocated memory to the malloc system (18505040 -> 7097904) while the heap size remained the same (22020096 -> 22020096), as tcmalloc doesn't return memory to the OS.
So far the conclusion is that Envoy doesn't leak on my side. Maybe I didn't fully reproduce the issue.
@robertpanzer how frequent were you applying service entry changes?
@silentdai I deleted and created the ServiceEntry at an interval of maybe 5 seconds and reached >100MB RSS after maybe 10 or 20 creates. This was certainly only to reproduce the bug.
I figured earlier that my services started becoming unresponsive due to the high memory consumption, and this was the smallest and fastest reproducer that I could find. What I want to say is that my approach doesn't directly reflect a realistic deployment history, but the problem also occurred at a slower pace on my local dev machine where I only have 2 or 3 services running, so for real clusters I would expect this to be even worse.
@robertpanzer You mentioned you were using Istio 1.1 snapshot.6. Could you verify that we deployed the same Envoy version with the command below?
$ kubectl exec -it <PODNAME> -c istio-proxy -- curl -s 127.0.0.1:15000/server_info | grep version
"version": "ea5e8afb79fc1864c244d397192e06513b918f49/1.10.0-dev/Clean/RELEASE/BoringSSL"
I find that Envoy doesn't provide a heap-profiling admin entry, while a CPU-profiling method is there.
With the future patch we could do something like the below.
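A hypothetical flow, assuming the patch adds a /heapprofiler admin endpoint mirroring the existing /cpuprofiler handler (the endpoint name and parameters here are my assumption, not a merged API):
# enable heap profiling, exercise the config churn, then disable and inspect
kubectl exec -it <PODNAME> -c istio-proxy -- curl -s -X POST '127.0.0.1:15000/heapprofiler?enable=y'
# ... apply/delete the ServiceEntry a few times ...
kubectl exec -it <PODNAME> -c istio-proxy -- curl -s -X POST '127.0.0.1:15000/heapprofiler?enable=n'
# the dumped profile could then be analyzed with pprof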
@PiotrSikora Is it because no one has attempted to work on it? Is there a substitute heap-profiling method? If not, I can provide a simple patch tomorrow so that we can enable the above flow.
I just retested it with 1.1.0-rc.0 and got this info:
"version": "fc273e117a6ed875c7cb9b3c3251c8d982bc7196/1.10.0-dev/Clean/RELEASE/BoringSSL"
After 21 create and delete cycles the memory consumption of the istio-proxy jumped to >100MB:
/usr/local/bin/pilot-agent 18472
/usr/local/bin/envoy -c /et 106716
Without any further configuration it seems to stay stable there. I am not sure whether it takes the memory request of 128Mi into account.
Good to know Envoy is not deteriorating beyond 128Mi in this case. And now we are on the same page: somehow the peak usage is as large as 128Mi.
BTW could you run "curl -L localhost:15000/stats | grep memory"?
We can learn whether the 106Mi is fully utilized or merely reserved by tcmalloc. It's not critical to me at this moment. However, as the investigation goes on, fighting memory reserved by tcmalloc and locating how the 106Mi is actually allocated would diverge, IMHO.
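For example (same placeholder pod name as above): if server.memory_allocated stays far below server.memory_heap_size, the gap is memory tcmalloc has reserved but not returned to the OS, rather than live allocations:
kubectl exec -it <PODNAME> -c istio-proxy -- curl -s localhost:15000/stats | grep -E 'server.memory_(allocated|heap_size)'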
Dropping to P1 to cover the remaining investigation, since it looks like 1.1 rc0 does not have a memory leak.
I just retried the sequence after a fresh reinstall of 1.1.0-rc.0, and the problem actually looks worse than I thought. The memory footprint again jumps very fast, up to 30MB per re-apply. Plus the istio-proxy gets killed as soon as it hits the requested resources of 128Mi.
2019-02-27T14:43:29.730724Z warn Epoch 0 terminated with an error: signal: killed
2019-02-27T14:43:29.730762Z warn Aborted all epochs
2019-02-27T14:43:29.730796Z info Epoch 0: set retry delay to 400ms, budget to 8
2019-02-27T14:43:30.008196Z info Envoy proxy is NOT ready: failed retrieving Envoy stats: Get http://127.0.0.1:15000/stats?usedonly: dial tcp 127.0.0.1:15000: connect: connection refused
2019-02-27T14:43:30.131888Z info Reconciling retry (budget 8)
2019-02-27T14:43:30.132095Z info Epoch 0 starting
2019-02-27T14:43:30.141430Z info Envoy command: [-c /etc/istio/proxy/envoy-rev0.json --restart-epoch 0 --drain-time-s 45 --parent-shutdown-time-s 60 --service-cluster testcurl.sr-
The curl command gave me this:
server.memory_allocated: 79505608
server.memory_heap_size: 83886080
I also checked that the version of the istio-proxy is correct; it still says:
"version": "fc273e117a6ed875c7cb9b3c3251c8d982bc7196/1.10.0-dev/Clean/RELEASE/BoringSSL",
In my last test I managed to bring the memory size to 90 MB after about 4 or 5 create/delete cycles.
@robertpanzer But is this a leak where memory grows without bound, or does it stabilize? If the latter, are you saying that memory usage is too high?
In my tests today the memory usage didn't exceed 128MB, but it didn't really stabilize either. Instead the envoy process got killed every few create/delete cycles of my ServiceEntry. I wonder what that means for accepting and handling requests? I think this is not a hot restart that could potentially continue in-flight requests, is it?
@silentdai @PiotrSikora I think we need to set this back to P0. We need to understand the relationship between what @robertpanzer is trying to do and Envoy's memory usage. It's possible that we need to recommend an increased memory limit, but there should be a good explanation for why this is necessary.
We're only one step away from merging this PR in upstream Envoy and backporting it to istio/envoy. Once we have that, it should be easy to determine which path eats up the memory.
@silentdai very cool work on adding the heap profiler to envoy.
@robertpanzer my guess is we'll end up resolving this in an early 1.1.x release. We probably will not pick up a new version of Envoy until immediately after 1.1 ships.
EDIT: a service entry port does not always map to a listener; it could map to a route. The progress:
@silentdai I don't think that this nuance is important. I just think that it is not good that the istio-proxy gets killed on config updates. There were no requests processed by the proxy at all in my tests, but if I'm not wrong, in-flight requests would get lost, as this doesn't seem to be a hot restart. Having said that, it would probably be nicer if the process restart were a hot restart, since that could keep in-flight requests alive, but afaict from the logs these look different.
Should be fixed by https://github.com/envoyproxy/envoy/issues/6252 in istio 1.1.0 rc4
Turning the endpoint on and off is not supposed to cause any listener flipping. However, before the fix, unstable marshaling caused the Envoy proxy to wrongly detect a config change in a listener, triggering an expensive listener drain.
The drain can last minutes. When the listener config is pushed frequently, we can expect many draining listeners to be hanging around.
Each of the 4 factors is linear in the heap usage.
So if you are flipping 5 times within a draining window, I would expect a 5 * 2 * 1 MiB * 13 = 130 MiB heap spike. Also, CPU usage is high because spawning listeners is CPU-intensive.
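If this hypothesis is right, the draining should be visible in the admin stats; a spike in draining listeners during repeated applies would confirm it (placeholder pod name):
kubectl exec -it <PODNAME> -c istio-proxy -- curl -s 127.0.0.1:15000/stats | grep -E 'listener_manager.total_listeners_(active|draining)'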
@robertpanzer Please retest with Istio 1.1 rc4. My understanding is that the core fix is in istio-pilot, although you encountered the issue in Envoy. However, I highly suggest you upgrade the whole Istio system.
1.1 rc4 will be created March 12th.
I can still reproduce the restarts of the istio-proxy as long as I delete/create within the drain time window, which is indeed 45 seconds. If I wait for more than 1 minute between the delete and the next create, memory seems to be reused.
What indeed got better with rc.4 is that if I make small updates to the ServiceEntry, only the changed listeners seem to get drained and recreated.
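For reference, this is a sketch of the spacing that avoided the restarts for me (the sleep values are just my choice, comfortably beyond the 45s drain window):
for i in $(seq 1 20); do
  kubectl apply -f se_test.yml
  sleep 70
  kubectl -n dep-se-test delete serviceentries.networking.istio.io wmt-subnets
  sleep 70
done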
The behavior you describe is expected. Since rc4 affects fewer listeners than before, the memory spike should be greatly reduced. Is rc4 acceptable from your perspective?
@silentdai Can you explain though why the Envoy process restarts instead of the pod? In other words, what sort of xDS updates would cause Envoy to restart?
I cannot reproduce an envoy process restart without a pod restart. Maybe "envoy process restart" was actually referring to a pod restart?
My observation is worse: if the envoy process dies, for example killed by killall envoy, the envoy process no longer restarts. The istio-proxy container is alive while the pod is not serving any app traffic. By app traffic I mean the book-review HTTP requests in the demo.
The reason is that pilot-agent is the main process of istio-proxy.
@costinm ^^ What is the expected consequence if envoy dies silently?
I just tested again, and for me only the istio-proxy process dies; the container, and therefore pilot-agent, stays alive. I also checked the pod status, and the container restartCount is still 0.
In the logs of istio-proxy I see this when the istio-proxy is killed:
2019-03-15T09:03:56.220651Z warn Epoch 0 terminated with an error: signal: killed
2019-03-15T09:03:56.220715Z warn Aborted all epochs
2019-03-15T09:03:56.220839Z info Epoch 0: set retry delay to 200ms, budget to 9
2019-03-15T09:03:56.421053Z info Reconciling retry (budget 9)
2019-03-15T09:03:56.421200Z info Epoch 0 starting
2019-03-15T09:03:56.422899Z info Envoy command: [-c /etc/istio/proxy/envoy-rev0.json --restart-epoch 0 --drain-time-s 45 --parent-shutdown-time-s 60 --service-cluster testcurl.sr-test --service-node sidecar~10.1.3.155~testcurl-7c7454f99c-l6hhp.sr-test~sr-test.svc.cluster.local --max-obj-name-len 189 --allow-unknown-fields -l warning --concurrency 2]
For me the restarted envoy is serving traffic again. I guess though that in-flight requests will get lost.
In my console I found these logs that could give a hint why Envoy is killed. (It's Docker for Mac, so probably not representative of a real cluster implementation, but what is a real cluster implementation anyway?)
default 10:35:16.272078 +0100 com.docker.hyperkit [ 6004.061715] envoy invoked oom-killer: gfp_mask=0x24000c0(GFP_KERNEL), nodemask=0, order=0, oom_score_adj=968
default 10:35:16.274248 +0100 com.docker.hyperkit [ 6004.064176] envoy cpuset=224bf68c6d39ea20a1b315ba740fecde15d5433cb79bb6611e37a73598be7d20 mems_allowed=0
default 10:35:16.275401 +0100 com.docker.hyperkit [ 6004.065933] CPU: 2 PID: 28933 Comm: envoy Not tainted 4.9.125-linuxkit #1
default 10:35:16.276221 +0100 com.docker.hyperkit [ 6004.067075] Hardware name: BHYVE, BIOS 1.00 03/14/2014
default 10:35:16.277467 +0100 com.docker.hyperkit [ 6004.067893] 0000000000000000 ffffffffb1420789 ffffa2048bd73de8 ffff942d049fc240
default 10:35:16.278788 +0100 com.docker.hyperkit [ 6004.069137] ffffffffb11f8d5e 0000000000000000 00000000000003c8 ffffa2048bd73de8
default 10:35:16.280023 +0100 com.docker.hyperkit [ 6004.070466] ffff942d090fa2a0 0000000000000202 ffffffffb17e71ce ffff942d049fc240
default 10:35:16.280434 +0100 com.docker.hyperkit [ 6004.071699] Call Trace:
default 10:35:16.281261 +0100 com.docker.hyperkit [ 6004.072109] [<ffffffffb1420789>] ? dump_stack+0x5a/0x6f
default 10:35:16.282109 +0100 com.docker.hyperkit [ 6004.072933] [<ffffffffb11f8d5e>] ? dump_header+0x78/0x1ed
default 10:35:16.283152 +0100 com.docker.hyperkit [ 6004.073780] [<ffffffffb17e71ce>] ? _raw_spin_unlock_irqrestore+0x16/0x18
default 10:35:16.284064 +0100 com.docker.hyperkit [ 6004.074823] [<ffffffffb11a45b5>] ? oom_kill_process+0x83/0x324
default 10:35:16.284948 +0100 com.docker.hyperkit [ 6004.075735] [<ffffffffb11a4bdc>] ? out_of_memory+0x239/0x267
default 10:35:16.285944 +0100 com.docker.hyperkit [ 6004.076619] [<ffffffffb11f22ac>] ? mem_cgroup_out_of_memory+0x4b/0x79
default 10:35:16.286991 +0100 com.docker.hyperkit [ 6004.077615] [<ffffffffb11f646d>] ? mem_cgroup_oom_synchronize+0x26b/0x294
default 10:35:16.288019 +0100 com.docker.hyperkit [ 6004.078662] [<ffffffffb11f25f4>] ? mem_cgroup_is_descendant+0x48/0x48
default 10:35:16.288993 +0100 com.docker.hyperkit [ 6004.079707] [<ffffffffb11a4c37>] ? pagefault_out_of_memory+0x2d/0x6f
default 10:35:16.289877 +0100 com.docker.hyperkit [ 6004.080665] [<ffffffffb10467af>] ? __do_page_fault+0x3c6/0x45f
default 10:35:16.290654 +0100 com.docker.hyperkit [ 6004.081547] [<ffffffffb17e8808>] ? page_fault+0x28/0x30
default 10:35:16.294938 +0100 com.docker.hyperkit [ 6004.082395] Task in /kubepods/kubepods/burstable/pod6e836375-4700-11e9-8430-025000000001/224bf68c6d39ea20a1b315ba740fecde15d5433cb79bb6611e37a73598be7d20 killed as a result of limit of /kubepods/kubepods/burstable/pod6e836375-4700-11e9-8430-025000000001/224bf68c6d39ea20a1b315ba740fecde15d5433cb79bb6611e37a73598be7d20
default 10:35:16.295813 +0100 com.docker.hyperkit [ 6004.086630] memory: usage 131060kB, limit 131072kB, failcnt 0
default 10:35:16.296751 +0100 com.docker.hyperkit [ 6004.087481] memory+swap: usage 131072kB, limit 131072kB, failcnt 132
default 10:35:16.297682 +0100 com.docker.hyperkit [ 6004.088419] kmem: usage 1220kB, limit 9007199254740988kB, failcnt 0
default 10:35:16.302508 +0100 com.docker.hyperkit [ 6004.089350] Memory cgroup stats for /kubepods/kubepods/burstable/pod6e836375-4700-11e9-8430-025000000001/224bf68c6d39ea20a1b315ba740fecde15d5433cb79bb6611e37a73598be7d20: cache:4780KB rss:125060KB rss_huge:0KB mapped_file:4772KB dirty:0KB writeback:0KB swap:12KB inactive_anon:46272KB active_anon:83564KB inactive_file:0KB active_file:0KB unevictable:0KB
default 10:35:16.303741 +0100 com.docker.hyperkit [ 6004.094175] [ pid ] uid tgid total_vm rss nr_ptes nr_pmds swapents oom_score_adj name
default 10:35:16.305198 +0100 com.docker.hyperkit [ 6004.095539] [27056] 1337 27056 8118 4465 20 5 3 968 pilot-agent
default 10:35:16.306448 +0100 com.docker.hyperkit [ 6004.096867] [28933] 1337 28933 60704 34827 89 3 0 968 envoy
default 10:35:16.307737 +0100 com.docker.hyperkit [ 6004.098117] Memory cgroup out of memory: Kill process 28933 (envoy) score 2010 or sacrifice child
default 10:35:16.309314 +0100 com.docker.hyperkit [ 6004.099494] Killed process 28933 (envoy) total-vm:242816kB, anon-rss:119924kB, file-rss:14612kB, shmem-rss:4772kB
My interpretation is that the operating system simply kills the process because its cgroup memory limit is exceeded.
And indeed the configuration seems to support this thesis:
istio-proxy@testcurl-7c7454f99c-l6hhp:/$ cat /sys/fs/cgroup/memory/memory.limit_in_bytes
134217728
istio-proxy@testcurl-7c7454f99c-l6hhp:/$ cat /sys/fs/cgroup/memory/memory.oom_control
oom_kill_disable 0
under_oom 0
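If the limit is the bottleneck, raising the sidecar's memory limit should buy headroom. A sketch assuming the 1.1 helm chart's global.proxy.resources values (verify the knob against your chart version):
helm upgrade istio ~/istio/istio-1.1.0-rc.0/install/kubernetes/helm/istio \
  --reuse-values \
  --set global.proxy.resources.limits.memory=256Mi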
Pilot Agent is probably not killed because killing envoy is sufficient to free enough memory. The OOM killer seems to be selective:
From https://lwn.net/Articles/761118/
To simplify a bit: when the OOM killer is invoked, it tries to pick the process whose demise will free the most memory while causing the least misery for users of the system.
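The kernel's choice can be inspected per process; a quick check from inside the istio-proxy container (comparing the same fields the oom-killer log above prints):
for pid in $(pgrep 'envoy|pilot-agent'); do
  echo "$pid $(cat /proc/$pid/comm) score=$(cat /proc/$pid/oom_score) adj=$(cat /proc/$pid/oom_score_adj)"
done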
I can imagine though that other Kubernetes/Container runtime implementations work differently and result in the behavior observed by @silentdai
@robertpanzer Let's use "envoy" to refer to the process and "istio-proxy" to refer to the container. With your detailed description I can understand it this time :)
So we have established that it's the OOM killer that terminates the envoy process. We won't try to reproduce other logic causing envoy to die.
I was just wondering how the process was killed, and it looked to me like I was not alone with this :-D To find out what happens, I kept doing the repeated create and delete.
I agree that this should not happen too often, and it's not blocking me anymore.
Actionable items here, then close this issue:
Sorry for hijacking this issue, but we actually use dynamic ServiceEntry creation/deletion in production and have been bitten by this:
We could drop this check, or remove the service doing the check from the service mesh, but going forward we planned on using ServiceEntries heavily by providing APIs for whitelisting outbound traffic from these instances. Is this something we should plan to implement differently?
@jschlichtholz Sorry for the late reply. It highly depends on how frequently your collection of service endpoints changes. However, I would always suggest providing a pool whose members never change (or change infrequently) and not adding/removing members based on health-check responses; see the Envoy philosophy here. A sketch of this pattern follows.
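A minimal sketch: keep the ServiceEntry pool static and let the proxy passively eject unhealthy hosts via outlier detection instead of editing the ServiceEntry (hypothetical host name; fields per the v1alpha3 API):
kubectl apply -f - <<EOF
apiVersion: networking.istio.io/v1alpha3
kind: DestinationRule
metadata:
  name: external-pool
spec:
  host: example.internal
  trafficPolicy:
    outlierDetection:
      consecutiveErrors: 5
      interval: 30s
      baseEjectionTime: 60s
EOF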
🚧 This issue or pull request has been closed due to not having had activity in the last 105 days. If you feel this issue or pull request deserves attention, please reopen the issue. Please see this wiki page for more information. Thank you for your contributions.
Created by the issue and PR lifecycle manager.
Describe the bug
I have a minimal setup of Istio 1.1.0-snapshot.6 with a single Deployment. When I repeatedly create and delete the same ServiceEntry, the memory consumption of the istio-proxy next to my service only goes up, reaching >200MB.
Expected behavior
When ServiceEntries are deleted, the memory that was claimed before should be freed again.
Steps to reproduce the bug
Install Istio 1.1.0-snapshot.6 and install the following deployment:
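(The original manifest isn't reproduced in this thread; a minimal hypothetical stand-in, matching the testcurl/sr-test names that appear above, might look like this:)
kubectl create namespace sr-test
kubectl label namespace sr-test istio-injection=enabled
kubectl apply -n sr-test -f - <<EOF
apiVersion: apps/v1
kind: Deployment
metadata:
  name: testcurl
spec:
  replicas: 1
  selector:
    matchLabels:
      app: testcurl
  template:
    metadata:
      labels:
        app: testcurl
    spec:
      containers:
      - name: testcurl
        image: curlimages/curl
        # keep the pod alive so we can exec into it
        command: ["/bin/sh", "-c", "while true; do sleep 3600; done"]
EOF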
Now the memory consumption of the Istio-proxy shows this:
Now create a file with this ServiceEntry:
(I admit that this might not be the smallest ServiceEntry, but I experience the same behavior at a smaller scale with smaller, more conventional, ServiceEntries.)
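(The original manifest isn't reproduced here either; a hypothetical stand-in with the same shape, using placeholder host and subnet values, might look like this:)
kubectl apply -n sr-test -f - <<EOF
apiVersion: networking.istio.io/v1alpha3
kind: ServiceEntry
metadata:
  name: wmt-subnets
spec:
  hosts:
  - example.internal   # placeholder host
  addresses:           # the real manifest listed many more subnets
  - 10.10.0.0/16
  - 10.20.0.0/16
  location: MESH_EXTERNAL
  ports:
  - number: 443
    name: tls
    protocol: TLS
  resolution: NONE
EOF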
Now repeatedly install and remove this ServiceEntry. After some installs the memory consumption of the istio-proxy jumps by ~25MB, and this memory is not returned anymore. I don't send any requests, so the proxy basically has nothing to do from my perspective.
While removing and adding the ServiceEntry I see these memory consumptions:
Even after ~20 minutes the memory consumption is the same.
Version
Kubernetes 1.10.11 on Docker for Mac. Istio 1.1.0-snapshot.6.
Installation
helm install ~/istio/istio-1.1.0-snapshot.6/install/kubernetes/helm/istio-init --name istio-init --namespace istio-system -f ~/istio/istio-1.1.0-snapshot.6/install/kubernetes/helm/istio/values-istio-auth-mcp.yaml --set global.mtls.enabled=true --set global.tracing.enabled=true --set pilot.traceSampling=100.0 --set tracing.enabled=true --set grafana.enabled=true --set global.useMCP=false
helm install ~/istio/istio-1.1.0-snapshot.6/install/kubernetes/helm/istio --name istio --namespace istio-system -f ~/istio/istio-1.1.0-snapshot.6/install/kubernetes/helm/istio/values-istio-auth-mcp.yaml --set global.mtls.enabled=true --set global.tracing.enabled=true --set pilot.traceSampling=100.0 --set tracing.enabled=true --set grafana.enabled=true --set global.useMCP=false
Environment
Running locally on Docker for Mac.
Cluster state
The script for collecting the cluster state fails on the docker-compose container, so I can't create whatever is expected. However, here is what I could gather manually:
clusterinfo.zip
Also the logs of galley, pilot and the proxy:
logs.zip