fix OOM while processing gRPC completion state

Shikugawa commented 3 years ago

The reason for the OOM was that the ProcessingState pointer passed to CompletionState was not released properly. As a result of the fix, the memory pressure is as follows The load test was done as follows for bookinfo deployed in GKE.

before

fortio load -H ${COOKIE} -qps 800 -t 6m -c 4 https://${AUTHORITY_NAME}/productpage

Screenshot from 2021-10-08 18-02-19

after

fortio load -H ${COOKIE} -qps 6000 -t 6m -c 4 https://${AUTHORITY_NAME}/productpage

Screenshot from 2021-10-08 18-02-07

istio-testing commented 3 years ago

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: Shikugawa To complete the pull request process, please assign incfly after the PR has been reviewed. You can assign the PR to them by writing /assign @incfly in a comment when ready.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files: - **[OWNERS](https://github.com/istio-ecosystem/authservice/blob/master/OWNERS)** Approvers can indicate their approval by writing `/approve` in a comment Approvers can cancel approval by writing `/approve cancel` in a comment

incfly commented 3 years ago

I read the gRPC doc for the async C++ impl. https://grpc.io/docs/languages/cpp/async/ And here is the sample server, with their Process implementation

https://github.com/grpc/grpc/blob/v1.41.0/examples/cpp/helloworld/greeter_async_server.cc#L86

You can see official example handle the event as CREATE, PROCESS, FINISH, and only create a new instance of the processing state with new when status is PROCESS branch. Because at that point, the current instance of the state is assigned to handle the current request.

However, in our code, processingstate::process is always invoking parent_.create() to new an instance. This means when the state is either CREATE or PROCESS we will create a new instance of process.

Compared with official e.g, we create more instances.

Therefore I think that's the root cause of the OOM.

Shikugawa commented 3 years ago

You can see official example handle the event as CREATE, PROCESS, FINISH, and only create a new instance of the processing state with new when status is PROCESS branch. Because at that point, the current instance of the state is assigned to handle the current request.

I don't think so. It is because our code and official example has completely the same semantics in state processing. It seems that we have difference of CREATE state. But we have already done CREATE handling on the constructor of ProcessingState.

However, in our code, processingstate::process is always invoking parent_.create() to new an instance. This means when the state is either CREATE or PROCESS we will create a new instance of the process.

In our code, the problem that you said won't occur. Let's consider the allocation sequence of processing 2 requests. The graph attached below is the tree relationship of the processing/completion state of our implementation.

Image from iOS (1)

First, The ProcessingState (1) was created here. In this constructor, it will invoke CheckRequest same as CREATE of hello_greeter_example.
Second, Receive a request. It will create ProcessingState (2) that will be used to handle the next request and CompletionState (1).
Third. CompletionState (1) will release ProcessingState (1) and ProcessingState (V2) and self instance. It seems ProcessingState (2) is dangling. But it will be used on the next request.

This is why our code shouldn't be leaked.

Shikugawa commented 3 years ago

Sorry. The graph above has the wrong point. (1') processing state V2 that created at the initial phase won't be released after completion_state_->Proceed() called. (But It is not related with OOM of course)

incfly commented 3 years ago

While it's still unclear why this PR (having v2 v3 pointer initialized and delete all the v2 v3 pointers) solves the problem, I am able to verify this does fix the OOM issue... Merging for now.

now even after undrestanding the memory allocation deletion flow, main question is still either v2 or v3, completionState does initiates the pointers correctly and the CompleteState::Proceed check if (v2/v3) will ensure that corresponding v2 v3 object will be deleted.

istio-ecosystem / authservice

fix OOM while processing gRPC completion state #167

before

after