ucbrise / clipper

A low-latency prediction-serving system
http://clipper.ai
Apache License 2.0

[Bug] Memory leak in QueryFrontend when all user requests miss the SLO #549

Open · opened by withsmilo 6 years ago

withsmilo commented 6 years ago

To reproduce this bug, follow the steps below.

Step 1: Start the hello-world sample (cache_size=0, slo_micros=100)

from clipper_admin import ClipperConnection, DockerContainerManager
from clipper_admin.deployers import python as python_deployer

def feature_sum(xs):
    return [str(sum(x)) for x in xs]

clipper_conn = ClipperConnection(DockerContainerManager())
clipper_conn.start_clipper(cache_size=0)  # Disable PredictionCache
clipper_conn.register_application(name="hello-world", input_type="doubles", default_output="-1.0", slo_micros=100)
python_deployer.deploy_python_closure(clipper_conn, name="sum-model", version=1, input_type="doubles", func=feature_sum)
clipper_conn.link_model_to_app(app_name="hello-world", model_name="sum-model")
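
As a quick sanity check before load testing, you can ask the connection for the query frontend address (get_query_addr() is part of clipper_admin's ClipperConnection; adjust if your version differs):

addr = clipper_conn.get_query_addr()
print("Query frontend address: %s" % addr)  # should match the localhost:1337 endpoint used below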

Step 2: Log in to the QueryFrontend container

$ docker exec -it <QueryFrontend_Container_ID> /bin/bash

Step 3: Check the current memory usage (approx. 19 MiB)

root@6ddeb1430e8b:/# cat /sys/fs/cgroup/memory/memory.stat | grep -w rss
rss 20230144
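
To watch this counter while the test runs, a small host-side helper can poll the same file through docker exec. This is only a convenience sketch (the container ID is a placeholder, and it assumes the cgroup v1 path shown above):

import subprocess, time

CONTAINER = "<QueryFrontend_Container_ID>"  # replace with the real container ID

def query_frontend_rss():
    # Read the cgroup rss counter (in bytes) from inside the container.
    out = subprocess.check_output(
        ["docker", "exec", CONTAINER, "cat", "/sys/fs/cgroup/memory/memory.stat"])
    for line in out.decode().splitlines():
        key, value = line.split()
        if key == "rss":
            return int(value)

while True:
    print("rss = %.1f MiB" % (query_frontend_rss() / 1024.0 / 1024.0))
    time.sleep(5)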

Step 4: Send many requests to the QueryFrontend

import requests, json, numpy as np

headers = {"Content-type": "application/json"}
for i in range(10000):
    r = requests.post("http://localhost:1337/hello-world/predict",
                      headers=headers,
                      data=json.dumps({"input": list(np.random.random(10))}))
    print(r.json())

Step 5: Check that every request missed the SLO and returned the default output

.....
{u'use_default': True, u'output': -1.0, u'query_id': 9996, u'default_explanation': u'Failed to retrieve a prediction response within the specified latency SLO'}
{u'use_default': True, u'output': -1.0, u'query_id': 9997, u'default_explanation': u'Failed to retrieve a prediction response within the specified latency SLO'}
{u'use_default': True, u'output': -1.0, u'query_id': 9998, u'default_explanation': u'Failed to retrieve a prediction response within the specified latency SLO'}
{u'use_default': True, u'output': -1.0, u'query_id': 9999, u'default_explanation': u'Failed to retrieve a prediction response within the specified latency SLO'}
>>>
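
For a less manual check, the request loop from Step 4 can be adapted to count how many responses fell back to the default output (same endpoint and payload as above; this is only a convenience sketch):

import requests, json, numpy as np

headers = {"Content-type": "application/json"}
defaults = 0
for i in range(10000):
    r = requests.post("http://localhost:1337/hello-world/predict",
                      headers=headers,
                      data=json.dumps({"input": list(np.random.random(10))}))
    if r.json().get("use_default"):
        defaults += 1
print("%d/10000 requests returned the default output" % defaults)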

Step 6: Check the current memory usage again (approx. 39 MiB)

root@6ddeb1430e8b:/# cat /sys/fs/cgroup/memory/memory.stat | grep -w rss
rss 41529344
withsmilo commented 6 years ago

I think the TaskExecutor's PredictionCache grows continuously because only PredictionCache::insert_entry() is ever called, never PredictionCache::evict_entries().
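
To illustrate the suspected pattern in isolation, here is a toy Python model (not Clipper's actual C++ PredictionCache): if entries are only ever inserted and eviction never runs, the map grows with every query.

# Toy model of an insert-only cache: nothing is ever evicted, so the
# number of entries (and the memory they hold) grows with every query.
class InsertOnlyCache:
    def __init__(self):
        self.entries = {}  # key -> cache entry

    def insert_entry(self, key, value):
        # analogous to PredictionCache::insert_entry()
        self.entries[key] = value

    # evict_entries() is never called anywhere, so nothing is reclaimed.

cache = InsertOnlyCache()
for query_id in range(10000):
    cache.insert_entry(query_id, "pending prediction for query %d" % query_id)
print(len(cache.entries))  # 10000 entries, none of them freed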

simon-mo commented 6 years ago

Entry eviction should be triggered on put when the cache buffer is full. But here the cache size is 0.

https://github.com/ucbrise/clipper/blob/80c97d27a38d60caaebb2a1ae6a995dd7ff1c82d/src/libclipper/src/task_executor.cpp#L78

I’ll investigate it.

withsmilo commented 6 years ago

@simon-mo: Our team has hit this error case in a real environment. I'm continuing to debug it.

  1. A new task is added to the queue.

    [info]  [TASKEXE...] Adding task to queue. QueryID: 32, model: sum-model:1
  2. This code removes the new task, so the batch returned by get_batch() has zero size. https://github.com/ucbrise/clipper/blob/e0b12200ac8b3aeadd8a0ce037e6a96790b4eaab/src/libclipper/include/clipper/task_executor.hpp#L146

  3. So we get this error message:

    [error] [TASKEXE...] ModelQueue returned empty batch for model sum-model:1, replica 1f5fa694c905
  4. PredictionCache::put() is never called because the new task is never executed.

I suggest that when we remove tasks with elapsed deadlines, we also remove the related <key, CacheEntry> item from the PredictionCache (see the sketch below).
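
A rough sketch of that idea in Python pseudocode (not Clipper's C++; the queue layout and the task.cache_key field are assumptions for illustration): whenever a task is dropped for missing its deadline, the cache entry that was created for it should be dropped as well.

import time

def remove_tasks_with_elapsed_deadlines(queue, prediction_cache):
    # queue is assumed to be a list of (deadline, task) pairs, and each task
    # is assumed to remember the key that insert_entry() created for it.
    now = time.time()
    still_pending = []
    for deadline, task in queue:
        if deadline <= now:
            # The task will never be executed, so PredictionCache::put() will
            # never run for it; drop the entry here or it is leaked.
            prediction_cache.pop(task.cache_key, None)
        else:
            still_pending.append((deadline, task))
    queue[:] = still_pending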

withsmilo commented 6 years ago

@simon-mo: This patch is just a workaround.

diff --git a/src/libclipper/include/clipper/task_executor.hpp b/src/libclipper/include/clipper/task_executor.hpp
index 3f4f4710..840363fb 100644
--- a/src/libclipper/include/clipper/task_executor.hpp
+++ b/src/libclipper/include/clipper/task_executor.hpp
@@ -145,10 +145,12 @@ class ModelQueue {
       std::shared_ptr<ModelContainer> requesting_container,
       std::function<BatchSizeInfo(Deadline)> &&get_batch_size) {
     std::unique_lock<std::mutex> lock(queue_mutex_);
-    remove_tasks_with_elapsed_deadlines();
+    // - Clipper issue : https://github.com/ucbrise/clipper/issues/549
+    // remove_tasks_with_elapsed_deadlines();
     queue_not_empty_condition_.wait(
       lock, [this]() { return !queue_.empty() || !valid_; });
-    remove_tasks_with_elapsed_deadlines();
+    // - Clipper issue : https://github.com/ucbrise/clipper/issues/549
+    // remove_tasks_with_elapsed_deadlines();

     std::vector<PredictTask> batch;
     if (requesting_container->is_active() && valid_) {
rkooo567 commented 5 years ago

@withsmilo Is this issue handled in one of your recent PRs?

withsmilo commented 5 years ago

@rkooo567 No! I'm just applying the workaround patch to Clipper for now because this issue is quite complicated. After completing my open PRs, I will dig into this.