withsmilo opened this issue 6 years ago
I think that the TaskExecutor's PredictionCache grows continuously because only PredictionCache::insert_entry() is ever called, never PredictionCache::evict_entries().
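For illustration only, here is a minimal sketch of the difference (this is not Clipper's actual PredictionCache; the `SimpleCache` class, `max_size_bytes_`, and the FIFO eviction order are assumptions made for the example). A cache whose put path never evicts grows without bound no matter what size it was configured with:

```cpp
#include <cstddef>
#include <list>
#include <string>
#include <unordered_map>

// Minimal sketch only, not Clipper's real PredictionCache.
class SimpleCache {
 public:
  explicit SimpleCache(size_t max_size_bytes) : max_size_bytes_(max_size_bytes) {}

  // Behavior described in this issue: entries are inserted but eviction
  // is never triggered, so memory grows without bound.
  void insert_entry_only(const std::string& key, const std::string& value) {
    add(key, value);
  }

  // Intended behavior: after inserting, evict the oldest entries until the
  // cache fits its byte budget again (with a budget of 0, nothing is kept).
  void put(const std::string& key, const std::string& value) {
    add(key, value);
    evict_entries();
  }

 private:
  // Assumes distinct keys, to keep the sketch short.
  void add(const std::string& key, const std::string& value) {
    insertion_order_.push_back(key);
    entries_[key] = value;
    size_bytes_ += value.size();
  }

  void evict_entries() {
    while (size_bytes_ > max_size_bytes_ && !insertion_order_.empty()) {
      const std::string victim = insertion_order_.front();
      size_bytes_ -= entries_[victim].size();
      entries_.erase(victim);
      insertion_order_.pop_front();
    }
  }

  size_t max_size_bytes_;
  size_t size_bytes_ = 0;
  std::list<std::string> insertion_order_;
  std::unordered_map<std::string, std::string> entries_;
};
```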
evict_entries() should be called on put when the cache buffer is full. But here the cache size is 0.
I’ll investigate it.
@simon-mo : Our team has hit this error case in a production environment. Anyway, I'm continuing to debug this error.
A new task is added to the queue:
[info] [TASKEXE...] Adding task to queue. QueryID: 32, model: sum-model:1
This code removes the new task, so the batch returned by get_batch() has zero size:
https://github.com/ucbrise/clipper/blob/e0b12200ac8b3aeadd8a0ce037e6a96790b4eaab/src/libclipper/include/clipper/task_executor.hpp#L146
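Roughly, the call at that line drops every queued task whose deadline has already elapsed before the batch is assembled. The sketch below is a paraphrase, not the exact Clipper source; the `Deadline` alias, the `PredictTask` struct, and the priority-queue type are simplified assumptions:

```cpp
#include <chrono>
#include <queue>
#include <vector>

using Deadline = std::chrono::time_point<std::chrono::system_clock>;

struct PredictTask {
  Deadline deadline_;
  // input, query id, etc. omitted in this sketch
};

struct DeadlineCompare {
  bool operator()(const PredictTask& a, const PredictTask& b) const {
    return a.deadline_ > b.deadline_;  // earliest deadline at the top
  }
};

using TaskQueue =
    std::priority_queue<PredictTask, std::vector<PredictTask>, DeadlineCompare>;

// Paraphrase of the behavior at the linked line: any task whose deadline
// has already elapsed is silently discarded, so with a very small SLO
// (e.g. slo_micros=100) every task can be dropped and get_batch()
// returns an empty batch.
void remove_tasks_with_elapsed_deadlines(TaskQueue& queue) {
  auto now = std::chrono::system_clock::now();
  while (!queue.empty() && queue.top().deadline_ <= now) {
    queue.pop();
  }
}
```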
So we got this error message.
[error] [TASKEXE...] ModelQueue returned empty batch for model sum-model:1, replica 1f5fa694c905
PredictionCache::put() is never called because the new task is never executed.
I suggest that when we remove tasks with elapsed deadlines, we should also remove the related <key, CacheEntry> item from the PredictionCache.
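As a sketch of that suggestion (building on the simplified types from the earlier sketch; the `on_task_dropped` callback, `key_for()`, and `remove_entry()` are hypothetical, not existing Clipper APIs), the queue could notify its owner about every dropped task so the matching cache entry can be erased:

```cpp
#include <chrono>
#include <functional>

// Hypothetical variant of remove_tasks_with_elapsed_deadlines(): for every
// task dropped because its deadline elapsed, a callback is invoked so the
// owner can clean up the matching <key, CacheEntry> item.
void remove_tasks_with_elapsed_deadlines(
    TaskQueue& queue,
    const std::function<void(const PredictTask&)>& on_task_dropped) {
  auto now = std::chrono::system_clock::now();
  while (!queue.empty() && queue.top().deadline_ <= now) {
    on_task_dropped(queue.top());
    queue.pop();
  }
}

// The TaskExecutor could then wire the callback to the cache, for example:
//   remove_tasks_with_elapsed_deadlines(queue_, [&](const PredictTask& t) {
//     cache_.remove_entry(key_for(t));  // key_for and remove_entry are hypothetical
//   });
```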
@simon-mo : This patch is just a workaround.
diff --git a/src/libclipper/include/clipper/task_executor.hpp b/src/libclipper/include/clipper/task_executor.hpp
index 3f4f4710..840363fb 100644
--- a/src/libclipper/include/clipper/task_executor.hpp
+++ b/src/libclipper/include/clipper/task_executor.hpp
@@ -145,10 +145,12 @@ class ModelQueue {
std::shared_ptr<ModelContainer> requesting_container,
std::function<BatchSizeInfo(Deadline)> &&get_batch_size) {
std::unique_lock<std::mutex> lock(queue_mutex_);
- remove_tasks_with_elapsed_deadlines();
+ // - Clipper issue : https://github.com/ucbrise/clipper/issues/549
+ // remove_tasks_with_elapsed_deadlines();
queue_not_empty_condition_.wait(
lock, [this]() { return !queue_.empty() || !valid_; });
- remove_tasks_with_elapsed_deadlines();
+ // - Clipper issue : https://github.com/ucbrise/clipper/issues/549
+ // remove_tasks_with_elapsed_deadlines();
std::vector<PredictTask> batch;
if (requesting_container->is_active() && valid_) {
@withsmilo Is this issue handled in one of your recent PRs?
@rkooo567 No! I'm just applying the workaround patch to Clipper for now because this issue is so complicated. After completing the open PRs, I will dig into this.
To reproduce this bug, try the steps below.
Step1 : Start the hello-world sample (cache_size=0, slo_micros=100)
Step2 : Log in to the QueryFrontend
Step3 : Check the current memory usage (approx. 19 MiB)
Step4 : Send many requests to the QueryFrontend
Step5 : Check that all the requests miss the SLO
Step6 : Check the current memory usage again (approx. 39 MiB)