m3dev / gokart

Gokart solves reproducibility, task dependencies, constraints of good code, and ease of use for Machine Learning Pipeline.
https://gokart.readthedocs.io/en/latest/
MIT License

The redis cache lock reduces task parallelism and causes unnecessary latency in execution #265

Closed. mski-iksm closed this issue 2 years ago.

mski-iksm commented 2 years ago

Problem

Using task cache locking prevents cache conflicts even when multiple gokart tasks are running in parallel. However, if multiple tasks running in parallel depend on the same task, the read process may cause congestion and parallelism may be compromised.

The figure below shows a case where the problem occurs. Suppose apps A, B, and C are running on different nodes and each tries to read() the output of the same task. Although the three apps run in parallel, the task lock forces them to wait for one another.

image
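For context, here is a minimal sketch of the behaviour being described, written against redis-py rather than gokart's actual locking code. Every name in it (`wrap_with_blocking_lock`, `lock_key`) is an illustrative assumption: the point is only that if one blocking lock wraps the whole of `_load()`/`_dump()`, readers of the same cache are serialized.

```python
import functools
import redis

# Illustrative sketch only, not gokart's real implementation: the same
# blocking lock is held for the entire duration of the wrapped operation,
# so parallel readers of one cache end up running one at a time.
def wrap_with_blocking_lock(func, redis_client: redis.Redis, lock_key: str):
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        lock = redis_client.lock(lock_key, timeout=180)
        lock.acquire(blocking=True)       # apps B and C wait here ...
        try:
            return func(*args, **kwargs)  # ... while one _load()/_dump() runs
        finally:
            lock.release()
    return wrapper
```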

How to fix the problem

Use different lock-acquisition procedures for writing, reading, and deleting the cache, as illustrated in the figures below.

image

_remove() is the method that removes the cache; it is called when the rerun parameter of the task is set to True.

As described in cases 4 and 5 below, rerun is prone to causing problems when combined with the cache lock feature. Therefore, we would like to emit a warning when rerun is set while the redis cache lock is enabled (a rough sketch follows).
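A possible shape for that warning, under the assumption of hypothetical attribute names (`task.rerun`, `task.redis_host`); the actual parameter names in gokart may differ:

```python
import logging

logger = logging.getLogger(__name__)

# Hypothetical sketch: the attribute names `rerun` and `redis_host` are
# assumptions for illustration, not confirmed gokart API.
def warn_if_rerun_with_cache_lock(task) -> None:
    uses_redis_lock = getattr(task, 'redis_host', None) is not None
    if getattr(task, 'rerun', False) and uses_redis_lock:
        logger.warning(
            'rerun=True removes the task cache while the redis cache lock is enabled; '
            'concurrent readers of the same cache may fail (see case 4 and case 5).')
```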

Why it works

case1: When multiple writing tasks happen at the same time

Suppose apps A and B are both writing tasks.

In such a case, one of the tasks will wait until the other completes its writing process to prevent cache collisions.

image
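A minimal sketch of this write path, using redis-py's blocking lock and hypothetical helper names: the lock is held for the whole of `_dump()`, so a second writer blocks in acquire() until the first one finishes.

```python
import functools
import redis

# Sketch of the write path for case 1 (names are illustrative assumptions):
# the lock is held for the full _dump(), so the second writer must wait.
def wrap_dump_with_lock(dump_func, redis_client: redis.Redis, lock_key: str):
    @functools.wraps(dump_func)
    def wrapper(obj):
        lock = redis_client.lock(lock_key, timeout=180)
        lock.acquire(blocking=True)  # app B blocks here until app A releases
        try:
            dump_func(obj)
        finally:
            lock.release()
    return wrapper
```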

case2: When a writing task and reading tasks happen

Suppose app A is a writing task and apps B and C are reading tasks. Consider the case where apps B and C try to read the cache at almost the same time while app A is executing a write operation.

In this case, apps B and C will wait for app A to complete its write. This is because, before _load(), there is a step that takes the lock for a moment and releases it immediately. Since apps B and C do not hold the lock for the whole time they are reading but release it right away, each can start its own _load() without waiting for the other's _load() to finish. This matters when _load() takes a long time.

image
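A sketch of the proposed read path under the same assumptions: the lock is taken only for an instant to make sure no write is in progress, then released before `_load()`, so readers never exclude one another.

```python
import functools
import redis

# Sketch of the read path for case 2 (illustrative, not gokart's actual code):
# acquire the lock just long enough to confirm no writer holds it, release it,
# then run the (possibly long) _load() outside the lock.
def wrap_load_with_momentary_lock(load_func, redis_client: redis.Redis, lock_key: str):
    @functools.wraps(load_func)
    def wrapper():
        lock = redis_client.lock(lock_key, timeout=180)
        lock.acquire(blocking=True)  # waits only if a writer currently holds the lock
        lock.release()               # released at once, so readers run in parallel
        return load_func()           # long _load() happens without holding the lock
    return wrapper
```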

case3: When a writing task happens after a reading task

In contrast to the previous case, consider the situation where writing app B starts while reading app A is already running.

In this case, app B acquires the lock while app A's _load() is still executing. However, app B performs an existence check before writing and skips _dump() if the cache already exists. Therefore, app B's write can complete without overwriting or deleting the cache that app A is currently reading.

image
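A sketch of that existence check on the write path, with `exists_func` as a hypothetical callable that reports whether the cache file is already present:

```python
import functools
import redis

# Sketch of the write path for case 3 (names are illustrative assumptions):
# if the cache already exists, _dump() is skipped, so app A's in-progress
# _load() never sees its file overwritten or deleted.
def wrap_dump_with_exist_check(dump_func, exists_func, redis_client: redis.Redis, lock_key: str):
    @functools.wraps(dump_func)
    def wrapper(obj):
        lock = redis_client.lock(lock_key, timeout=180)
        lock.acquire(blocking=True)
        try:
            if exists_func():  # cache already written by another app: skip the dump
                return
            dump_func(obj)
        finally:
            lock.release()
    return wrapper
```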

case4: When a removing task and reading task happen

Suppose app A is a removing task and app B is a reading task.

In this case, app B's processing starts after app A's _remove() has completed and the lock has been released. However, when app B tries to read the file, the cache file has already been removed, so an error occurs.

This is the first reason why rerun, which triggers _remove(), does not work well with the cache lock.

image

case5: When a removing task happens while other task is reading

In contrast to the previous case, consider the case where app B executes _remove() while app A is executing _load().

In this case, app A's _load() may fail, depending on the file system, because app B removes the cache while the load is in progress.

This is the second reason why rerun, which triggers _remove(), does not work well with the cache lock.

image

mski-iksm commented 2 years ago

I'm thinking of mitigating the above problem with the measures described in “How to fix the problem”. What do you think? Please review.

@Hi-king @hirosassa @e-mon @ujiuji1259 @kitagry

hirosassa commented 2 years ago

@mski-iksm Thanks for taking the time to report such exhaustive implementation ideas! Looks great to me! Please go ahead and create a PR!

hirosassa commented 2 years ago

Thanks @mski-iksm