I'm thinking of mitigating the above problem with the measures described in “How to fix the problem”. What do you think? Please review.
@Hi-king @hirosassa @e-mon @ujiuji1259 @kitagry
@mski-iksm Thanks for taking the time to report such exhaustive implementation ideas! Looks great to me! Please go ahead and create a PR!
Thanks @mski-iksm
Problem
Using task cache locking prevents cache conflicts even when multiple gokart tasks are running in parallel. However, if multiple tasks running in parallel depend on the same task, the read process may cause congestion and parallelism may be compromised.
The figure below shows a case where the problem occurs. Suppose apps A, B, and C are running on different nodes and are trying to do a read() of the same task. Although the three apps are running in parallel, the task lock causes congestion.
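For illustration, here is a minimal sketch of the congested behaviour described above, assuming a redis-based lock via the redis-py client. The connection settings, lock key, timeout, and helper name are placeholders, not gokart's actual API; the point is that holding the lock for the entire read serializes parallel readers.

```python
import redis

client = redis.Redis(host='localhost', port=6379)  # assumed connection settings

def load_with_full_lock(lock_key: str, load_fn):
    # Hypothetical current behaviour: the lock is held for the whole _load(),
    # so apps B and C block until app A (and then each other) finish reading,
    # even though concurrent reads of an existing cache would be safe.
    with client.lock(lock_key, timeout=180):
        return load_fn()
```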
How to fix the problem
Use different procedures to acquire locks when writing, reading, and deleting the cache, as illustrated in the figures below.
_remove() is the method that removes the cache; it is called when the rerun parameter of the task is set to True.
As mentioned in cases 4 and 5 below, rerun is prone to causing problems when combined with the cache lock feature. Therefore, we would like to emit a warning when rerun is set while the redis cache lock feature is turned on.
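A minimal sketch of such a warning, assuming a hypothetical helper and parameter names (rerun, redis_lock_enabled) that may not match gokart's actual configuration:

```python
import warnings

def warn_if_rerun_with_cache_lock(rerun: bool, redis_lock_enabled: bool) -> None:
    # Hypothetical check run at task construction time: rerun triggers _remove(),
    # which (as cases 4 and 5 below show) conflicts with the redis cache lock.
    if rerun and redis_lock_enabled:
        warnings.warn(
            'rerun=True removes the task cache; combined with the redis cache lock, '
            'concurrent readers may fail (see cases 4 and 5).'
        )
```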
Why it works
case1: When multiple writing tasks happen at the same time
Suppose apps A and B are both writing tasks.
In such a case, one of the tasks will wait until the other completes its writing process to prevent cache collisions.
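A minimal sketch of this behaviour with redis-py, using placeholder names and timeouts rather than gokart's actual implementation: the writer holds the lock for the whole _dump(), so a second writer blocks until the first one finishes.

```python
import redis

client = redis.Redis(host='localhost', port=6379)  # assumed connection settings

def dump_with_lock(lock_key: str, dump_fn):
    # The second writer blocks in acquire() until the first writer releases the
    # lock, so the two _dump() calls never run at the same time.
    with client.lock(lock_key, timeout=180, blocking_timeout=600):
        dump_fn()
```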
case2: When a writing task and reading tasks happen
Suppose app A is a writing task and apps B and C are reading tasks. Consider the case where apps B and C try to read the cache at almost the same time while app A is executing a write operation.
In this case, apps B and C will wait for app A to complete writing, because before _load() there is a step that acquires the lock for a moment and releases it immediately. Since apps B and C do not hold the lock for the whole time they are reading but release it right away, neither has to wait for the other's _load() to finish. This is useful when _load() takes a long time.
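A minimal sketch of this momentary lock (placeholder names and timeouts, not gokart's actual API): the reader waits for any running writer by acquiring the lock, but releases it again before starting the possibly long _load(), so multiple readers can load concurrently.

```python
import redis

client = redis.Redis(host='localhost', port=6379)  # assumed connection settings

def load_with_momentary_lock(lock_key: str, load_fn):
    # Acquire the lock only for an instant: this blocks while a writer holds it,
    # but is released again *before* the read, so readers never block each other.
    lock = client.lock(lock_key, timeout=180, blocking_timeout=600)
    lock.acquire()
    lock.release()
    return load_fn()
```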
case3: When a writing task happens after a reading task
In contrast to the previous case, consider the situation where a writing app B starts while a reading app A is already running.
In this case, app B acquires the lock while app A's _load() is being executed. However, app B performs an existence check before writing and skips _dump() if the cache already exists. Therefore, app B's writing process completes without overwriting or deleting the cache that app A is currently reading.
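A minimal sketch of that existence check (exists_fn and the other names are placeholders): while holding the lock, the writer skips _dump() when the cache is already present, so a cache that another app may currently be reading is never replaced.

```python
import redis

client = redis.Redis(host='localhost', port=6379)  # assumed connection settings

def dump_if_not_exists(lock_key: str, exists_fn, dump_fn):
    # While holding the lock, check whether the cache already exists; if it does,
    # skip _dump() so the file currently being read by app A is left untouched.
    with client.lock(lock_key, timeout=180, blocking_timeout=600):
        if not exists_fn():
            dump_fn()
```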
case4: When a removing task and reading task happen
Suppose app A is a removing task and app B is a reading task.
In this case, app B's processing starts after app A's _remove() has completed and the lock has been released. However, when app B tries to read the file, the cache has already been removed, so an error occurs.
This is the first reason why rerun, which triggers _remove(), does not work well with the cache lock.
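For illustration only, a hypothetical timeline of this failure, reusing the placeholder helpers sketched above:

```python
import redis

client = redis.Redis(host='localhost', port=6379)  # assumed connection settings

def remove_with_lock(lock_key: str, remove_fn):
    # rerun=True leads to _remove() being executed while holding the lock.
    with client.lock(lock_key, timeout=180, blocking_timeout=600):
        remove_fn()

# Hypothetical timeline of case 4:
#   app A: remove_with_lock(...)          -> cache file is deleted
#   app B: load_with_momentary_lock(...)  -> waits for app A's lock, then _load()
#                                            raises (e.g. FileNotFoundError)
#                                            because the cache no longer exists.
```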
case5: When a removing task happens while another task is reading
In contrast to the previous case, consider the case where app B executes _remove() while app A is executing _load().
In this case, app A's _load() may fail, depending on the file system, because app B removes the cache while the load is in progress.
This is the second reason why rerun, which triggers _remove(), does not work well with the cache lock.