kw1jjang / CalVoD

CalVoD
http://www.eecs.berkeley.edu/~kw1jjang/
BSD 3-Clause "New" or "Revised" License

Server load to users grows very high after the system runs for a long time #18

Open chenjiayuan opened 8 years ago

chenjiayuan commented 8 years ago

When running a large-scale test, the system runs properly until, at some point, the users stop requesting chunks from the caches and instead start downloading all chunks from the server. As a result, the server load to users gradually increases, as if there were no help from the caches.

chenjiayuan commented 8 years ago

This is because whenever a new user connects to a cache, the pyftpdlib used by that cache creates a new handler and automatically increments the handler index by 1. At some point, the time needed to traverse these handlers and update the connection rates exceeds the duration allocated for a rate update (0.01 sec), so the variables can no longer be updated properly.

Current solution: run the update equations less frequently, say 10 times per second, but make each update 10x more aggressive (in cache.py: T_rate = .1, T_storage = 1, scale = 3), so that the issue occurs at a much later time.
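The idea behind the workaround can be sketched as follows. This is a toy model, not the actual update equations in cache.py: the point is just that running the update 10x less often with a 10x larger step leaves the total per-frame change in the rate variable unchanged.

```python
# Toy model of the workaround: fewer, larger update steps accumulate
# to the same rate over one 10-second frame.

def run_updates(total_time, T_rate, step):
    """Accumulate a rate variable by `step` once every `T_rate` seconds."""
    rate = 0.0
    n_updates = int(total_time / T_rate)
    for _ in range(n_updates):
        rate += step
    return rate

# Original setup: 100 updates/sec with a small step.
fast = run_updates(total_time=10, T_rate=0.01, step=0.3)
# Workaround: 10 updates/sec with a 10x larger step.
slow = run_updates(total_time=10, T_rate=0.1, step=3.0)
# Both reach (approximately) the same rate after one frame.
```

The workaround buys time because the loop body now has 0.1 seconds of slack instead of 0.01, but the handler traversal still grows without bound, so it only delays the failure.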

Future solution: the pyftpdlib usage in the cache needs a fix. The handler data structures should be cleaned up while the system is running, so that the number of handlers per cache does not grow indefinitely and the for loop does not take a huge amount of time.
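A minimal sketch of what that cleanup could look like. The names here (`self.handlers`, `h.connected`) are illustrative assumptions, not pyftpdlib API or the project's actual code; the point is to prune disconnected handlers before each traversal so the update loop stays proportional to the number of live users.

```python
class Cache:
    """Hypothetical cache keeping an index -> handler table."""

    def __init__(self):
        self.handlers = {}  # index -> handler object

    def prune_handlers(self):
        # Drop entries whose connection has closed, so the table
        # does not grow indefinitely as users come and go.
        dead = [idx for idx, h in self.handlers.items() if not h.connected]
        for idx in dead:
            del self.handlers[idx]

    def rate_update(self):
        self.prune_handlers()  # keeps the loop below O(active users)
        for h in self.handlers.values():
            pass  # update the connection rate for each live handler
```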

chenjiayuan commented 8 years ago

pasting Kangwook's comment of this issue on Slack:

So recall that `T_storage = .01` was the original setup we had, right? This means that each cache runs the `rate_update_optimal()` method every 0.01 second, or 100 times per second. This is implemented in the following way:

```python
def rate_update_optimal(self, T_period):
    log_ct = 0
    while True:
        log_ct = log_ct + 1
        if log_ct == LOG_PERIOD:
            log_ct = 0
        time.sleep(T_period)
```

That is, each cache runs some algorithm, sleeps for 0.01 second, and repeats this forever. However, when the system runs for a while, the actual execution time of the algorithm becomes non-negligible, and this makes the effective frequency of the updates lower than what we expected.
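The drift arises because the loop sleeps a fixed `T_period` *after* each update, so the effective period is `T_period + execution_time`. A common fix, sketched here under the assumption that updates usually finish within the period, is to schedule against absolute deadlines instead:

```python
import time

def run_periodic(update, T_period, n_iters):
    """Run `update` every T_period seconds, compensating for the time
    the update itself takes (unlike a bare sleep(T_period) loop)."""
    next_deadline = time.monotonic()
    for _ in range(n_iters):
        update()
        next_deadline += T_period
        delay = next_deadline - time.monotonic()
        if delay > 0:
            time.sleep(delay)
        # If delay <= 0 the update ran late; skip sleeping and catch
        # up on the next iteration.
```

Note this only helps while a single update runs faster than `T_period`; once the handler traversal itself exceeds the period, no scheduler can hit the target frequency, and the data structures themselves have to shrink.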

For instance, the updates might happen only 50 times per second or so. This wouldn't be a huge problem if it didn't create any side effects. However, it actually does, which is what makes this issue tricky.

Now, consider a user. Every 10 seconds, a user hears from each of the connected caches how many chunks he can download from them (via the information channel). What happens with scale = 0.3 and T_rate = 0.01 is this: when a user requests chunks from a cache, running the `rate_update_optimal()` function 1000 times (100 times per second, for 10 seconds) results in a high enough rate assigned to that user, and the user requests the corresponding number of chunks from that cache (via an FTP request).

However, here is what happens when the update frequency becomes smaller. The `rate_update_optimal()` function is not run enough times; say it runs only 500 times instead of 1000. The assigned rate (or primal_x) for the user might then be even less than the rate needed for a single chunk. Therefore, the cache will tell the user: "hey, you can request only ZERO chunks from me."

The user will then send an "FTP_RETR" request to the cache saying "give me ZERO chunks." Now "ParentCache" comes into the story. When a cache handles an "FTP_RETR" command, it also updates the corresponding PRIMAL_X. That is, it first calculates "PRIMAL_X" from the number of requested chunks, and then replaces the current value of PRIMAL_X with the result of that calculation.
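The back-calculation described above might look something like the following. This is an assumed formula for illustration only (the constants and function name are hypothetical, not taken from cache.py): the key property is that a request for zero chunks maps to a PRIMAL_X of zero.

```python
CHUNK_RATE_UNITS = 1.0  # assumed: one chunk corresponds to one rate unit
FRAME_SECONDS = 10.0    # a user re-requests every 10 seconds

def primal_x_from_request(n_chunks):
    """Hypothetical: the rate implied by a request of n_chunks per frame."""
    return n_chunks * CHUNK_RATE_UNITS / FRAME_SECONDS

# A request for 0 chunks overwrites PRIMAL_X with 0 -- the trap:
# the next frame's updates then start from 0 again.
```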

Therefore, when the rate was lower than "1 CHUNK per frame," PRIMAL_X gets set to zero, and this keeps repeating forever. What I found after digging through the logs you provided was that the effective update frequency was indeed around 25, not 100, after the system had run for a while.

At the beginning of each frame, primal_x slowly goes up, because the caches want to provide some content to that user. But since the cache couldn't run enough updates within 10 seconds, the final value of primal_x at the end of the frame is less than 1 chunk. The user then requests 0 chunks from the cache, and the cache resets primal_x to 0 since 0 chunks were requested.
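The starvation loop described above can be reproduced with a toy model. The dynamics here are assumed for illustration (a fixed step per update, and a reset of primal_x to the requested chunk count), not the project's actual equations:

```python
import math

def simulate(updates_per_frame, step, n_frames):
    """Toy model: primal_x grows by `step` per update during a frame;
    the user requests floor(primal_x) chunks; the cache then resets
    primal_x from that request."""
    primal_x = 0.0
    requests = []
    for _ in range(n_frames):
        primal_x += updates_per_frame * step
        chunks = math.floor(primal_x)
        requests.append(chunks)
        primal_x = float(chunks)  # cache resets primal_x from the request
    return requests

# Full update frequency: primal_x clears 1 chunk per frame, so the
# user keeps receiving chunks from the cache.
healthy = simulate(updates_per_frame=1000, step=0.002, n_frames=3)
# Quartered frequency: primal_x never reaches 1 chunk, the user
# requests 0 chunks, and the reset traps it at zero forever.
starved = simulate(updates_per_frame=250, step=0.002, n_frames=3)
```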
