scarlehoff / pyHepGrid

Tool for distributed computing management geared towards HEP applications.
GNU General Public License v3.0

Storage upload configuration is likely to DDoS destination #44

Closed: aoanla closed this issue 4 years ago

aoanla commented 4 years ago

The current configuration of the "upload to grid" function is dangerous, and likely to cause more issues than it fixes. Blindly "retrying" up to 15 times on failure will usually just result in 15 failures, at the cost of additional load on the destination server.

Improvements:

1) if you're going to retry lots, at least wait explicitly between each try (exponential backoff would be even better, but just waiting several seconds between tries is a significant improvement over the current design).

2) parse the error you get from the transfer tool. Some failures cannot be recovered from, and timeout errors are better dealt with by submitting with increased timeout limits on the transfer, or attempting to copy to a different, less loaded, endpoint.
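The two suggestions above can be combined in a small retry wrapper. This is only a sketch of the idea, not pyHepGrid's actual code: `copy_with_backoff`, `FATAL_MARKERS`, and the error strings are hypothetical, and a real version would match against the actual messages produced by the transfer tool in use.

```python
import time

# Hypothetical substrings marking errors that retrying cannot fix.
FATAL_MARKERS = ("No such file", "Permission denied")


def copy_with_backoff(copy_fn, max_tries=3, base_delay=1.0):
    """Call copy_fn, retrying with exponential backoff on transient errors.

    Fatal errors (matching FATAL_MARKERS) are re-raised immediately,
    since repeating the attempt would only repeat the failure.
    """
    for attempt in range(1, max_tries + 1):
        try:
            return copy_fn()
        except RuntimeError as err:
            if any(marker in str(err) for marker in FATAL_MARKERS):
                raise  # unrecoverable: do not hammer the endpoint
            if attempt == max_tries:
                raise  # out of retries
            # Wait 1x, 2x, 4x, ... the base delay between attempts.
            time.sleep(base_delay * 2 ** (attempt - 1))
```

Note that even a fixed few-second sleep between tries would already avoid the burst of back-to-back requests; the exponential factor just spreads load further when the endpoint is struggling.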

marianheil commented 4 years ago

Good idea, @JBlack93 already implemented a timeout for hejrun.py in a63a755d159968e7880f3924d65e242e734c89b9, we should port that over to the rest. Since @jcwhitehead is currently rewriting a lot of the copy anyway (see #43) we should add the timeout there.
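A per-copy timeout like the one referenced above can be enforced with `subprocess.run`'s `timeout` parameter. This is a generic sketch, not the code from commit a63a755; `timed_copy` and the 300-second default are illustrative.

```python
import subprocess


def timed_copy(cmd, timeout_s=300):
    """Run a copy command, killing it if it exceeds timeout_s seconds.

    Returns True on a zero exit code, False on failure or timeout.
    """
    try:
        result = subprocess.run(cmd, capture_output=True, text=True,
                                timeout=timeout_s)
        return result.returncode == 0
    except subprocess.TimeoutExpired:
        return False  # the child process was killed after the deadline
```

A timed-out transfer could then be resubmitted with a larger limit, or redirected to a less loaded endpoint, as suggested above.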

jcwhitehead commented 4 years ago

Hi @aoanla, thanks for the suggestion. I've implemented an escalating sleep between tries and it's just been merged into master.

Our typical experience has been that a full set of failures is very rare, but that avoidable copy errors (which disappear on a subsequent attempt) are fairly routine.

The risk of simultaneous repeated copy attempts being made by many nodes at once has typically been mitigated by the fact that jobs very rarely start simultaneously, as there's a bottleneck in the rate at which we can submit and the queue is typically busy. The run durations are then randomly distributed with a standard deviation of several hours, so the final copy of results back to the grid storage is highly asynchronous between nodes.

In any case, it is hopefully fixed now - thanks for the input!

scarlehoff commented 4 years ago

@GandalfTheWhite2 @jcwhitehead I'm going to continue the discussion in this issue, as I believe it is the correct one for this.

Regardless of whether it generates a DDoS or not, I would be a bit unhappy, even from a philosophical point of view, with something that says "oh, it failed, try again 15 times", because chances are you will just repeat the error 15 times, as @aoanla correctly points out.

Waiting for a timeout is nicer, of course, but I would also bring the maxrange down: just try 2 or 3 times.

> Our typical experience has been that a full set of failures is very rare, but that avoidable copy errors (which disappear on a subsequent attempt) are fairly routine.

Is this problem happening in ARC or Dirac?

Has anyone seen any problems with gfal-copy when executed from the command line? Try 100 times.

And the final question is: these failures which are fairly routine... were they seen after moving to gfal-copy? If I remember correctly, the retry was originally implemented for lfn-copy (or whatever it was called), which had the very interesting property of returning a "success code" and a "success message" even when the copy had failed. So maybe the whole thing is a fix for a problem that is not there anymore?

All that said, if this is a random problem that happens inconsistently, trying 100 times from the command line won't expose it.
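The "try 100 times from the command line" check above could be scripted along these lines. This is a hypothetical helper, not part of pyHepGrid: `stress_copy` and the source/destination URLs are placeholders, and in real use `cmd` would be something like `["gfal-copy", src_url, dst_url]` with `gfal-copy` installed.

```python
import subprocess


def stress_copy(cmd, tries=100):
    """Run a copy command repeatedly, collecting any nonzero exit codes.

    Returns a list of (attempt_index, returncode, stderr) tuples for the
    failed attempts; an empty list means every attempt succeeded.
    """
    failures = []
    for i in range(tries):
        result = subprocess.run(cmd, capture_output=True, text=True)
        if result.returncode != 0:
            failures.append((i, result.returncode, result.stderr.strip()))
    return failures
```

If the failure is genuinely random, the interesting output is the failure *rate* and the distinct error messages, not any single run.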

marianheil commented 4 years ago

This should be fixed with #43. I will close the issue. If we find this to still be a problem we can reopen it.