Automatic detection and pausing of failing archives

ibnesayeed commented 8 years ago

An alternative way to improve the user experience described in #42 would be to automatically detect a failing upstream service and pause it for a time period T once it hits K timeouts in consecutive requests.

machawk1 commented 8 years ago

Another idea would be to short-circuit when a threshold is reached derived from your probabilities of the URI being in each archive.

ibnesayeed commented 8 years ago

> Another idea would be to short-circuit when a threshold is reached derived from your probabilities of the URI being in each archive.

For that, there is a top-K archives configuration in place. That K could be dynamic, but I will have to think about the heuristics around that.

ikreymer commented 8 years ago

Just saw this now, made a similar comment in #42 :) +1 Some sort of timeout/health system is definitely needed, so that slower archives can be included w/o affecting overall performance.

ibnesayeed commented 8 years ago

This feature is now implemented. This means one culprit will no longer be able to drag down every healthy archive when longer timeouts are in use (which are necessary for heavily archived resources). Yay! :+1:

The current implementation introduces two new flags. The `-F, --tolerance` flag sets the number of consecutive failures of an archive that triggers hibernation for that archive. When the value is set to -1 (the default), automatic hibernation is disabled. The `-d, --dormant` flag sets how long the archive remains dormant, counted from the beginning of the hibernation, before it becomes active again (default 15 minutes).

When an archive becomes active again after being dormant, or if it responds successfully after a few failures but before reaching the failure tolerance threshold, its failure count is reset to zero.

Currently, the decision is made based on the archive for the sake of simplicity of reporting, but there might be cases where only one of the TimeMap or TimeGate endpoints of an archive is misbehaving and the other is healthy. If there are enough such cases, we can alter the implementation to hibernate endpoint URIs rather than archives.

machawk1 commented 8 years ago

/cc @N0taN3rd regarding his work in setting up a Memento test corpus -- the case where an archive is down should be considered as a test pattern.

ibnesayeed commented 8 years ago

> /cc @N0taN3rd regarding his work in setting up a Memento test corpus -- the case where an archive is down should be considered as a test pattern.

And one way to create test cases for this would be to allow a custom header/response delay parameter in the URL of the test mock service.

oduwsdl / MemGator

Automatic detection and pausing of failing archives #43