huserben / TfsExtensions

Extensions for TFS 2015+ such as custom Widgets (require TFS 2017) and Build Tasks
MIT License

ETIMEDOUT happens randomly #59

Closed JeffRausch closed 6 years ago

JeffRausch commented 6 years ago

I saw another issue on this, but it was caused by a proxy change and has since been closed, so I'm opening this one (https://github.com/huserben/TfsExtensions/issues/58).
We have a full stack build that triggers 50+ build configurations in order and takes 45-50 minutes to complete. Most of the time it succeeds, but randomly it fails with this error. Over the past few days I have noticed that the VSTS console responds very slowly at times, so it seems like the REST API calls may be timing out due to load or something similar; at least that's what I assume is happening. Is there any type of retry that could be done, or is there a way we could increase the timeout for the API calls? I have the time between build checks set to 5 seconds, but I don't think that has anything to do with a call timing out (unless the timeout waiting for an API call to return is based on it).

2018-03-16T09:30:58.0396243Z Build 4328 has not yet completed
2018-03-16T09:30:58.0396969Z Sleeping for 5 seconds...
2018-03-16T09:31:03.1238742Z Build 4328 has not yet completed
2018-03-16T09:31:03.1239464Z Sleeping for 5 seconds...
2018-03-16T09:31:29.2034739Z ##[error]connect ETIMEDOUT 13.107.6.175:443
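
For context, the waiting is a simple poll: the task checks the triggered build's status over the REST API, sleeps, and checks again, so any single status call can hit the connect timeout. A rough sketch of that pattern (the function names and the stub below are placeholders, not the task's actual code):

```typescript
// Rough sketch of the polling pattern behind the log above.
// getBuildStatus is a placeholder, not the task's real API call.
async function waitForBuild(buildId: number, sleepSeconds: number): Promise<void> {
    while (!(await getBuildStatus(buildId))) {
        console.log(`Build ${buildId} has not yet completed`);
        console.log(`Sleeping for ${sleepSeconds} seconds...`);
        await new Promise<void>(resolve => setTimeout(resolve, sleepSeconds * 1000));
    }
}

// Stub standing in for the REST call that asks VSTS whether the build finished.
// In the real task this is an HTTPS request, and that request is what can fail
// with "connect ETIMEDOUT <ip>:443" and abort the step.
async function getBuildStatus(_buildId: number): Promise<boolean> {
    return false;
}
```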

huserben commented 6 years ago

Hi @JeffRausch

Hmm, it seems odd that this suddenly started happening. Is it really random, or can you identify any conditions under which it happens?

Just checking around on the internet about the error, it seems to be coming from the module that is used to send the request to VSTS. However, some answers (e.g. this one on SO) relate it to proxy settings, so could it be something similar to what is described in Issue 58?

Other than that, yes, I think the timeout for the wait could be configured; however, I'm not sure what the current setting is, as I did not change any default values there.

I will try to look into it a bit more in detail.

JeffRausch commented 6 years ago

It really is random and it has only happened a few times so far. It's not that big of a deal since it's so sparse, but when it does happen on a large build it's a little frustrating that it stops the build even though the triggered builds succeed. I wasn't sure if there was anything that could be set on your end, so I just figured I'd check. Thanks anyway. The agent is running on an AWS EC2 instance and I manage all firewalls/proxies/etc., so I know none of that has changed.

JeffRausch commented 6 years ago

Sorry, didn't mean to close it

huserben commented 6 years ago

Yeah, I see that it's becoming an issue if you have builds that fail non-deterministically. Especially for long-running builds this is a problem.

I can of course try to set a timeout value in the options; the question is what would be a good value for it. Once this is done we can observe a bit and see whether it fixed the problem or not.

On top of this I could try to improve the error handling a bit; maybe we'll get a better message about what exactly failed...
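
As an illustration of what such a timeout option could look like if the request were made with Node's built-in https module (this is a sketch under that assumption, not the task's actual HTTP layer; the function name and error message are made up):

```typescript
import * as https from "https";

// Hypothetical sketch: a GET request with an explicit socket timeout, so a
// hanging connection fails quickly instead of waiting for the OS-level
// connect timeout. The timeout value would need to be tuned by observation.
function getWithTimeout(url: string, timeoutMs: number): Promise<string> {
    return new Promise<string>((resolve, reject) => {
        const request = https.get(url, { timeout: timeoutMs }, response => {
            let body = "";
            response.on("data", chunk => (body += chunk));
            response.on("end", () => resolve(body));
        });

        // Fires when the socket sees no activity for timeoutMs milliseconds.
        request.on("timeout", () => {
            request.destroy(new Error(`Request timed out after ${timeoutMs} ms`));
        });
        request.on("error", reject);
    });
}
```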

huserben commented 6 years ago

Hi Jeff

Sorry, I haven't gotten around to looking into this issue yet. Just out of curiosity, has it happened again lately, or not anymore?

JeffRausch commented 6 years ago

I have not seen the issue this week. I assume it was an underlying issue with VSTS that has since been resolved. A little extra logging or some type of retry around that call would be nice to have, but it's not a huge issue for me right now. This issue can be closed if you want to push that type of feature off for a while.

Thanks!

huserben commented 6 years ago

Hi

No, I'd like to leave it open until I've implemented some additional logging, as that will help narrow down the underlying problem the next time the issue appears.

However, in that case it won't be too high on my priority list.

Thanks for your feedback, it's very much appreciated.

huserben commented 6 years ago

Hi Jeff

So I just uploaded version 2.7.2, which should handle this case better if the issue pops up again. I will close the issue now; please feel free to bring it up again if it starts happening again, as we should then see some more info about it in the log.

In the meantime, I wanted to thank you for the contributions you have made to the project. If you have any other issues or ideas on how to improve the task, please open a new issue where we can discuss them.

JeffRausch commented 6 years ago

Hi @huserben, just a quick update... this morning I got an error during one of our builds that looked almost like the timeout issue, but with a different message. I'm not sure whether it's the same issue showing up differently because of the logging changes you made. The build that was triggered succeeded (it took 89 seconds), but the main build failed due to this error.

2018-03-29T09:19:09.0334270Z Waiting for builds to finish - This might take a while...
2018-03-29T09:19:42.9339755Z ##[error]Service Unavailable

huserben commented 6 years ago

Hi

Ok, yeah, that is possible. With the next version I will try to set a value for the timeout and see if that improves the situation.

What is your "refresh rate" to check whether the build has finished?

JeffRausch commented 6 years ago

I have it set to 5 seconds since I have so many configurations being called in a row (57), just to help reduce the overall time it sits there doing nothing. I thought about not waiting for them to complete and letting them all queue up, but if one fails I don't want any of the others to build, and I want the main build to fail.

huserben commented 6 years ago

Hmm, I'm just guessing here, but maybe there is an issue due to the many requests made in that short time. Would it be possible for you to set the refresh rate to 30 seconds and give it a try?

Do you trigger and await your builds one after another (trigger Build1, await Build1, trigger Build2, await Build2), or do you trigger all of them at the same time and then just wait for all of them to complete?
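
Just so we mean the same thing, the two patterns would look roughly like this (a sketch only; triggerBuild and waitForBuild are placeholder stubs, not the task's real functions):

```typescript
// Sketch of the two orchestration patterns in question; the stubs below stand
// in for the task's real trigger/await calls.
async function runSequentially(definitions: string[]): Promise<void> {
    // Trigger Build1, await Build1, trigger Build2, await Build2, ...
    for (const definition of definitions) {
        const buildId = await triggerBuild(definition);
        await waitForBuild(buildId); // stop here (and fail) if this build fails
    }
}

async function runAllAtOnce(definitions: string[]): Promise<void> {
    // Trigger everything up front, then wait for the whole batch to finish.
    const buildIds = await Promise.all(definitions.map(triggerBuild));
    await Promise.all(buildIds.map(waitForBuild));
}

// Placeholder stubs.
async function triggerBuild(_definition: string): Promise<number> {
    return 0;
}
async function waitForBuild(_buildId: number): Promise<void> {
    // In the real task: poll the build status until it completes.
}
```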

huserben commented 6 years ago

Hi @JeffRausch

I just uploaded version 2.8.0 of the task, which includes some (hopefully) improved error handling of the GET requests made to the server. Basically, it now tries up to 5 times before it fails for good.
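
In rough terms, the retry behaviour works along these lines (a simplified sketch rather than the exact code in the task; sendGetRequest is a placeholder for the real GET call):

```typescript
// Simplified sketch of "try up to 5 times before failing for good".
// sendGetRequest stands in for the task's real GET call to the server.
async function getWithRetries<T>(sendGetRequest: () => Promise<T>, maxAttempts: number = 5): Promise<T> {
    let lastError: unknown;
    for (let attempt = 1; attempt <= maxAttempts; attempt++) {
        try {
            return await sendGetRequest();
        } catch (error) {
            // Transient failures such as ETIMEDOUT or "Service Unavailable"
            // are swallowed here and the request is simply attempted again.
            lastError = error;
        }
    }
    // Only when every attempt has failed does the error reach the build log.
    throw lastError;
}
```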

Could you let me know if it still happens or whether you think it's solved? I guess you cannot say for sure, as it's more or less random - but please just give me an update once you have a general feeling about whether it might have been solved or not.

JeffRausch commented 6 years ago

Thank you! I'll watch it for a few weeks and let you know if I notice any issues. Is there any logging that indicates whether the retry logic is being hit? Anything I can watch for to know the situation is being handled?

huserben commented 6 years ago

If you set the system.debug variable, every failed attempt will be logged. This is because I did not want to pollute the log too much with the retry info. If it fails after 5 tries, I'll log the error message of the last try to the regular log, regardless of whether debug is set or not.
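
In sketch form, the logging split looks roughly like this (the SYSTEM_DEBUG environment-variable check and the message texts are illustrative assumptions, not the exact implementation):

```typescript
// Sketch of the logging split described above: failed attempts are only
// logged when system.debug is set; the final failure is always logged.
function isDebugEnabled(): boolean {
    // The agent exposes the system.debug pipeline variable as SYSTEM_DEBUG.
    return (process.env.SYSTEM_DEBUG || "").toLowerCase() === "true";
}

function logFailedAttempt(attempt: number, maxAttempts: number, error: Error): void {
    if (attempt < maxAttempts) {
        // Intermediate failures stay out of the regular log so retries
        // don't pollute it; they only show up when debugging is enabled.
        if (isDebugEnabled()) {
            console.log(`##[debug]Attempt ${attempt} of ${maxAttempts} failed: ${error.message}`);
        }
    } else {
        // After the last attempt, the error is written regardless of system.debug.
        console.error(`##[error]${error.message}`);
    }
}
```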

JeffRausch commented 6 years ago

Sounds good. I'll set system.debug for now so I can check to see if it encounters that logic.

huserben commented 6 years ago

Hi there

So, did the issue happen again in the last 2 weeks, or does it seem to be solved by now?

JeffRausch commented 6 years ago

I have not had any issues since then. I do have debug turned on and I don't see any logging indicating retries, so I assume the timeout hasn't happened since. I'll set up a script to check our logs and let me know if it ever hits the retry logic, even if it succeeds after a retry. I think this issue can be closed, and if anything comes up in the future I'll let you know. Thanks!

huserben commented 6 years ago

Alright, yes, please just let me know in case the issue is not solved.