Closed WhitWaldo closed 6 years ago
Following up on this, I'm able to deploy without issue from my development machine within Visual Studio 2017 Enterprise.
I've toggled just about all the options related to this in the release task - I've tried enabling/disabling compression, diff releases, setting the timeout values, not setting them and using the defaults, and manually setting the MaxMessageSize fabric setting on the service fabric instance to 10 MB (per the guidance at https://github.com/Azure/service-fabric-issues/issues/1170).
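For anyone following that same guidance, cluster settings like this are applied through the fabricSettings block of the cluster's ARM template. A sketch only - the section name shown here is taken as an assumption from the linked issue and should be verified against it (10485760 bytes = 10 MB):

```json
{
  "fabricSettings": [
    {
      "name": "ClusterConnection",
      "parameters": [
        { "name": "MaxMessageSize", "value": "10485760" }
      ]
    }
  ]
}
```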
Looking at the VSTS agent logs on my build machine, it appears that VSTS just stops interacting with the SF instance. In the deployment step, I see the following:
[2018-08-25 00:09:32Z INFO ProcessInvoker] Starting process:
[2018-08-25 00:09:32Z INFO ProcessInvoker] File name: 'C:\Windows\System32\WindowsPowerShell\v1.0\powershell.exe'
[2018-08-25 00:09:32Z INFO ProcessInvoker] Working directory: 'C:\vsts-agent-win-x64-2.134.2_work_tasks\ServiceFabricDeploy_c6650aa0-185b-11e6-a47d-df93e7a34c64\1.7.21'
[2018-08-25 00:09:32Z INFO ProcessInvoker] Require exit code zero: 'True'
[2018-08-25 00:09:32Z INFO ProcessInvoker] Encoding web name: ; code page: ''
[2018-08-25 00:09:32Z INFO ProcessInvoker] Force kill process on cancellation: 'False'
[2018-08-25 00:09:32Z INFO ProcessInvoker] Lines to send through STDIN: '0'
[2018-08-25 00:09:33Z INFO ProcessInvoker] Process started with process id 1120, waiting for process exit.
[2018-08-25 00:09:33Z INFO JobServerQueue] Try to append 6 batches web console lines for record '1e912ee4-6793-4bec-ba67-643980ce67ad', success rate: 6/6.
[2018-08-25 00:09:34Z INFO JobServerQueue] Try to append 6 batches web console lines for record '1e912ee4-6793-4bec-ba67-643980ce67ad', success rate: 6/6.
[2018-08-25 00:09:35Z INFO JobServerQueue] Try to append 6 batches web console lines for record '1e912ee4-6793-4bec-ba67-643980ce67ad', success rate: 6/6.
[2018-08-25 00:09:36Z INFO JobServerQueue] Try to append 6 batches web console lines for record '1e912ee4-6793-4bec-ba67-643980ce67ad', success rate: 6/6.
[2018-08-25 00:09:37Z INFO JobServerQueue] Try to append 6 batches web console lines for record '1e912ee4-6793-4bec-ba67-643980ce67ad', success rate: 6/6.
[2018-08-25 00:09:38Z INFO JobServerQueue] Try to append 6 batches web console lines for record '1e912ee4-6793-4bec-ba67-643980ce67ad', success rate: 6/6.
[2018-08-25 00:09:38Z INFO JobServerQueue] Try to append 6 batches web console lines for record '1e912ee4-6793-4bec-ba67-643980ce67ad', success rate: 6/6.
[2018-08-25 00:09:39Z INFO JobServerQueue] Try to upload 2 log files or attachments, success rate: 2/2.
[2018-08-25 00:09:39Z INFO JobServerQueue] Try to append 6 batches web console lines for record '1e912ee4-6793-4bec-ba67-643980ce67ad', success rate: 6/6.
[2018-08-25 00:09:40Z INFO JobServerQueue] Try to append 5 batches web console lines for record '1e912ee4-6793-4bec-ba67-643980ce67ad', success rate: 5/5.
[2018-08-25 00:09:40Z INFO JobServerQueue] Try to append 1 batches web console lines for record '9aa8df33-d57c-4478-be71-6d27f1eec411', success rate: 1/1.
[2018-08-25 00:09:41Z INFO JobServerQueue] Try to append 2 batches web console lines for record '9aa8df33-d57c-4478-be71-6d27f1eec411', success rate: 2/2.
[2018-08-25 00:09:41Z INFO JobServerQueue] Try to upload 1 log files or attachments, success rate: 1/1.
[2018-08-25 00:09:42Z INFO JobServerQueue] Try to append 1 batches web console lines for record '9aa8df33-d57c-4478-be71-6d27f1eec411', success rate: 1/1.
[2018-08-25 00:09:43Z INFO JobServerQueue] Try to append 1 batches web console lines for record '9aa8df33-d57c-4478-be71-6d27f1eec411', success rate: 1/1.
[2018-08-25 00:10:08Z INFO JobServerQueue] Try to append 1 batches web console lines for record '9aa8df33-d57c-4478-be71-6d27f1eec411', success rate: 1/1.
[2018-08-25 00:10:54Z INFO JobServerQueue] Try to append 1 batches web console lines for record '9aa8df33-d57c-4478-be71-6d27f1eec411', success rate: 1/1.
[2018-08-25 00:10:55Z INFO JobServerQueue] Try to append 1 batches web console lines for record '9aa8df33-d57c-4478-be71-6d27f1eec411', success rate: 1/1.
[2018-08-25 00:11:20Z INFO JobServerQueue] Try to append 1 batches web console lines for record '9aa8df33-d57c-4478-be71-6d27f1eec411', success rate: 1/1.
[2018-08-25 00:12:07Z INFO JobServerQueue] Try to append 1 batches web console lines for record '9aa8df33-d57c-4478-be71-6d27f1eec411', success rate: 1/1.
[2018-08-25 03:27:56Z INFO Worker] Cancellation/Shutdown message received.
I cancelled after 3 hours when I noticed that the cluster wasn't registering that an upgrade or file transfer was in progress. This deployment typically completes within 25 minutes or so, so running for 3 hours was a sign in itself.
@WhitWaldo we haven't made any changes to the VSTS task in the past week. You can confirm this by looking at the task version printed in the VSTS logs; it should be the same as when deployments were working for you. Let me know if that is not the case. Also, the reason you don't see any state change in the cluster is that the release is getting stuck while uploading the application package, so it is never registered on the cluster as an upgrade. Can you please try the Hosted VS2017 agent pool to see if it works better?
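As a quick way to confirm whether the upload ever completed and an upgrade was registered, the cluster can be queried directly from PowerShell. A sketch - the endpoint and application name are placeholders, and a secured cluster needs the usual certificate parameters:

```powershell
# Connect to the cluster (add -X509Credential and certificate details for a secure cluster)
Connect-ServiceFabricCluster -ConnectionEndpoint "mycluster.eastus.cloudapp.azure.com:19000"

# Shows the current upgrade state, if any; a stuck upload never reaches this stage,
# so no upgrade will be reported for the new version
Get-ServiceFabricApplicationUpgrade -ApplicationName "fabric:/MyApp"
```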
@bishal-pdMSFT Don't know what to tell you - it's been working great deploying from VSTS for over a year, and this cluster instance hasn't given me a problem in the month it's been around either. All of a sudden, as of last week, I can't deploy successfully more than 5% of the time, if that.
Works fine if I deploy from my dev machine itself.
I found several issues in the Service Fabric repo that match the issue I'm having, so perhaps it has more to do with my large application package than with VSTS itself, but so far I'm hitting a brick wall on resolving it.
@oanapl it would be better if the investigation starts at the SF SDK layer and then moves to VSTS if necessary. I don't have the necessary skills to look at SF logs and diagnose the first level of the issue. @WhitWaldo can you please attach the VSTS task logs from the last time this task worked and the first time it did not?
@bishal-pdMSFT I do have such debug-level logs available, but I'd feel better about sharing them (or others) with you directly rather than posting them in this public context and having to redact each one.
I've attached logs from the "Deploy Service Fabric Application" step both when it works and when it doesn't. notworking.txt working.txt
@oanapl To address the possible issues you listed at https://github.com/Azure/service-fabric-issues/issues/1170:
@WhitWaldo one thing I noticed from the task logs is that the SF SDK version differs between the two runs. The successful one ran with SDK version 3.1.274.9494 and the failed one ran with SDK 3.2.176.9494. These two might have run on different agents, or the SF SDK might have been re-installed on the agent. The task versions are the same, so there is no change in task behavior.
@WhitWaldo , the fastest way to make progress is to open a support ticket. We will need Service Fabric traces from the cluster and the client machine (where the VSTS agent is running). If you don't know how to collect them, you will get instructions as part of the ticket processing. If you have the traces, you can also send them to me directly at oanapl at microsoft.com
@WhitWaldo , Hope you are taking this up with the SF team. Can we close this issue?
@rgovardhms Yes, I'm currently engaged in a support ticket about this now. This issue can be closed here.
What was the solution here? We've been struggling with this for months :(
Can you provide more details on the failure you are seeing and consider opening a new issue?
@tudorsibiu90 Ultimately we concluded it was an occasional failure of my network connection. I resolved it by switching from an onsite build server to one hosted in Azure. It was never clear why it suddenly started acting up or why I only ever appeared to have an issue with uploads after a build, but the resolution was an easy enough change.
Troubleshooting
Checkout how to troubleshoot failures and collect debug logs: https://docs.microsoft.com/en-us/vsts/build-release/actions/troubleshooting
Environment
Server - VSTS
Account Name: ebiquity-na
Team project: Hyperion
Build definition: Hyperion Full Cluster
Release numbers: Release-185, attempts 1-4; Release-186, attempt 1 (worked on attempt 2); Release-187, attempts 1-7; Release-189; Release-190, attempts 1-2; Release-191; Release-192; Release-193, attempt 1 (worked on attempt 2); Release-195
Agent - Hosted or Private: Private agent, Windows Server 2016 Standard, agent version 2.138.6 (upgraded after Release-192 or so; previously 2.134.2)
Issue Description
Using the Service Fabric deployment task built into VSTS, I've consistently seen the deployment work fine up to stage 3 (the actual deployment), at which point it gets to "Copying application to image store..." and never moves on. If I specify a timeout for the deployment, it eventually hits that timeout and fails. If I don't specify a timeout, it almost always sits on that step indefinitely (19 hours is the longest run so far).
I have only started observing this issue in the last week - I hadn't experienced it in the past year of using VSTS, and I didn't change the definition in any way before this started happening. I have a support ticket open about it, 118081718812701, but we're not making much headway apart from richer logs still indicating that it gets to that step and stops.
Halfway through the week, I attempted upgrading to the latest version of Service Fabric; it didn't make a difference. I upgraded the VSTS agent on my build server, found similar issues raised in this repo, and modified my release definition to explicitly include the timeouts I already had defined in my Cloud.xml file and to enable diff releases and compressed file copy. No significant change (it worked once after making the change yesterday, and I haven't been able to deploy successfully since).
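For reference, the timeouts and copy options I mean live in the publish profile (Cloud.xml). A sketch with illustrative placeholder values for the endpoint and durations - not my actual profile:

```xml
<PublishProfile xmlns="http://schemas.microsoft.com/2015/05/fabrictools">
  <ClusterConnectionParameters ConnectionEndpoint="mycluster.eastus.cloudapp.azure.com:19000" />
  <!-- Copy timeout and compression, mirrored by the VSTS task options -->
  <CopyPackageParameters CompressPackage="true" CopyPackageTimeoutSec="1800" />
  <UpgradeDeployment Mode="Monitored" Enabled="true">
    <Parameters FailureAction="Rollback" UpgradeReplicaSetCheckTimeoutSec="3600" />
  </UpgradeDeployment>
</PublishProfile>
```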
For the last year, a brand new deployment has taken about 8 minutes to complete (if the application doesn't already exist in the cluster) and an upgrade about 24 minutes. Looking at the Service Fabric cluster itself, it never registers that anything is attempting to perform an upgrade or copy files, so as far as I can tell, VSTS never actually starts the file transfer. As a result, my logs show that I cancelled after half an hour or longer: when I looked at the cluster, saw no deployment in progress, and the step was still pending in VSTS, I cancelled so I could retry. This didn't really change the failure pattern.
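To take the VSTS task out of the picture, the stuck step can be reproduced manually from the agent machine with the same cmdlet the task ultimately runs for this stage. A sketch - the endpoint, package path, and image store string are placeholders:

```powershell
Connect-ServiceFabricCluster -ConnectionEndpoint "mycluster.eastus.cloudapp.azure.com:19000"

# This is the "Copying application to image store..." step; an explicit
# timeout makes a hang fail fast instead of sitting for hours
Copy-ServiceFabricApplicationPackage `
    -ApplicationPackagePath "C:\drop\MyApp\pkg" `
    -ImageStoreConnectionString "fabric:ImageStore" `
    -CompressPackage `
    -TimeoutSec 1800
```

If this hangs the same way from the agent machine but succeeds from a dev machine, that points at the network path or SDK version on the agent rather than the task itself.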
Error logs
2018-08-25T22:02:51.1333266Z ##[debug]Leaving New-DiffPackage.
2018-08-25T22:02:51.3927998Z Service fabric SDK version: 3.2.176.9494.
2018-08-25T22:04:20.2931505Z ##[debug]Join-Path "C:\Program Files\Microsoft Service Fabric\bin\Fabric\Fabric.Code" "ServiceFabricServiceModel.xsd"
2018-08-25T22:04:20.3039173Z ##[debug]C:\Program Files\Microsoft Service Fabric\bin\Fabric\Fabric.Code\ServiceFabricServiceModel.xsd
2018-08-25T22:04:20.3069561Z ##[debug]Length: 94
2018-08-25T22:04:20.3099711Z ##[debug]
2018-08-25T22:04:20.3146099Z ##[debug]Test-Path "C:\Program Files\Microsoft Service Fabric\bin\Fabric\Fabric.Code\ServiceFabricServiceModel.xsd"
2018-08-25T22:04:20.3619249Z ##[debug]True
2018-08-25T22:04:20.3651149Z ##[debug]
2018-08-25T22:05:01.5760455Z Copying application to image store...

... And then after half an hour or longer of no progress being made ...

2018-08-25T22:35:15.7445675Z ##[debug]Re-evaluate condition on job cancellation for step: 'Deploy Service Fabric Application'.
2018-08-25T22:35:26.2119865Z ##[error]The operation was canceled.
2018-08-25T22:35:26.2216621Z ##[debug]System.OperationCanceledException: The operation was canceled.
at System.Threading.CancellationToken.ThrowOperationCanceledException()
at Microsoft.VisualStudio.Services.Agent.Util.ProcessInvoker.d26.MoveNext()
--- End of stack trace from previous location where exception was thrown ---
at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
at System.Runtime.CompilerServices.TaskAwaiter.ThrowForNonSuccess(Task task)
at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
at Microsoft.VisualStudio.Services.Agent.ProcessInvokerWrapper.d 12.MoveNext()
--- End of stack trace from previous location where exception was thrown ---
at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
at System.Runtime.CompilerServices.TaskAwaiter.ThrowForNonSuccess(Task task)
at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
at Microsoft.VisualStudio.Services.Agent.Worker.Handlers.DefaultStepHost.d7.MoveNext()
--- End of stack trace from previous location where exception was thrown ---
at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
at System.Runtime.CompilerServices.TaskAwaiter.ThrowForNonSuccess(Task task)
at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
at Microsoft.VisualStudio.Services.Agent.Worker.Handlers.PowerShell3Handler.d 4.MoveNext()
--- End of stack trace from previous location where exception was thrown ---
at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
at System.Runtime.CompilerServices.TaskAwaiter.ThrowForNonSuccess(Task task)
at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
at Microsoft.VisualStudio.Services.Agent.Worker.TaskRunner.d24.MoveNext()
--- End of stack trace from previous location where exception was thrown ---
at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
at System.Runtime.CompilerServices.TaskAwaiter.ThrowForNonSuccess(Task task)
at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
at Microsoft.VisualStudio.Services.Agent.Worker.StepsRunner.d 1.MoveNext()
2018-08-25T22:35:26.2225561Z ##[section]Finishing: Deploy Service Fabric Application
All my error logs resemble this - I've submitted all the debug errors to the support ticket 118081718812701 and would be happy to email them to someone in particular if you want a copy.
As a result, of the 20ish attempts I've made to update my cluster, only two have been successful. This is a major blocker for me as I cannot reliably deploy anything via VSTS's continuous integration into my cluster and haven't now for over a week. I appreciate you taking a look into this.