microsoft / service-fabric

Service Fabric is a distributed systems platform for packaging, deploying, and managing stateless and stateful distributed applications and containers at large scale.
https://docs.microsoft.com/en-us/azure/service-fabric/
MIT License

Copy-ServiceFabricApplicationPackage to image store never finishes #602

Open jdvor opened 5 years ago

jdvor commented 5 years ago

I'm unable to deploy a Service Fabric application to an Azure Service Fabric cluster; specifically, I can't copy the deployment package to the image store using the PowerShell scripts from the Service Fabric SDK.

Azure Service Fabric cluster:
New single-node cluster (VM is Ubuntu 16.04)
Service Fabric version: 6.4.639.1
Nothing has been deployed on it yet. Both the Azure blade and Service Fabric Explorer show the node as green / ready with no ongoing upgrades.

Client machine:
Microsoft Azure Service Fabric SDK 3.3.654
Using PowerShell to deploy the app: C:\Program Files\Microsoft SDKs\Service Fabric\Tools\PSModule\ServiceFabricSDK\Publish-NewServiceFabricApplication.ps1

Copy-ServiceFabricApplicationPackage (L:247) hangs and never proceeds beyond a few KB transferred. I've added -ShowProgress -ShowProgressIntervalMilliseconds 3000 to see more details. It shows that some bytes were transferred and then it stops (usually at 2-4 KB).

So far I've tried (on client machine):

I don't know how to SSH into the VM yet, but once I find out I will add any relevant details. I've also quickly browsed through the cluster logs, which are shoveled to Azure Table Storage, but so far I have not found anything interesting or pertaining to the image store.

olitomlinson commented 5 years ago

Can confirm the same issue.

I've tried multiple dev environments on different machines (including VS2017 and VS2019) and different service fabric solutions that I know were deploying just fine on Friday.

Deployment hangs indefinitely on Copying application to image store...

Cluster version : 6.4.654.9590 Code version : 6.4.638

jdvor commented 5 years ago

It has succeeded this morning. I'm closing the ticket as this is likely an infrastructure problem and not something in code.

olitomlinson commented 5 years ago

Yes this is working for me also this morning!

@jdvor I don't believe you should close this. Please can you reopen.

There was clearly a problem that affected our ability to deploy packages to the cluster. If there was a live incident with the software and we needed to roll out an urgent fix to satisfy our customers, we would have been unable to do so, which is not acceptable.

@masnider can you shed any light on what happened please?

jdvor commented 5 years ago

@olitomlinson ok, re-opening.

I'm guessing it would be more effective to report infrastructure issues through an Azure Portal technical support request (especially if one has paid support). Just let me know if this issue belongs here or not.

olitomlinson commented 5 years ago

Thanks so much @jdvor :)

My gut tells me this is a regression as it happened before after a cluster upgrade roll out, which is why my first reaction is to see if someone from the SF Team has a probable cause before I go through the long-winded and painful process of raising a Support Ticket for an incident that on paper is already 'resolved'.

What do you recommend @masnider ? Thanks!

madisvel commented 5 years ago

I have tried for three days already, but it's failing all the time for me. Did you guys have any luck with support tickets?

olitomlinson commented 5 years ago

@madisvel I never got round to raising an issue with Azure Support. Let's just say me and Azure Support are not seeing eye-to-eye lately, and I am reluctant to engage with them.

Anyway, this is happening again for me now after using it fine for days. So frustrating!!

madisvel commented 5 years ago

I finally managed to upload the full application. Before, I was using VS publish, which compresses the folders into a zip archive, but then I tried with just "Package". This does not zip the package, and it uploaded and replicated successfully (however, the upload is very slow).

olitomlinson commented 5 years ago

Copy-ServiceFabricApplicationPackage -ApplicationPackagePath $path -CompressPackage -SkipCopy

Copy-ServiceFabricApplicationPackage -ApplicationPackagePath $path -ApplicationPackagePathInImageStore MyAppV1

I'm using PowerShell directly to package and compress the app as above.

It's been stuck on the second statement for about 20 minutes now, trying to upload a 175 MB package.

Any ideas how I can check the progress of the upload?

I'm pretty confident this is not environmental as it does the same on my two dev environments. I've even tried using a different internet connection in case it was my broadband supplier. But with no luck.

I can still administer the cluster just fine through Powershell and use my SF applications that are already running on the cluster.

It just seems to not want to upload an application package at very random times. Very very frustrating.

jdvor commented 5 years ago

@olitomlinson if you add -ShowProgress -ShowProgressIntervalMilliseconds 3000 as parameters to Copy-ServiceFabricApplicationPackage, it should show a progress bar and the transferred KB count.

olitomlinson commented 5 years ago

Ah yes thanks @jdvor I did this and it uploads very slowly. Sometimes it completes, sometimes it just hangs. I can't figure out a pattern.

Definitely an issue somewhere!

madisvel commented 5 years ago

@olitomlinson try without -CompressPackage flag.

jdvor commented 5 years ago

We've just been hit hard by this issue at the most inopportune moment.

jdvor commented 5 years ago

"@olitomlinson try without -CompressPackage flag."

That does not change anything. It usually gets as far as 8 KB to several MB, then stops.

olitomlinson commented 5 years ago

@jdvor it can't be coincidence that you and I have both been affected today, as we both were 20 days ago when you opened this case. I wonder if we are sharing the same underlying infrastructure/host.

Either way, I'm going to raise an Azure Support Case, but I'm about to go on vacation so I won't be able to progress it for a few weeks.

jdvor commented 5 years ago

@olitomlinson it's quite possible. We are deploying to West-Europe DC. Azure Service Fabric cluster has nodes with Ubuntu 16.04. Currently we are considering writing our own deployment scripts bypassing upload to image store (uploading to blob storage and provisioning from there; if that's possible - I don't know yet). And to be honest also abandoning SF as we are close to go-live and we can't realistically afford to have 2-3 day window when we can't deploy every couple of weeks.

tudorsibiu90 commented 5 years ago

Any updates? We have been suffering for months like this. It happens in the morning, totally random

jdvor commented 5 years ago

Some useful information I've gathered from Azure support. In the end it was all inconclusive, but it might help you anyway.

From analysis done by someone from SFC product group:

"The trace shows the error FABRIC_E_GATEWAY_NOT_REACHABLE around the time reported in incident. This error could be resulted by high load in IPC transport between fabricgateway and fabric."

I was not able to get clear information on what exactly is meant by "fabricgateway", because I thought everything was hosted by our Azure VMs, and if we are not doing anything and the cluster is empty, how could there be "high load in IPC"? Perhaps there are still some shared infrastructure components...

"To reduce the load in the same transport, 6.5 RTO (completed but not released yet) will have an improvement by routing file transfer to another channel. The next version would address this issue"

So the PG is doing something about that in the next release, which is scheduled (unconfirmed) for the end of May.

Also, when reporting the issue it is important to try to capture network traffic from your side using tools like Network Monitor or Wireshark. This is something you have to plan ahead and be prepared to do when the issue arises.

In the end we decided to abandon SFC apps/hosting, so I will not be following this issue anymore.

olitomlinson commented 5 years ago

Thanks for the update @jdvor !

nates321 commented 5 years ago

I want to bump this. When trying to deploy applications to a cluster (we have a couple of clusters where this happens), this command fails about a third of the time for one of the packages. Normally it takes around 1 to 2 minutes for this package to upload, but when it fails it times out after the 10-minute timeout we set. Also, sometimes the command doesn't actually respect the timeout we set and runs until our VSTS build timeout is reached.
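For intermittent failures like this, one option is to wrap the copy in a small retry loop. The sketch below is only an illustration: Invoke-WithRetry is a hypothetical helper, not part of the Service Fabric SDK, and the attempt count and delay are arbitrary defaults. It retries a script block when it throws; it cannot help in the case where the cmdlet hangs past its own timeout.

```powershell
# Hypothetical helper (not part of the Service Fabric SDK): runs a script block
# and retries it when it throws, up to $MaxAttempts times.
function Invoke-WithRetry {
    param(
        [Parameter(Mandatory)] [scriptblock] $Action,
        [int] $MaxAttempts = 3,
        [int] $DelaySeconds = 30
    )
    for ($attempt = 1; $attempt -le $MaxAttempts; $attempt++) {
        try {
            # Return the action's output on the first successful attempt.
            return & $Action
        }
        catch {
            # Re-throw the original error once the last attempt has failed.
            if ($attempt -eq $MaxAttempts) { throw }
            Write-Warning "Attempt $attempt of $MaxAttempts failed: $($_.Exception.Message)"
            Start-Sleep -Seconds $DelaySeconds
        }
    }
}

# Usage (assumes an open cluster connection; $path and 'MyAppV1' are placeholders):
# Invoke-WithRetry -Action {
#     Copy-ServiceFabricApplicationPackage -ApplicationPackagePath $path `
#         -ApplicationPackagePathInImageStore 'MyAppV1' `
#         -TimeoutSec 600 -ErrorAction Stop
# }
```

-ErrorAction Stop is there so that a failed copy surfaces as an exception the loop can catch; -TimeoutSec 600 mirrors the 10-minute timeout mentioned above.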

manu-amiel commented 5 years ago

Same issue for me. I've been trying to deploy since yesterday evening and I'm running into trouble too. What the hell is that?

I followed every procedure in this thread and got the same result: I'm stuck. I'm also using a custom PS script because the VS wizard definitely sucks...

Should we consider moving to AWS to avoid painful release deployments?

olitomlinson commented 5 years ago

@dkkapur hey Deep, can this issue get some attention please?

Not being able to copy an application package to the cluster reliably is falling at the first hurdle. Potential Service Fabric customers are likely to lose confidence in the tech and abandon it, which is a real shame!

@nates321 @manu-amiel There is another way to copy an application package to the cluster, as documented here. But the documentation is weak, so you need this extra bit of info from this comment.

@dkkapur I actually prefer this pull model. Would it make sense to look at the suitability of making it the primary advocated method of copying application packages to a cluster? Sure, it's more steps than the client directly copying the application package, but a pull model feels better suited.
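The links above did not survive, so the following is a rough sketch under an assumption: that the pull model means provisioning from an external store, where the package is zipped into an .sfpkg file, uploaded somewhere the cluster can download it from (for example blob storage), and registered by URI. New-SfpkgFromFolder is a hypothetical helper, and the URI, type name and version in the comments are placeholders, not values from this thread.

```powershell
# Sketch only. Assumes "pull model" means provisioning from an external store.
Add-Type -AssemblyName System.IO.Compression.FileSystem

# Hypothetical helper: zips the contents of an application package folder
# into an .sfpkg file (a zip archive with a different extension).
function New-SfpkgFromFolder {
    param(
        [Parameter(Mandatory)] [string] $PackageFolder,
        [Parameter(Mandatory)] [string] $SfpkgPath
    )
    # Resolve to absolute paths, because .NET does not follow the PowerShell location.
    $source = (Resolve-Path -LiteralPath $PackageFolder).ProviderPath
    $target = $ExecutionContext.SessionState.Path.GetUnresolvedProviderPathFromPSPath($SfpkgPath)
    # Folder contents (ApplicationManifest.xml etc.) end up at the root of the archive.
    [System.IO.Compression.ZipFile]::CreateFromDirectory($source, $target)
}

# Usage (placeholders throughout; assumes an open cluster connection):
# New-SfpkgFromFolder -PackageFolder $path -SfpkgPath '.\MyApp.sfpkg'
# ...upload MyApp.sfpkg to a location the cluster can reach over HTTP(S)...
# Register-ServiceFabricApplicationType `
#     -ApplicationPackageDownloadUri 'https://<account>.blob.core.windows.net/<container>/MyApp.sfpkg' `
#     -ApplicationTypeName 'MyAppType' -ApplicationTypeVersion '1.0.0' -Async
```

With this approach the client never calls Copy-ServiceFabricApplicationPackage; the cluster downloads the package itself, so the type name and version passed at registration must match the application manifest inside the archive.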

manu-amiel commented 5 years ago

Hi guys,

@olitomlinson thanks for the information.

I spent a couple of hours with an MS technician and got some explanations of what goes wrong. First, open the following link:

http://azurespeed.com/Azure/Upload

Uncheck/check the proper regions depending on your resource locations, then launch the test. It seems that internet providers' connections to the Azure network are not always good.

An alternative to deploy when your connection sucks is the following one :

It worked for me but, honestly, it's quite tricky and painful... I hope connections will be better between networks.

Have a good day

jaydavid commented 4 years ago

I don't know if this will be helpful to anyone, but I was facing this issue for a while and eventually figured out what was causing it - at least I believe I did.

I had Docker for Windows and Kubernetes set up locally through Hyper-V as well as a minikube vm for some other projects. My network traffic, as a consequence, was always routed through a network bridge and a vEthernet connection (minikube).

I'm not a network engineer or anything, so I was kind of just poking around there, but I ended up removing the minikube vm, disabling the vEthernet adapters, and un-bridging my wifi connection and it immediately worked.

Hopefully this helps at least one person facing this issue.

Adebeer commented 3 years ago

This issue is a real pain as we frequently have to retry registering packages multiple times.

Part of the issue is that we're a global company spanning multiple regions around the globe. In our case we have 3 global regions, and all devs use the same Azure DevOps namespace hosted in North America. The problem is that, because we're using hosted Azure agents, these are all hosted in North America (there's no customization around this as far as I'm aware), so for our clusters hosted in other regions, Copy Package frequently times out.

Our option thus seems to be to stop using Azure-hosted agents and opt for a local agent instead (although from other users' experience that doesn't seem to be a foolproof solution), and/or perhaps to use a workaround that copies packages to a local location first and deploys from there. All in all, a bit of a sub-optimal experience: most of our packages are only ~300MB, so they're not even that large. We've set our register/copy Azure DevOps task to time out after 15 minutes because, when it works, it typically takes < 5 minutes; increasing the timeout or waiting longer doesn't seem to achieve anything other than blocking build agents for other builds.

2021-07-26T03:15:23.7787726Z Using ImageStoreConnectionString='fabric:ImageStore'
2021-07-26T03:26:41.4738544Z ##[debug]Re-evaluate condition on job cancellation for step: 'Register Package in SF Image Store'.
2021-07-26T03:26:51.6488922Z ##[error]The operation was canceled.
2021-07-26T03:26:51.6496177Z ##[debug]System.OperationCanceledException: The operation was canceled.
   at System.Threading.CancellationToken.ThrowOperationCanceledException()
   at Microsoft.VisualStudio.Services.Agent.Util.ProcessInvoker.ExecuteAsync(String workingDirectory, String fileName, String arguments, IDictionary`2 environment, Boolean requireExitCodeZero, Encoding outputEncoding, Boolean killProcessOnCancel, InputQueue`1 redirectStandardIn, Boolean inheritConsoleHandler, Boolean keepStandardInOpen, Boolean highPriorityProcess, CancellationToken cancellationToken)
   at Microsoft.VisualStudio.Services.Agent.ProcessInvokerWrapper.ExecuteAsync(String workingDirectory, String fileName, String arguments, IDictionary`2 environment, Boolean requireExitCodeZero, Encoding outputEncoding, Boolean killProcessOnCancel, InputQueue`1 redirectStandardIn, Boolean inheritConsoleHandler, Boolean keepStandardInOpen, Boolean highPriorityProcess, CancellationToken cancellationToken)
   at Microsoft.VisualStudio.Services.Agent.Worker.Handlers.DefaultStepHost.ExecuteAsync(String workingDirectory, String fileName, String arguments, IDictionary`2 environment, Boolean requireExitCodeZero, Encoding outputEncoding, Boolean killProcessOnCancel, Boolean inheritConsoleHandler, CancellationToken cancellationToken)
   at Microsoft.VisualStudio.Services.Agent.Worker.Handlers.PowerShell3Handler.RunAsync()
   at Microsoft.VisualStudio.Services.Agent.Worker.TaskRunner.RunAsync()
   at Microsoft.VisualStudio.Services.Agent.Worker.StepsRunner.RunStepAsync(IStep step, CancellationToken jobCancellationToken)