microsoft / service-fabric-issues

This repo is for the reporting of issues found with Azure Service Fabric.

Unable to start SF cluster in Docker on macOS: hosting open failed with MessageExpired #1428

Closed alexeyzimarev closed 5 years ago

alexeyzimarev commented 5 years ago

I followed the instructions at https://docs.microsoft.com/en-us/azure/service-fabric/service-fabric-get-started-mac

When I start the container that I have built, I get the following in the logs:

Drop folder path is set to '/home/ClusterDeployer/ClusterData'
Current machine IP Address is 172.17.0.2
Deleting old data root content
rm: cannot remove 'ClusterManifest.SingleMachine.Replaced.xml': No such file or directory
Setting environment for FabricHost & FabricDeployer to run
Running CoreCLR FabricDeployer on ClusterManifest.SingleMachine.xml
sh: 1: iptables: not found
Starting tracing session
Error: No session daemon is available
Error: Command error
Error: No session daemon is available
Error: Command error
Cleaning up accounts and groups
Setting up accounts and groups
chown: invalid group: ‘sfappsuser:sfappsuser’
./ClusterDeployer.sh: line 187: sudo: command not found
chmod: cannot access '/var/lib/waagent': No such file or directory
./ClusterDeployer.sh: line 191: sudo: command not found
chmod: cannot access '/var/lib/sfcerts': No such file or directory
Running FabricDeployer
sh: 1: iptables: not found
Starting FabricHost
Starting Fabric Host as console application.
Opening Fabric node.
Opening Fabric node.
Opening Fabric node.
hosting open failed with MessageExpired
EventName: NodeOpenFailed Category: StateTransition Node name: N0010 has failed to open with upgrade domain: , fault domain: , address: 172.17.0.2, hostname: 4f1a703e0800, isSeedNode: true, versionInstance: 6.4.625.1:v1, id: c2e9eff19761acc9924422c53c8943d0, dca instance: 131929857634247300, error: MessageExpired
Fabric Node open failed with error code = MessageExpired
Fabric Node open failed with error code = MessageExpired
hosting open failed with MessageExpired
EventName: NodeOpenFailed Category: StateTransition Node name: N0030 has failed to open with upgrade domain: , fault domain: , address: 172.17.0.2, hostname: 4f1a703e0800, isSeedNode: true, versionInstance: 6.4.625.1:v1, id: b9d52c016a15a8f57673d3b8041e2d35, dca instance: 131929857635240270, error: MessageExpired
EventName: NodeAborted Category: StateTransition Node name: N0010 has aborted with upgrade domain: , fault domain: , address: 172.17.0.2, hostname: 4f1a703e0800, isSeedNode: true, versionInstance: 6.4.625.1:v1, id: c2e9eff19761acc9924422c53c8943d0, dca instance: 131929857634247300
Fabric Node open failed with error code = MessageExpired
Fabric Node open failed with error code = MessageExpired
EventName: NodeAborted Category: StateTransition Node name: N0030 has aborted with upgrade domain: , fault domain: , address: 172.17.0.2, hostname: 4f1a703e0800, isSeedNode: true, versionInstance: 6.4.625.1:v1, id: b9d52c016a15a8f57673d3b8041e2d35, dca instance: 131929857635240270
Opening Fabric node.
Opening Fabric node.
hosting open failed with MessageExpired
EventName: NodeOpenFailed Category: StateTransition Node name: N0020 has failed to open with upgrade domain: , fault domain: , address: 172.17.0.2, hostname: 4f1a703e0800, isSeedNode: true, versionInstance: 6.4.625.1:v1, id: cf68563e16a44f808e86197a9cf83de5, dca instance: 131929857635362710, error: MessageExpired
Fabric Node open failed with error code = MessageExpired
Fabric Node open failed with error code = MessageExpired
EventName: NodeAborted Category: StateTransition Node name: N0020 has aborted with upgrade domain: , fault domain: , address: 172.17.0.2, hostname: 4f1a703e0800, isSeedNode: true, versionInstance: 6.4.625.1:v1, id: cf68563e16a44f808e86197a9cf83de5, dca instance: 131929857635362710
Opening Fabric node.

http://localhost:19080/ returns an empty response.

anantshankar17 commented 5 years ago

@alexeyzimarev can you please check whether there is sufficient disk space for the container? I think around 60 GB of free space is required for the 3-node cluster to come up healthy. If that much is already available, can you please share the traces from the dev cluster folder?
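For anyone following along, here is a minimal sketch of that free-space check, meant to be run inside the container (for example after `docker exec -it <container> sh`). The helper names `free_gib` and `has_free_gib` are hypothetical, and treating the "around 60 GB" figure as 60 GiB on `/` is an assumption made only for illustration.

```shell
#!/bin/sh
# Hypothetical helpers, not part of Service Fabric or Docker.

# Print the free space, in whole GiB, of the filesystem that holds the given path.
free_gib() {
    # df -Pk uses the POSIX output format with 1 KiB blocks;
    # the fourth column of the second line is the available space.
    df -Pk "$1" | awk 'NR == 2 { printf "%d\n", $4 / 1024 / 1024 }'
}

# Succeed only if the filesystem holding path $1 has at least $2 GiB free.
has_free_gib() {
    avail=$(free_gib "$1")
    [ -n "$avail" ] && [ "$avail" -ge "$2" ]
}

# 60 is taken from the estimate in the comment above; adjust as needed.
if has_free_gib / 60; then
    echo "at least 60 GiB free on /"
else
    echo "less than 60 GiB free on /"
fi
```

Plain `df -h` gives the same information in human-readable form; the functions only make the threshold explicit.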

tastyeggs commented 5 years ago

I have the same issue. Disk space was the cause on a different Mac; here I have given Docker enough disk space, yet it still appears to be stuck.

Would be nice to have some documentation on where to find the logs of what's actually going on when the cluster is starting (docker logs shows exactly the same output as above).

alexeyzimarev commented 5 years ago

It might be the size of the Docker disk in Docker for Mac settings. The default there is 60 GiB.

tastyeggs commented 5 years ago

That's not the case for me; I've verified, and set it to 150G. I've also exec'ed into the docker container at this point, and there's plenty of free space.

It was the issue on my first attempt (on a different machine); once I fixed it there, the cluster came up.

tastyeggs commented 5 years ago

@suhuruli would be nice to get some advice on this.

This is blocking further adoption of Service Fabric within my company, as some members of my team are running Macs.

anantshankar17 commented 5 years ago

Hi @tastyeggs, can you please exec into your container, run "df -h", and share the output? I suspect the Docker overlay is eating up the space on one of your impacted machines. Even if the Docker setting lets you raise the maximum disk size for the overlay to 150G, that doesn't mean 150G of free space is actually available on the disk. The Docker settings show a Docker.raw disk image location; what is the size of that file? Can you also run df -h on the root of the path where Docker.raw resides? If the Docker overlay itself is using a lot of space, there isn't enough free space at the Docker.raw image location and the container will not work properly. That could be this issue: https://github.com/docker/for-mac/issues/2297. A quick mitigation would be to reset Docker to factory settings. You may also look at this for other ways of reducing disk usage: https://djs55.github.io/jekyll/update/2017/11/27/docker-for-mac-disk-space.html. Let us know.
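The host-side checks could be sketched roughly as follows, to be run on the Mac itself rather than in the container. The default path below is an assumption about where Docker for Mac keeps its disk image; the actual location is whatever the Docker settings show, and `image_report` is a hypothetical helper name.

```shell
#!/bin/sh
# Assumed default location; override DOCKER_RAW with the "disk image location"
# shown in the Docker settings if it differs.
DOCKER_RAW="${DOCKER_RAW:-$HOME/Library/Containers/com.docker.docker/Data/vms/0/Docker.raw}"

# Hypothetical helper: report how much space the image really occupies and
# how much is free on the filesystem that holds it.
image_report() {
    img=$1
    if [ ! -f "$img" ]; then
        echo "disk image not found: $img" >&2
        return 1
    fi
    # du reports the blocks actually allocated, which can be smaller than the
    # size shown by ls -l if the image is a sparse file.
    allocated_kb=$(du -k "$img" | awk '{ print $1 }')
    echo "allocated: ${allocated_kb} KiB"
    # Free space on the filesystem where the image resides.
    df -h "$(dirname "$img")"
}

image_report "$DOCKER_RAW" || echo "set DOCKER_RAW to the image location and re-run"
```

For the container side, `docker exec <container> df -h` prints the same table as running `df -h` from an interactive shell in the container.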

alexeyzimarev commented 5 years ago

@tastyeggs why are you asking me? I have the same issue.

tastyeggs commented 5 years ago

@alexeyzimarev that was accidental tagging, sorry!

@anantshankar17, sadly I can no longer reproduce this after rebooting my Mac. I'm going to repeat the process on a fresh Mac and will report back if I hit the same issue. Thanks for your guidance.