jonaslb opened this issue 7 months ago
`replicas: 3` will happily start all 3 replicas on a single node if that is the only available node given the service's constraints. `mode: global`, on the other hand, acts as if an implicit `max_replicas_per_node: 1` were in play, but it should only consider nodes that meet the explicit and implicit deployment constraints for scheduling.
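For reference, a minimal compose sketch of the two scheduling modes being contrasted; the service names and the `node.labels.csi` constraint are made up for illustration:

```yaml
services:
  spread-replicated:                 # hypothetical service name
    image: busybox
    command: sleep infinity
    deploy:
      mode: replicated
      replicas: 3
      placement:
        max_replicas_per_node: 1     # caps tasks per node, roughly like global mode
        constraints:
          - node.labels.csi == true  # made-up label marking nodes with the plugin

  one-per-node:                      # hypothetical service name
    image: busybox
    command: sleep infinity
    deploy:
      mode: global                   # one task on every node that meets the constraints
      placement:
        constraints:
          - node.labels.csi == true
```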
I see what you're implying but no, the three replicas did not just all happen to start on a single functional node. `mode: global` will happily schedule tasks on nodes without the driver, by the way, but as mentioned, I've used constraints on the labels to exclude those. `docker service ps` reveals that the tasks get scheduled on the correct nodes, where it works fine (i.e. containers are created and started) with `replicas: N` but not with `mode: global`.

Curiously, doing both at the same time (starting the same test service both with `mode: global` and with `replicas: N`, see the sketch below) allows the `mode: global` tasks to start just fine. This feels like a docker/swarmkit bug.
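A sketch of that dual deployment, assuming a pre-created CSI cluster volume called `docker-volume-name` (the service names and label are illustrative, and `type: cluster` mounts require a Docker release with Swarm CSI support):

```yaml
services:
  test-global:                       # hypothetical name for the global copy
    image: busybox
    command: sleep infinity
    volumes:
      - type: cluster
        source: docker-volume-name   # the shared CSI cluster volume
        target: /data
    deploy:
      mode: global
      placement:
        constraints:
          - node.labels.csi == true  # made-up label for nodes with the plugin

  test-replicated:                   # hypothetical name for the replicated copy
    image: busybox
    command: sleep infinity
    volumes:
      - type: cluster
        source: docker-volume-name
        target: /data
    deploy:
      replicas: 3
      placement:
        constraints:
          - node.labels.csi == true

volumes:
  docker-volume-name:
    external: true                   # created ahead of time via the CSI plugin
```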
Does `docker service ps --no-trunc <service name>` tell you anything about why it is pending?

However, it might be that this is simply an untested feature. One would need to build Docker from source and add some extra logging to see where it fails.
Unfortunately there is no extra info with the flag. The full "Current State" description is "Preparing x minutes ago". No containers for the service appear with `docker ps -a` on the nodes.

When I start the service with `replicas: N` and look at the plugin logs (`cat /var/run/docker/plugins/xxxxx/*std*`) on a node that runs a task, I see this:
```
I0423 12:05:21.457352 8 utils.go:76] GRPC call: /csi.v1.Node/NodeStageVolume
I0423 12:05:21.457363 8 utils.go:77] GRPC request: {"secrets":"***stripped***","staging_target_path":"/data/staged/ivqh8gwq1911ol4m0tylqf3fw","volume_capability":{"AccessType":{"Mount":{}},"access_mode":{"mode":5}},"volume_context":{"ondelete":"retain","source":"//SMBHOST/SMBSHARE/","subdir":"subdir"},"volume_id":"SMBHOST/SMBSHARE#subdir#docker-volume-name#retain"}
I0423 12:05:21.564442 8 nodeserver.go:209] NodeStageVolume: targetPath(/data/staged/ivqh8gwq1911ol4m0tylqf3fw) volumeID(SMBHOST/SMBSHARE#subdir#docker-volume-name#retain) context(map[ondelete:retain source://SMBHOST/SMBSHARE/ subdir:subdir]) mountflags([]) mountOptions([])
I0423 12:05:21.856709 8 nodeserver.go:415] already mounted to target /data/staged/ivqh8gwq1911ol4m0tylqf3fw
I0423 12:05:21.856743 8 nodeserver.go:217] NodeStageVolume: already mounted volume SMBHOST/SMBSHARE#subdir#docker-volume-name#retain on target /data/staged/ivqh8gwq1911ol4m0tylqf3fw
I0423 12:05:21.856756 8 utils.go:83] GRPC response: {}
I0423 12:05:21.857515 8 utils.go:76] GRPC call: /csi.v1.Node/NodePublishVolume
I0423 12:05:21.857544 8 utils.go:77] GRPC request: {"secrets":"***stripped***","staging_target_path":"/data/staged/ivqh8gwq1911ol4m0tylqf3fw","target_path":"/data/published/ivqh8gwq1911ol4m0tylqf3fw","volume_capability":{"AccessType":{"Mount":{}},"access_mode":{"mode":5}},"volume_context":{"ondelete":"retain","source":"//SMBHOST/SMBSHARE/","subdir":"subdir"},"volume_id":"SMBHOST/SMBSHARE#subdir#docker-volume-name#retain"}
I0423 12:05:21.858098 8 nodeserver.go:81] NodePublishVolume: mounting /data/staged/ivqh8gwq1911ol4m0tylqf3fw at /data/published/ivqh8gwq1911ol4m0tylqf3fw with mountOptions: [bind] volumeID(SMBHOST/SMBSHARE#subdir#docker-volume-name#retain)
I0423 12:05:21.858129 8 mount_linux.go:218] Mounting cmd (mount) with arguments ( -o bind /data/staged/ivqh8gwq1911ol4m0tylqf3fw /data/published/ivqh8gwq1911ol4m0tylqf3fw)
I0423 12:05:21.859738 8 mount_linux.go:218] Mounting cmd (mount) with arguments ( -o bind,remount /data/staged/ivqh8gwq1911ol4m0tylqf3fw /data/published/ivqh8gwq1911ol4m0tylqf3fw)
I0423 12:05:21.861749 8 nodeserver.go:88] NodePublishVolume: mount /data/staged/ivqh8gwq1911ol4m0tylqf3fw at /data/published/ivqh8gwq1911ol4m0tylqf3fw volumeID(SMBHOST/SMBSHARE#subdir#docker-volume-name#retain) successfully
I0423 12:05:21.861785 8 utils.go:83] GRPC response: {}
```
When I start the service with `mode: global`, I only get this:
```
I0423 12:13:59.129434 8 utils.go:76] GRPC call: /csi.v1.Node/NodeUnpublishVolume
I0423 12:13:59.129451 8 utils.go:77] GRPC request: {"target_path":"/data/published/ivqh8gwq1911ol4m0tylqf3fw","volume_id":"SMBHOST/SMBSHARE#subdir#docker-volume-name#retain"}
I0423 12:13:59.129497 8 nodeserver.go:103] NodeUnpublishVolume: unmounting volume SMBHOST/SMBSHARE#subdir#docker-volume-name#retain on /data/published/ivqh8gwq1911ol4m0tylqf3fw
I0423 12:13:59.129538 8 utils.go:83] GRPC response: {}
```
which I find hard to make sense of. Essentially, it seems that Docker treats these tasks and their volume mounts very differently depending on whether they are `mode: global` or not.

I'm not going to dig further into this, because the way out of it for me right now is to not use `mode: global`. But maybe this issue will be useful information for someone else in the future.
Hi, again thanks for pioneering this functionality! I'm currently testing the nfs (and smb) drivers. What I've realised is that while e.g. `deploy: {replicas: 3}` in the service definition works fine (for any number), `deploy: {mode: global}` causes all tasks except one to be "Pending", and they never start.

I don't know if it matters, but on this cluster the driver is only installed on a selection of nodes (all of the manager nodes, though). I've also limited the service with placement constraints to run only on those nodes, so I feel like it shouldn't be an issue, but maybe you know better. A sketch of the setup is below.

So anyway, it's not a huge deal, because where we use `mode: global` we probably shouldn't anyway. Still, I'd like to hear if you know why this happens and whether anything can be done about it when using the same shared volumes (besides not using that deploy mode!).
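For concreteness, a minimal sketch of the kind of service definition described above: a single service in `mode: global`, constrained to the nodes that have the plugin, mounting a pre-created CSI cluster volume. The volume name, label, and image are placeholders, and `type: cluster` mounts assume a Docker release with Swarm CSI support:

```yaml
services:
  smb-consumer:                      # hypothetical service name
    image: busybox
    command: sleep infinity
    volumes:
      - type: cluster
        source: docker-volume-name   # CSI cluster volume created beforehand
        target: /data
    deploy:
      mode: global                   # the variant whose tasks stay "Pending"
      placement:
        constraints:
          - node.labels.csi == true  # restrict to nodes where the plugin is installed

volumes:
  docker-volume-name:
    external: true                   # created out of band via the CSI plugin
```

With `deploy: {replicas: N}` in place of `mode: global`, the same file starts its tasks normally.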