Closed: gregory112 closed this issue 2 years ago.
...perhaps in the short term we could have a way to "force" detach volumes? I'm racking up quite a number of them :(
@dcarbone - check out the `nomad volume deregister -force` and `nomad volume create` flow, it's what solves my problems.
FWIW, a very reliable way of triggering this problem is to try to update a job in-place. Update a `template` stanza, or one of the `service.tags` values. When I try to do this with Terraform, there is around a 1/3 chance that it'll cause at least one of the volumes to become "stuck".
Thanks @dcarbone. This is particularly interesting because the client-side hook doesn't run at all when we do an in-place update of the allocation (which is a different kind of bug on its own, and why I opened https://github.com/hashicorp/nomad/issues/11786). If the allocation isn't being rescheduled I wouldn't expect the CSI claims to change, so that helps narrow down where the behavior is going wrong. Or adds a new place where it's going wrong.
Ok, I've been away for a couple days but I wanted to give a status update on today's investigation.
I discovered today that our Controller RPC for `CreateVolume` doesn't seem to be working correctly with `democratic-csi`, or at least not with NFS. The directory on the NFS server isn't being created as I'd expect, so the mount fails. This was the cause of the alloc failure in https://github.com/hashicorp/nomad/issues/10927#issuecomment-1013467062. I'll need to debug that as well and see whether it's at our side or the plugin's side (probably ours, tbh). But that's not strictly related to the issue here; it just happened to be what caused the initial alloc failure.
I tried to reproduce in-place updates causing the issue as @dcarbone reported, but I haven't been able to reproduce that for anything that's truly an in-place update (ex. service name update) and doesn't cause a reschedule.
There's a method in the state store code, `CSIVolumeDenormalizeTxn`, where we "denormalize" allocations for a volume by querying their current state. We backfill claims there for backwards compatibility with pre-1.1.0 state, but I noticed we're not taking advantage of that moment to ensure that any nil allocations are in the `PastClaims` list as well.
I was able to get an interesting reproduction where a `single-node-writer` job was stopped + purged, and then a new job claimed the same volume in `multi-node-multi-writer` mode (with 2 allocs). This resulted in the following:
```
$ nomad volume status csi-volume-nfs0
ID                   = csi-volume-nfs0
Name                 = csi-volume-nfs0
External ID          = csi-volume-nfs0
Plugin ID            = org.democratic-csi.nfs
Provider             = org.democratic-csi.nfs
Version              = 1.4.3
Schedulable          = true
Controllers Healthy  = 2
Controllers Expected = 2
Nodes Healthy        = 3
Nodes Expected       = 3
Access Mode          = single-node-writer
Attachment Mode      = file-system
Mount Options        = <none>
Namespace            = default

Allocations
ID                                    Node ID                               Task Group  Version  Desired  Status    Created    Modified
399b8755-a1c5-2c2b-8d69-3e635947f154  ef15552c-6a06-bf35-4e6b-4cb37b624e08  web         7        stop     complete  8m47s ago  6m23s ago
93a9b426-ce6b-224c-297d-f9f0e9dc02bf  1d682e06-3776-55cb-38b6-8d53a83cb336  web         0        run      running   2m48s ago  2m37s ago
4e4fdd6e-1202-31f9-bff4-6e377672f106  b05f0de6-7cbd-dc6f-3335-b83ad15b1696  web         0        run      running   2m48s ago  2m37s ago
```
Note that the access and attachment mode are stuck on the old mode, and we see the old allocation floating around, which means the stopped allocation was being held as a current write claim.
I then drained one of the nodes running an allocation, intentionally neglecting to use `-ignore-system`. As we'd expect (ref https://github.com/hashicorp/nomad/issues/11614), the unmounting step failed because there was no node plugin available.
The job status was very interesting after that:
```
$ nomad job status httpd
...

Placement Failure
Task Group "web":
  * Class "vagrant": 2 nodes excluded by filter
  * Constraint "CSI volume csi-volume-nfs0 has exhausted its available writer claims": 2 nodes excluded by filter

Latest Deployment
ID          = f698e0f7
Status      = successful
Description = Deployment completed successfully

Deployed
Task Group  Desired  Placed  Healthy  Unhealthy  Progress Deadline
web         2        2       2        0          2022-01-19T16:49:45Z

Allocations
ID        Node ID   Task Group  Version  Desired  Status    Created    Modified
4e4fdd6e  b05f0de6  web         0        stop     complete  2h28m ago  19s ago
93a9b426  1d682e06  web         0        run      running   2h28m ago  2h28m ago
```
The scheduler should only be worried about `multi-node-multi-writer` claims, for which we have no limits. That led me to this block of code in the scheduler at `feasible.go#L326`:
```go
if !vol.WriteFreeClaims() {
	// Check the blocking allocations to see if they belong to this job
	for id := range vol.WriteAllocs {
		a, err := c.ctx.State().AllocByID(ws, id)
		if err != nil || a == nil ||
			a.Namespace != c.namespace || a.JobID != c.jobID {
			return false, fmt.Sprintf(
				FilterConstraintCSIVolumeInUseTemplate, vol.ID)
		}
	}
}
```
The `WriteFreeClaims` method assumes that we only ever have one mode set, which isn't the case with our stuck claim. But then the check to make sure that this job is not blocking its own claims is hitting that `a == nil` condition and failing.
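To make the per-mode problem concrete, here's a minimal sketch. The types and function below are illustrative stand-ins, not Nomad's actual structs: if each write claim carries the access mode it was made under, a free-claims check has to consult every claim's mode rather than a single mode field on the volume, otherwise a stale `single-node-writer` claim silently blocks a volume that is now `multi-node-multi-writer`.

```go
package main

import "fmt"

// CSIVolumeAccessMode is a simplified stand-in for the CSI access modes.
type CSIVolumeAccessMode string

const (
	SingleNodeWriter     CSIVolumeAccessMode = "single-node-writer"
	MultiNodeMultiWriter CSIVolumeAccessMode = "multi-node-multi-writer"
)

// claim is a hypothetical simplification of a volume write claim: each
// claim records the access mode it was made under.
type claim struct {
	AllocID string
	Mode    CSIVolumeAccessMode
}

// writeFreeClaims sketches a per-claim check: a new write claim is only
// free if every existing write claim permits sharing. A leftover
// single-node-writer claim exhausts the volume even when the volume's
// current mode allows multiple writers.
func writeFreeClaims(writeClaims []claim) bool {
	for _, c := range writeClaims {
		if c.Mode != MultiNodeMultiWriter {
			return false // any non-shared write claim blocks new writers
		}
	}
	return true
}

func main() {
	stuck := []claim{
		{AllocID: "399b8755", Mode: SingleNodeWriter}, // the stale claim
		{AllocID: "93a9b426", Mode: MultiNodeMultiWriter},
	}
	fmt.Println(writeFreeClaims(stuck)) // false: the stale claim blocks
}
```

This mirrors the reproduction above: the stopped allocation's `single-node-writer` claim is still in the write-claims set, so the check fails even though the current allocations share the volume happily.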
The raft state for these claims looks like this:
```
{
  "AllocationID": "399b8755-a1c5-2c2b-8d69-3e635947f154",
  "NodeID": "ef15552c-6a06-bf35-4e6b-4cb37b624e08",
  "ExternalNodeID": "",
  "Mode": 1,
  "AccessMode": "single-node-writer",
  "AttachmentMode": "file-system",
  "State": 0
}
{
  "AllocationID": "4e4fdd6e-1202-31f9-bff4-6e377672f106",
  "NodeID": "b05f0de6-7cbd-dc6f-3335-b83ad15b1696",
  "ExternalNodeID": "",
  "Mode": 1,
  "AccessMode": "multi-node-multi-writer",
  "AttachmentMode": "file-system",
  "State": 0
}
{
  "AllocationID": "93a9b426-ce6b-224c-297d-f9f0e9dc02bf",
  "NodeID": "1d682e06-3776-55cb-38b6-8d53a83cb336",
  "ExternalNodeID": "",
  "Mode": 1,
  "AccessMode": "multi-node-multi-writer",
  "AttachmentMode": "file-system",
  "State": 0
}
```
My next steps are:

- Fix `CSIVolumeDenormalizeTxn` to ensure nil allocations are in the `PastClaims` list.
- Ensure `PastClaims` aren't blocking write claims.

Very appreciative of the thorough writeup keeping us in the loop, thank you!
Repeated messages are frowned upon on GH, but I must thank you too, for both looking into this and keeping us in the loop. I, and probably most here, really appreciate it. Thanks again!
Excellent, thank you so much @tgross for the updates, it's extremely helpful to see the progress on this.
I've built out a test rig that recreates the state I captured, and I've written a couple of tests that pass it through the `volumewatcher` and the GC job. And fortunately these tests fail with exactly the same behavior I witnessed yesterday.
I've opened https://github.com/hashicorp/nomad/pull/11890 with a draft approach. If a volume has any allocations that are nil, create a "past claim" on it when we read the volume. This approach recognizes that we may never be able to fix all causes of desync between volumes and claims, because a node could potentially just up and disappear on us. This way trying to catch events from the clients is a matter of optimization and not correctness.
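The idea in that draft can be illustrated with a rough sketch. All names here (`volume`, `volClaim`, `denormalize`, `allocExists`) are invented stand-ins for illustration, not the real state-store API: whenever a volume is read, any write claim whose allocation lookup comes back nil gets moved into the past-claims set in an unpublishing state, so the volumewatcher can reap it regardless of how the desync happened.

```go
package main

import "fmt"

type claimState int

const (
	claimActive claimState = iota
	claimUnpublishing
)

// volClaim is a hypothetical simplification of a volume claim.
type volClaim struct {
	AllocID string
	State   claimState
}

// volume holds current write claims plus past claims awaiting cleanup.
type volume struct {
	WriteClaims map[string]*volClaim
	PastClaims  map[string]*volClaim
}

// denormalize sketches the PR #11890 approach: on every read of the
// volume, claims whose allocation no longer exists are marked as
// unpublishing and moved to PastClaims. allocExists stands in for the
// state-store allocation lookup.
func denormalize(v *volume, allocExists func(id string) bool) {
	if v.PastClaims == nil {
		v.PastClaims = map[string]*volClaim{}
	}
	for id, c := range v.WriteClaims {
		if !allocExists(id) {
			c.State = claimUnpublishing
			v.PastClaims[id] = c
			delete(v.WriteClaims, id)
		}
	}
}

func main() {
	v := &volume{WriteClaims: map[string]*volClaim{
		"gone":  {AllocID: "gone"},  // alloc was GC'd out from under us
		"alive": {AllocID: "alive"}, // alloc still exists
	}}
	live := map[string]bool{"alive": true}
	denormalize(v, func(id string) bool { return live[id] })
	fmt.Println(len(v.WriteClaims), len(v.PastClaims)) // 1 1
}
```

Doing this at read time is what makes client-event handling an optimization rather than a correctness requirement: even if a node vanishes without ever reporting anything, the stale claim is reaped on the next read.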
In manual testing of that patch, I discovered that we were not resetting the `volumewatcher`'s ACL token on leadership transitions. That means that even once we fix the claim state, the unpublish workflow will eventually start failing unrecoverably with errors like:
```
2022-01-20T19:13:22.661Z [DEBUG] core.sched: forced volume claim GC
2022-01-20T19:13:22.662Z [DEBUG] core.sched: CSI volume claim GC scanning before cutoff index: index=18446744073709551615 csi_volume_claim_gc_threshold=5m0s
2022-01-20T19:13:22.664Z [ERROR] nomad.volumes_watcher: error releasing volume claims: namespace=default volume_id=csi-volume-nfs0
  error=
  | 1 error occurred:
  |     * Permission denied
  |
```
Or, if you detach manually, the error is particularly confusing because you might think it's talking about your ACL token:
```
2022-01-20T19:12:57.559Z [ERROR] nomad.volumes_watcher: error releasing volume claims: namespace=default volume_id=csi-volume-nfs0
  error=
  | 1 error occurred:
  |     * Permission denied
  |
```
Either case leaves us in a state where volume claims have been released but the volumes are still physically mounted.
I've opened https://github.com/hashicorp/nomad/pull/11891 with a fix for that.
Next steps for https://github.com/hashicorp/nomad/pull/11890:

- Make the `CSIVolumeDenormalize` method private and force all queries of volumes to hit it at the state store level, so that we can't ever miss doing so and end up playing whack-a-mole with this bug all over the code base.

I mentioned yesterday the issue I saw with `CreateVolume` and the `democratic-csi` plugin. I've opened https://github.com/hashicorp/nomad/issues/11893 to investigate that separately.
Also, yesterday I mentioned:
> I've also got a patch going to do client-driven node unpublish and another moving claims from the client side to the plan applier, but that's probably not going to be ready till next week.
I made a brief attempt at moving claims into the plan applier, but it turned out to be really awful, so that doesn't seem like the right approach after all. But the client-driven node unpublish looks feasible. I have a work-in-progress patch for that at https://github.com/hashicorp/nomad/pull/11892 but it doesn't quite work because the unpublish workflow doesn't tolerate it yet.
I'll be away tomorrow but will be picking all this up again on Monday.
I've spent much of today testing out the patch in https://github.com/hashicorp/nomad/pull/11890. In my testing I have yet to encounter this issue again once I've applied that patch (and https://github.com/hashicorp/nomad/pull/11891) to the cluster.
However, I'm still seeing volumes left attached unexpectedly as I described in https://github.com/hashicorp/nomad/issues/10927#issuecomment-1017920830. The `nomad volume detach` command doesn't seem to be working either, so my next step is to figure out whether this patch breaks the unpublishing workflow unexpectedly or whether that's a separate problem.
I still haven't figured out what's going on with the access mode when releasing claims. It should be set to `<none>` so that the next claim can update it, but it's not. At this point I'm reasonably confident it's not tied to this problem, though, so I'm going to split that out to a new issue: https://github.com/hashicorp/nomad/issues/11921
An interesting test case I encountered was purging a job and then immediately re-running the job. I was able to produce a case where the evaluation blocked because the allocations that were claiming writes had not yet unpublished the volume. But by the time the blocked eval was evaluated, the volume was free for scheduling and everything worked as expected without intervention. It looked like this:
```
$ nomad job stop -purge httpd
==> 2022-01-24T13:32:36-05:00: Monitoring evaluation "fbf2c487"
    2022-01-24T13:32:36-05:00: Evaluation triggered by job "httpd"
==> 2022-01-24T13:32:37-05:00: Monitoring evaluation "fbf2c487"
    2022-01-24T13:32:37-05:00: Evaluation within deployment: "cbe85da5"
    2022-01-24T13:32:37-05:00: Evaluation status changed: "pending" -> "complete"
==> 2022-01-24T13:32:37-05:00: Evaluation "fbf2c487" finished with status "complete"
==> 2022-01-24T13:32:37-05:00: Monitoring deployment "cbe85da5"
  ✓ Deployment "cbe85da5" successful

    2022-01-24T13:32:37-05:00
    ID          = cbe85da5
    Job ID      = httpd
    Job Version = 2
    Status      = successful
    Description = Deployment completed successfully

    Deployed
    Task Group  Desired  Placed  Healthy  Unhealthy  Progress Deadline
    web         2        2       2        0          2022-01-24T18:33:01Z

$ nomad job run ./jobs/csi/httpd.nomad
==> 2022-01-24T13:32:40-05:00: Monitoring evaluation "faae38d5"
    2022-01-24T13:32:40-05:00: Evaluation triggered by job "httpd"
==> 2022-01-24T13:32:41-05:00: Monitoring evaluation "faae38d5"
    2022-01-24T13:32:41-05:00: Evaluation within deployment: "5fcc4c00"
    2022-01-24T13:32:41-05:00: Evaluation status changed: "pending" -> "complete"
==> 2022-01-24T13:32:41-05:00: Evaluation "faae38d5" finished with status "complete" but failed to place all allocations:
    2022-01-24T13:32:41-05:00: Task Group "web" (failed to place 2 allocations):
      * Class "vagrant": 3 nodes excluded by filter
      * Constraint "CSI volume csi-volume-nfs0 has exhausted its available writer claims": 3 nodes excluded by filter
    2022-01-24T13:32:41-05:00: Evaluation "aa76c24a" waiting for additional capacity to place remainder
==> 2022-01-24T13:32:41-05:00: Monitoring deployment "5fcc4c00"
  ✓ Deployment "5fcc4c00" successful

    2022-01-24T13:33:04-05:00
    ID          = 5fcc4c00
    Job ID      = httpd
    Job Version = 0
    Status      = successful
    Description = Deployment completed successfully

    Deployed
    Task Group  Desired  Placed  Healthy  Unhealthy  Progress Deadline
    web         2        2       2        0          2022-01-24T18:42:52Z
```
This is actually working as intended, but the error message is confusing here. So I've added a commit onto https://github.com/hashicorp/nomad/pull/11890 that will specifically call out "hey this is a GC'd allocation and it should resolve itself momentarily".
Quick follow-up on this:
> However, I'm still seeing volumes left attached unexpectedly as I described in #10927 (comment). The `nomad volume detach` command doesn't seem to be working either, so my next step is to figure out if this patch breaks the unpublishing workflow unexpectedly or whether that's a separate problem.
There are some easy ways to end up with stray mounts if you node drain without `-ignore-system`, but (a) we're going to fix that separately, and (b) once I restored the node plugin I was always able to `nomad volume detach` in that case.
But I was able to eventually reproduce this by repeatedly purging a job and immediately re-running it with the access mode changed. The goal was to make it so that publishing was happening at roughly the same time we were trying to detach the volume. This ended up with a client that had the mount for a previous version of the job, but no alloc:
```
$ mount | grep nfs
192.168.56.69:/var/nfs/general/v/csi-volume-nfs0 on /var/nomad/data/client/csi/node/org.democratic-csi.nfs/staging/csi-volume-nfs0/rw-file-system-single-node-writer type nfs4 (rw,noatime,vers=4.2,rsize=65536,wsize=65536,namlen=255,hard,proto=tcp,timeo=600,retrans=2,sec=sys,clientaddr=192.168.56.2,local_lock=none,addr=192.168.56.69)
192.168.56.69:/var/nfs/general/v/csi-volume-nfs0 on /var/nomad/data/client/csi/node/org.democratic-csi.nfs/per-alloc/3f9bfe0b-0541-b83d-fd8b-a2b16ddcc318/csi-volume-nfs0/rw-file-system-single-node-writer type nfs4 (rw,noatime,vers=4.2,rsize=65536,wsize=65536,namlen=255,hard,proto=tcp,timeo=600,retrans=2,sec=sys,clientaddr=192.168.56.2,local_lock=none,addr=192.168.56.69)
```
Here are the logs for the `NodePublishVolume` and `NodeUnpublishVolume` workflows on the node plugin:
What we can see here is that the volume was mounted at `rw-file-system-single-node-writer`, but when we're unpublishing, we're doing so at `rw-file-system-multi-node-multi-writer`. So we'll never be able to unmount this, because we're trying to unmount it at the wrong location. This is also why trying to detach it via `nomad volume detach` results in a silent failure: we see the log for the HTTP request and then nothing else. We just aren't logging this code path.
So there's a race where a change in the access mode can land on the client during unpublish. There's a few obvious places that bug can be happening, so I'll pick that up again tomorrow.
Ok, the race described above turns out to be pretty straightforward. In the client's `(*csiHook) Postrun()` method, we make an unpublish RPC that includes a claim in the `CSIVolumeClaimStateUnpublishing` state, using the mode from the client. But then in the `(*CSIVolume) Unpublish` RPC handler, we query the volume from the state store (because we only get an ID from the client). And when we make the client RPC for the node unpublish step, we use the current volume's view of the mode (ref `nodeUnpublishVolumeImpl`). If the volume's mode has been changed before the old allocations can have their claims released, then we end up making a CSI RPC that will never succeed.
Why does this code path get the mode from the volume and not the claim? Because the claim written by the GC job in `(*CoreScheduler) csiVolumeClaimGC` doesn't have a mode. Instead it just writes a claim in the unpublishing state to ensure the volumewatcher detects a "past claim" change and reaps all the claims on the volumes.
- Fix `nodeUnpublishVolumeImpl` to always use the claim's mode (maybe falling back to the volume's mode if unset?).
- Audit uses of `vol.AccessMode` to ensure we're not making this mistake elsewhere.

I've done some testing of https://github.com/hashicorp/nomad/pull/11892 and that's in good shape as well, and I've walked some teammates through the rest of the PRs so that we can get them reviewed and merged one by one. Unfortunately I've run into a non-CSI issue around the `stop_after_client_disconnect` flag (https://github.com/hashicorp/nomad/issues/11943) which is going to impact a lot of CSI use cases (we created that flag specifically for stateful workloads). But we should be able to ship an incremental patch release with a bunch of the work done so far. Stay tuned for updates on when that may land as the PRs get merged.
The 4 patches for this issue (which also covers https://github.com/hashicorp/nomad/issues/10052, https://github.com/hashicorp/nomad/issues/10833, and https://github.com/hashicorp/nomad/issues/8734) have been merged to `main`. We're working up a plan for a patch release shortly, at which point I'll close out this issue. Note this won't yet fix the issues in #8609, but it does make that an automatically recoverable situation. I'm going to be focusing heavily on knocking out the rest of the major CSI issues over the next month or two after we've shipped this.
Thanks again for your patience, folks.
We thank you, dude! I can finally recommend Nomad to people without a BUT; I've always wanted to be able to do that.
This is awesome work @tgross! Highly appreciate the transparency and the depth at which you kept the community updated, this thread should become the gold standard for Organization <--> Community interactions!
Nomad 1.2.5 has shipped with the patches described here. We still have a few important CSI issues to close out before we can call CSI GA-ready, but it should be safe to close this issue now. Please let us know if you run into the issue with the current versions of Nomad. Thanks!
@tgross I recently upgraded to v1.2.6 and I'm still hitting the issue.
After a few days of the job running, attempting to deploy ends up with the dreaded "CSI volume ... has exhausted its available writer claims" error, and I'm back to the `nomad volume deregister --force` ritual.
@m1keil the problem isn't entirely eliminated until 1.3.0 (which will get backported). https://github.com/hashicorp/nomad/pull/12113 should cover the remaining cases.
@tgross I have still managed to reproduce with v1.3.0-beta.1. I think this could be related to a firewall issue I had: the node plugins couldn't reach the storage backend anymore, so the mount was still around but the task allocation was long gone. What's the expected behavior here? Maybe the volume allocation should be released after a long timeout has expired?
```
Summary
Task Group  Queued  Starting  Running  Failed  Complete  Lost  Unknown
registry    1       0         0        0       0         0     0

Placement Failure
Task Group "registry":
  * Constraint "CSI volume ceph-registry has exhausted its available writer claims and is claimed by a garbage collected allocation 45c884a0-e629-b180-eb44-c3c7a5d5e7b7; waiting for claim to be released": 3 nodes excluded by filter

Latest Deployment
ID          = 687ebec8
Status      = running
Description = Deployment is running

# nomad volume status ceph-registry
ID                   = ceph-registry
Name                 = ceph-registry
External ID          = 0001-0024-22b9524c-4a2d-4a56-9f3d-221ee05905a2-0000000000000001-af535f2b-9f9a-11ec-bf1c-0242ac110003
Plugin ID            = ceph-csi
Provider             = cephfs.csi.ceph.com
Version              = v3.6.0
Schedulable          = true
Controllers Healthy  = 1
Controllers Expected = 4
Nodes Healthy        = 3
Nodes Expected       = 3
Access Mode          = single-node-writer
Attachment Mode      = file-system
Mount Options        = <none>
Namespace            = default

Allocations
No allocations placed
```
> The node plugins couldn't reach the storage backend anymore and so the mount was still around but the task allocation was long gone. What's the expected behavior here?
A bit more detail would be helpful. Which task allocation do you mean? The plugin, or the allocation that mounted the volume? I would not have expected the allocation that mounted the volume to actually be gone, because we block for a very long time retrying the unmount, unless the client node was also shut down.
Did the node plugin for that node get restored to service? If so, the expectation is that the volumewatcher loop will eventually be able to detach it. Or is the client host itself gone too?
> Maybe the volume allocation should be released after a long timeout has expired?
Unfortunately many storage providers will return errors if we do this. For example, imagine an AWS EBS volume attached to an EC2 instance where the Nomad client is shut down. We can never physically detach the volume until the client is restored, so all attempts to attach it elsewhere will fail unrecoverably. That's why we provide the `nomad volume detach` and `nomad volume delete` commands, so that the operator can signal that the volume was detached out-of-band.
@tgross I mean the allocation that mounted the volume. It was gone because of a node restart, and the rescheduled task remained in the stuck state since Thursday, so approx. 4 days.
During the whole time the plugin state was shown as healthy in the ui, so nomad was not detecting that there was no actual connection to the backend. I guess ceph-csi just handles all Probe and GetCapabilities calls locally.
The network issue was eventually fixed, but the volume remained stuck.
> That's why we provide the `nomad volume detach` and `nomad volume delete` commands, so that the operator can signal that the volume was detached out-of-band.
I haven't yet had much success with detach, because for me it normally just blocks and doesn't give out any information. For a case such as the one I described, it is also hard to know which node the volume was attached to previously after the allocation has been GCed. Would it even work if the node has been restarted?
But yes, if such a case can't be handled automatically it would be good if a volume could be unconditionally detached by the operator even if the node has been restarted or is gone altogether. In my case I needed to deregister/reregister the volume again.
Also, the deregister/reregister dance is impossible if the volume has been created with Terraform, as the volume will be deleted automatically if the resource is removed.
> I mean the allocation that mounted the volume. It was gone because of a node restart and the rescheduled task remained in the stuck state since thursday, so approx. 4 days.
>
> During the whole time the plugin state was shown as healthy in the ui, so nomad was not detecting that there was no actual connection to the backend. I guess ceph-csi just handles all Probe and GetCapabilities calls locally.
Ok, so that would also have restarted the Node plugin tasks as well. When the node restarted, did the Node plugin tasks come back? The UI is maybe a little misleading here -- it shows the plugin as healthy if we can schedule volumes with it, not when 100% of plugins are healthy (otherwise we'd block volume operations during plugin deployments).
As for being stuck for 4 days... the evaluation for the rescheduled task hits a retry limit fairly quickly, so I suspect it's not "stuck" so much as "gave up." Once we've got the claim taken care of, re-running the job (or using `nomad job eval`) should force a retry. But let's figure out the claim problem first. 😀
> I haven't yet had much success with detach, because for me it normally just blocks and doesn't give out any information.
It blocks because it's a synchronous operation; it has to make RPC calls to the Node plugin and to the Controller plugin. You should see logs on the Node plugin (first) and then the Controller plugin (second). If both the Node plugin and Controller plugin are live, that should work and you should be able to look at logs in both plugins (and likely the Nomad leader as well) to see what's going on there.
> For a case such as the one I described it is also hard to know which node the volume was attached to previously after the allocation has been GCed. Would it even work if the node has been restarted?
So long as it restarted and has a running Node plugin, yes that should still work just fine. But fair point about knowing which node the volume was attached to previously.
(hit submit too soon 😀 )
> But yes, if such a case can't be handled automatically it would be good if a volume could be unconditionally detached by the operator even if the node has been restarted or is gone altogether.
It's not possible for us to unconditionally detach, but in theory we could unconditionally free a claim. That's what `deregister -force` is designed to do, but yeah, I realize that's not going to work out so well with Terraform.
I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues. If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.
Nomad version
Nomad v1.1.2 (60638a086ef9630e2a9ba1e237e8426192a44244)
Operating system and Environment details
Ubuntu 20.04 LTS
Issue
Cannot re-plan jobs due to CSI volumes being claimed. I have seen many variations of this issue. I don't know how to debug it. I use the ceph-csi plugin to deploy a system job on my two Nomad nodes. This results in two controllers and two ceph-csi nodes. I then create a few volumes using the `nomad volume create` command. I then create a job with three tasks that use three volumes. Sometimes, after a while, the job may fail, and I stop it. After that, when I try to re-plan the exact same job, I get that error.

What confuses me is the warning. It differs every time I run `job plan`. First I saw one warning; then, running `job plan` again a few seconds after, I got another; and then again a different one.
I have three groups: zookeeper1, zookeeper2, and zookeeper3, each using two volumes (data and datalog). I will just assume from this log that all volumes are non-reclaimable.
This is the output of `nomad volume status`. It says that they are schedulable.

This is the output of `nomad volume status zookeeper1-datalog`. It says there are no allocations placed.
Reproduction steps
This is unfortunately flaky, but it most likely happens due to the job failing, then being stopped and then re-planned. This persists even after I purge the job with `nomad job stop -purge`. No, doing `nomad system gc`, `nomad system reconcile summary`, or restarting Nomad does not work.

Expected Result
Should be able to reclaim the volume again without having to detach or `deregister -force` and register again. I created the volumes using `nomad volume create`, so those volumes have their external IDs all generated. There are 6 volumes and 2 nodes; I don't want to type detach 12 times every time this happens (and this happens so frequently).

Actual Result
See error logs above.
Job file (if appropriate)
I have three groups (zookeeper1, zookeeper2, zookeeper3), each having a volume stanza like this (each with their own volumes; this one is for zookeeper2):

All groups have `count = 1`.