Add a proposal for the Gluster CRDs

JohnStrunk commented 5 years ago

Describe what this PR does This PR provides a proposed set of CRs for the Gluster operator. While this proposal encompasses much more than what an initial implementation will support, my goal is to make room the features we know we will need to support eventually so we can incorporate them over time, maintaining compatibility of the CRD.

Is there anything that requires special attention?

The CRDs described here will tangentially interact with the StorageClass definition, currently open for review in gluster/gluster-csi-driver#30.
Though the main goal is moving toward GCS, I have attempted to incorporate sufficient information that this same CRD could be used to deploy a Heketi-based cluster.

While I expect this proposal to be fairly contentious, please try to offer concrete suggestions for improvement along with the criticisms. :pray:

Related issues: Closes #3

JohnStrunk commented 5 years ago

Folks who should take a look:

Operator/cns: @jarrpa @humblec @SaravanaStorageNetwork @obnoxxx @phlogistonjohn
CSI: @humblec @Madhu-1
GD2: @atinmu @aravindavk
GCS: @kshlm @amarts @ShyamsundarR @sabose ... and I'm sure I miss a bunch of folks.

jarrpa commented 5 years ago

An off-hand remark to start after my first skimming: While "quiesce" is a technically appropriate term, I think it's a little too uncommon and hard to quickly understand. I think either of "inactive" or "disabled" would be better.

JohnStrunk commented 5 years ago

Here's something I'd like opinions on: Currently, there is both a Cluster CR and a Zone CR that are mandatory. In many (most?) environments, the number of Zones would be small (1-3). Should we move the information of the Zone CR into the Cluster CR as a list? Imagine adding the following to the spec section of the Cluster CR (taking the place of 3x Zone CRs in the AWS multi-az example):

zones:
  - name: az1
    management:
      mode: automatic
    automatic:
      minNodes: 3
      maxNodes: 99
      maxStorage: 100Ti
      freeStorageMin: 500Gi
      freeStorageMax: 2Ti
    storage:
      mode: volume
      volume:
        size: 1Ti
        storageClass: ebs-az1
  - name: az2
    management:
      mode: automatic
    automatic:
      minNodes: 3
      maxNodes: 99
      maxStorage: 100Ti
      freeStorageMin: 500Gi
      freeStorageMax: 2Ti
    storage:
      mode: volume
      volume:
        size: 1Ti
        storageClass: ebs-az2
  - name: az3
    management:
      mode: automatic
    automatic:
      minNodes: 3
      maxNodes: 99
      maxStorage: 100Ti
      freeStorageMin: 500Gi
      freeStorageMax: 2Ti
    storage:
      mode: volume
      volume:
        size: 1Ti
        storageClass: ebs-az3

Whatever would have gone into the status of the Zone CR could be placed into a corresponding list in the Cluster CR's status.

The benefit to the above change would be that the admin need only create a single CR to get a cluster up and running (or show their configuration). The down-side is that the CR is bigger w/ more fiddly indentation.

jarrpa commented 5 years ago

@JohnStrunk I say the ease of having only one CR is worth the increased length. In particular, I see the majority of users only having one Zone anyway.

phlogistonjohn commented 5 years ago

First off, this is a really awesome start! :-)

I have a couple questions based on my first read-through:

One thing I expected to see but didn't was a mechanism to specify what images should be used. In other words, something that the operator would use to select what image to use for initial deployments and/or upgrades in the future. Understandably, we may not be doing upgrades right away but I think automatic mode would need something to key off of in terms of images & tags.
I am quite fuzzy on the "manual" mode. To me the name makes it sound like an alternative to external, but that's not the way it appears to function. Could you clarify what this mode is supposed to do, please?

(Edit: The next two points I wrote overlap a bit with what @JohnStrunk and @jarrpa were discussing above. Apologies for any redundancy, but rather than interject into that subthread I'll leave what I wrote intact but admit that I didn't read through all that when I wrote my comments)

There seems to be a bit of asymmetry between the node CRD and the node references in the "external" mode section. Perhaps I'm misreading things but it sounds like the external config would be statically defined in the spec and the operator would not create node CRs. Regardless of that, I wonder if there wouldn't be an advantage to the admin (manually) creating & editing node CRs for the external case. This way all node configuration is done at the same "level" in the CRD hierarchy you lined out at the top of the doc. One disadvantage of my idea is that it might be tedious for initial deployment with the need to type out the yaml for each and every node... but perhaps that could be solved by having some template support there.
Currently, you have put the storage section at the same level as "management" and the intent appears to have this section apply to all items in the zone. My concern with this is that the current system defines the devices at the level of the node. This approach is preferred if you can create cookie-cutter nodes, but I wonder if we shouldn't also have a per-node or even multiple named storage subsections. I think that a situation like this could easily arise in a bare-metal on-prem situation like, "I bought some servers last year and they came with N disks" then "I expanded my cluster this year and the new servers have M disks and P ssds".

To expand on my "multiple named subsections Idea above". I imagine that we could define the storage section with something like:

  storage:
    # the default storage policy
    default:
      mode: device
      device: 
        - /dev/sdb
        - /dev/sdc
    # storage policy to be used on oldservers (TBD how oldservers are matched with the pods)
    oldservers:
      mode: device
      device:
        - /dev/sdc
        - /dev/sdd
        - /dev/sde

If we wanted to do per node storage we could have a key like "storageOverride".

jarrpa commented 5 years ago

@phlogistonjohn I would modify your proposal to have spec.storage be a list, and each item in storage have a name key to identify. Then you could have spec.zone.storage be a list of names of storage that apply to that zone.

More broadly, to account for legacy scenarios where nodes have asymmetric storage topology, we could go as far as just having one Zone per node such that we can specify exact devices for each node. It's an abuse of the intent of the Zone, but one that's technically feasible given that Zones are (to the Operator) just logical distinctions and have no technical underpinnings (as far as Gluster itself is concerned).

phlogistonjohn commented 5 years ago

I would modify your proposal to have spec.storage be a list, and each item in storage have a name key to identify. Then you could have spec.zone.storage be a list of names of storage that apply to that zone.

I understand using a list, it's six of one, half dozen of the other for me. I'm not sure what the 2nd sentence is getting at though. Could you provide an example of what you mean?

More broadly, to account for legacy scenarios where nodes have asymmetric storage topology, we could go as far as just having one Zone per node such that we can specify exact devices for each node. It's an abuse of the intent of the Zone, but one that's technically feasible given that Zones are (to the Operator) just logical distinctions and have no technical underpinnings (as far as Gluster itself is concerned).

That sounds a bit too hacky / clever to me. If I really thought asymmetry in the storage devices of the nodes really would be a legacy thing you might be able to convince me. However, I really do think these sorts of asymmetries will occur naturally in the real world as clusters age & grow. If we need to have a strict matching of device layout to abstraction, it seems that maybe we need an additional level of abstraction between node and zone. This may not need it's own CRD but it would prevent us from overloading the concept of zone beyond the natural understanding that users will have (roughly an Availablity Zone) IMO.

jarrpa commented 5 years ago

I understand using a list, it's six of one, half dozen of the other for me. I'm not sure what the 2nd sentence is getting at though. Could you provide an example of what you mean?

A list would be more graceful in implementation, IMO. :) As far as the second sentence, think like volumes and volumeMounts in kube Pods. To put the two together, something like this:

spec:
  management:
    mode: external
    external:
      manual:
        nodes: 2
        storage:
        - storage1
  storage:
  - name: storage1
    mode: device
    device: 
      - /dev/sdb
      - /dev/sdc
  - name: storage2
    mode: device
    device:
      - /dev/sdc
      - /dev/sdd
      - /dev/sde

More broadly, to account for legacy scenarios where nodes have asymmetric storage topology, we could go as far as just having one Zone per node such that we can specify exact devices for each node. It's an abuse of the intent of the Zone, but one that's technically feasible given that Zones are (to the Operator) just logical distinctions and have no technical underpinnings (as far as Gluster itself is concerned).

That sounds a bit too hacky / clever to me. If I really thought asymmetry in the storage devices of the nodes really would be a legacy thing you might be able to convince me. However, I really do think these sorts of asymmetries will occur naturally in the real world as clusters age & grow. If we need to have a strict matching of device layout to abstraction, it seems that maybe we need an additional level of abstraction between node and zone. This may not need it's own CRD but it would prevent us from overloading the concept of zone beyond the natural understanding that users will have (roughly an Availablity Zone) IMO.

In that case, we could expand the Zone spec to allow multiple entries in spec.management. To expand on the above:

spec:
  management:
  - mode: external
    external:
      manual:
        nodes: 2
        storage:
        - storage1
  - mode: automatic
    automatic:
      manual:
        minNodes: 2
        maxNodes: 99
        storage:
        - storage2
  storage:
  - name: storage1
    mode: device
    device: 
      - /dev/sdb
      - /dev/sdc
  - name: storage2
    mode: volume
    volume:
      size: 500Gi
      storageClassName: my-az1-class

I'm not sure how likely it is to have mixed-management-mode clusters, but it should be technically feasible. More likely, as you said, you'd only use this for external clusters that grow over time.

JohnStrunk commented 5 years ago

Following up on some of the questions/discussion so far...

Since my question about merging the Zone w/ the Cluster got a positive response, I'm going to write up and push that change.

@phlogistonjohn You asked about automatic/manual/external... (I think you figured it out, so this is more for the benefit of others following along) The idea is that there are 2 way to run: in the kube cluster as pods or connecting to an external gluster cluster. "automatic" and "manual" would be for running as pods, and the defference between these is just that in auto mode, the operator can create & destroy nodes at will, whereas in manual mode, it maintains a fixed number of nodes in that zone. External mode is for using an external cluster. In that case, you have to tell the operator where to find the nodes, and you also have to manage them yourself (create/destroy/upgrade). Perhaps a better naming scheme would be: dynamic, fixed, external?

I, too, have a dislike for the asymmetry of the "external" nodes in the Zone. I'm trying to come up with a way to push the device list down to the Node CR (so we can use device UUIDs), and convert what is now at the Zone level into a "template" (as that's what it actually is). Perhaps with some re-arranging, it's possible to make the "external" nodes have the contact info in Node CRs, and have corresponding Zones with a "custom" setting (i.e., child Nodes are created manually and don't follow a template). :thinking:

For storage.mode: device, there's just no safe way to use /dev/sd*. I was trying to figure out a way to, on first start, record the UUID of the device, but there's no guarantee that even on 1st start the devices are what the admin thinks they are. So, I'm starting to be rather pessimistic about finding a way to templatize mode=device at all.

phlogistonjohn commented 5 years ago

In that case, we could expand the Zone spec to allow multiple entries in spec.management. To expand on the above:

I think I like this, although I think this "thing" needs some sort of name so we can talk about it. (Not entirely serious suggestions: corrals)

@phlogistonjohn You asked about automatic/manual/external... (I think you figured it out, so this is more for the benefit of others following along)

Actually, I hadn't, so this is great.

The idea is that there are 2 way to run: in the kube cluster as pods or connecting to an external gluster cluster. "automatic" and "manual" would be for running as pods, and the defference between these is just that in auto mode, the operator can create & destroy nodes at will, whereas in manual mode, it maintains a fixed number of nodes in that zone. External mode is for using an external cluster. In that case, you have to tell the operator where to find the nodes, and you also have to manage them yourself (create/destroy/upgrade). Perhaps a better naming scheme would be: dynamic, fixed, external?

I agree a better name would help make the concept clearer. I also think this is something akin to the current approach being taken by the gk-deploy and openshift-ansible approaches, no? If that's so, the fixed count threw me a little bit. IIRC the current style is based on node labeling. If I'm not too off base would it make sense to make that counter a minimum value (used to assist with upgrades / reboots) and an optional label? Perhaps a list of starter nodes? As for naming maybe something like: "predefined" or "restricted"?

I, too, have a dislike for the asymmetry of the "external" nodes in the Zone. I'm trying to come up with a way to push the device list down to the Node CR (so we can use device UUIDs), and convert what is now at the Zone level into a "template" (as that's what it actually is). Perhaps with some re-arranging, it's possible to make the "external" nodes have the contact info in Node CRs, and have corresponding Zones with a "custom" setting (i.e., child Nodes are created manually and don't follow a template). thinking

+1 to the sentiment. I think what might be helpful here is to imagine what the user workflow would be. Since I think its not particularly good UX to make someone type a lot of boilerplate we could have a sort of initial template (initialNodes?) buried in the zone, but that always expands out to node CRs and post install we always manage via nodes?

For storage.mode: device, there's just no safe way to use /dev/sd*. I was trying to figure out a way to, on first start, record the UUID of the device, but there's no guarantee that even on 1st start the devices are what the admin thinks they are. So, I'm starting to be rather pessimistic about finding a way to templatize mode=device at all.

Hehe. This mirrors a discussion I have been part of at the heketi repo. I'm convinced that normal people think of their devices as /dev/sdX etc. But, that's not "safe" for long term use. But I think we're relatively safe if we are up front about saying these device nodes are what's used for initial discovery but not long term use. I think it could be sufficient if the operator only uses these devices to feed the next step in the management chain (eg. heketi or gd2) and then perhaps rely on those subsystems to come back to the operator with a stable identifier?

jarrpa commented 5 years ago

I've finally taken the time to do a more thorough review of the proposal. Below are my thoughts as well as some repeat of prior discussion:

Cluster CRD:
- Rename CRD to GlusterCluster
- glusterCA:
- Should probably have name and namespace fields, just to make parsing and readability a little easier.
- Maybe rename to glusterCASecret.
- It should also be optional.
- Per previous discussion, I support the idea of having a spec.zones field in the cluster CRD. The field would be a list of what is currently the Zone CR spec, with an additional spec.zones[i].name field as an identifier.
Zone CR / zones spec:
- Each Zone should have an optional disabled boolean field. If true, all nodes in that Zone should be marked as disabled and the Operator should not perform any actions on those nodes.
- The current set of management types is a little confusing. I like @JohnStrunk's suggested names, but I think we could keep most of the originals and go with dynamic, manual, and external.
- I brought up the idea of having multiple "management" entries in the Zone spec, so that we avoid using the Zone grouping for something other than its intended purpose (deliniating failure domains). Thinking on it further, it may make sense to introduce the concept of "node management groups", and rename the management field to nodeTemplates.
- In fact, we could move the management spec out of spec.zones into spec.nodeTemplates where each nodeTemplate has a name identifier, then have spec.zones[n].nodes which would be a list of node template names.
- We technically don't really need the mode field at all. Speaking from Go, we could take a page from the k8s PersistentVolume spec and just have an ManagementType field that's anonymous in the parent struct, with fields of the format DynamicManagementType `json:dynamic,omitempty` such that we only need to specify the type field. The operator would have to check to make sure that only one management type field is specified, and fail otherwise.
- nodeTemplate should be extended to have a stateVolume field that defines the type of storage that would be used to maintain each node's state. This field would only apply to dynamic and manual management types, and be ignored for external types. stateVolume should have the following fields:
- local: boolean. If True, indicates that a hostPath volume should be used and all other stateVolume fields should be ignored. The Operator would take care of determining the path on the host.
- claimName: A specific PVC name to use.
- claimNamespace: PVC's namespace, optional
- storageClassName: StorageClass for a dynamic PV. Ignored if either local or claimName are used.
- size: Size of state volume. Defaults to 1GiB If no fields are specified, a 1GiB volume would be provisioned from the cluster's default StorageClass.
- The external mode should allow for a name and namespace for the credentials Secret, much like I propose for glusterCA.
- The storage field should be renamed to storageTemplates, with subfields of name, devices, and `volumes.
- The storageTemplates[n].devices list should behave like storage.device and be extended to accept regular expressions of device names.
- The storageTemplates[n].volumes field should be a list of possible PV types. Fields would include:
  - storageClassName: a StorageClass name to provision dynamic PVs.
  - size: size of dynamic PVCs from storageClassName. Defaults to 1GiB
  - minCount: minimum number of dynamic PVs to provision, defaults to 1.
  - maxCount: maximum number of dynamic PVs to provision. If not specified, limit is 1 for manual nodes and -1 (no limit) for dynamic nodes. If no fields are specified, a 1GiB volume would be provisioned from the cluster's default StorageClass.
- If a storageTemplate specifies neither devices nor volumes, the volumes defaults are assumed.
- To reference a storageTemplate, a nodeTemplate should have a storage field, which is a list of storageTemplates. If the storage field is absent, a default storageTemplate is used that behaves with the volumes defaults.

Some examples that put together the above:

Minimalist Cluster CR

apiVersion: "operator.gluster.org/v1alpha1"
kind: GlusterCluster
metadata:
  name: my-cluster
  namespace: gcs
spec:
  zones:
  - name: az1
    nodes:
    - nodes-dynamic
  nodeTemplates:
  - name: nodes-dynamic
    dynamic:
      minNodes: 2
      maxNodes: 99

Complex Cluster CR

apiVersion: "operator.gluster.org/v1alpha1"
kind: GlusterCluster
metadata:
  name: my-cluster
  namespace: gcs
spec:
  clusterOptions:
    "cluster.halo-enabled": "yes"
  glusterCA: # optional
    name: ca-secret
    namespace: gcs # optional, defaults to GlusterCluster's namespace
  zones:
  - name: az1
    nodes:
    - nodes-dynamic
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
          - matchExpressions:
            - key: failure-domain.beta.kubernetes.io/zone
              operator: In
              values:
                - us-west-2a
  - name: az2
    nodes:
    - nodes-dynamic
    disabled: True
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
          - matchExpressions:
            - key: failure-domain.beta.kubernetes.io/zone
              operator: In
              values:
                - us-east-2a
  - name: az3
    nodes:
    - nodes-manual
    - nodes-external
  nodeTemplates:
  - name: nodes-dynamic
    dynamic:
      minNodes: 2
      maxNodes: 99
      storage:
      - storage1
  - name: nodes-manual
    manual:
      nodes: 2
      storage:
      - storage2
    stateVolume:
      storageClassName: manual-state-class
  - name: nodes-external
    external:
      nodes:
      - 1.2.3.4
      - some.node.net
      credentialsSecret: # required
        name: ssh-secret
        namespace: gcs # optional, defaults to the GlusterCluster's namespace
  storageTemplates:
  - name: storage1
    volumes:
    - size: 500Gi
      storageClassName: az1-class
      minCount: 1
      maxCount: 100
  - name: storage2
    devices: 
    - /dev/sdb
    - "/dev/sd[d-f]"

Node CRD:
- CRD should be renamed to GlusterNode.
- Metadata should include the node group, as defined above for the cluster CRD.
- targetState could be renamed to desiredState
- Available states should be active, disabled, and unavailable. Only active and disabled are permitted for spec.desiredState, while all three may be present in status.currentState. unavailable in this case indicates that the Operator can currently not perform management operations on the node but would keep trying until the node comes back, is configured disabled, or is deleted. If the Node CR is marked for deletion but is in an unavailable state, the Operator would simply remove the node from the cluster configuration and allow the resource deletion to succeed.
- desiredState would default first to disabled if the owning cluster's CR's spec.zones[n].disabled field if True, otherwise default to active.
- The spec should allow for node-specific overrides of configurations in the cluster CR, namely spec.storage, spec.nodeAffinity, and spec.credentialsSecret. The overrides would behave differently depending on the node's management mode:
- dynamic: all spec fields are immutable except for spec.desiredState
- manual: all fields are mutable and optional, would override the cluster configs if present
- external: all fields are mutable, with most being optional except for spec.storage
- spec.storage would be a list that behaves as the cluster CRD field spec.storageTemplates, with the following exceptions:
- spec.storage entries would not have a name field.
- spec.storage[n].volumes entries would be extended to recognize claimName and claimNamespace fields that identify a specific PVC for use as a Block volume.

Examples for the above:

Node CR - Dynamic (created by Operator, immutable except for spec.desiredState)

apiVersion: "operator.gluster.org/v1alpha1"
kind: GlusterNode
metadata:
  name: az1-nodes-dynamic-001
  namespace: gcs
  cluster: my-cluster
  zone: az1
  nodeGroup: nodes-dynamic
spec:
  desiredState: active
status:
  currentState: active

Node CR - Manual (created by Operator, mutable by admin)

apiVersion: "operator.gluster.org/v1alpha1"
kind: GlusterNode
metadata:
  name: az1-nodes-manual-001
  namespace: gcs
  cluster: my-cluster
  zone: az3
  nodeGroup: nodes-manual
spec:
  storage:
  - volumes: 
    - claimName: block-pvc
      claimNamespace: gcs # optional, defaults to the GlusterNode's namespace
  desiredState: disabled
status:
  currentState: disabled

Node CR - External (created by admin, mutable by admin)

apiVersion: "operator.gluster.org/v1alpha1"
kind: GlusterNode
metadata:
  name: some.node.net
  namespace: gcs
  cluster: my-cluster
  zone: az3
  nodeGroup: nodes-external
spec:
  credentialsSecret: # required
    name: ssh-secret
    namespace: gcs # optional, defaults to the GlusterNode's namespace
  storage:
  - devices: 
      - /dev/sdb
  desiredState: active
status:
  currentState: unavailable

This all seems somewhat fiddly, but I think it's worth it to try and strike a balance between flexibility and maintenance overhead. If we wanted, we could ship utilities to generate examples inside the Operator container, such that you could run something like kubectl exec <anthill_pod> /gen-cluster-template.sh > cluster.yaml and have something to start from.

I have some more thoughts on cluster/node state, but I'll save those for another comment. :)

JohnStrunk commented 5 years ago

Sorry to have disappeared for so long on this one...

@phlogistonjohn asked:

One thing I expected to see but didn't was a mechanism to specify what images should be used.

Since the operator needs to be able to manage upgrades (potentially a multi-step process walking across versions of server and csi pods), it needs to have a manifest knowing what transitions are legal (and tested). This precludes allowing the admin to specify arbitrary images & tags. The actual info will be a part of the manifest bundled w/ the operator. This is similar to how OLM works.

JohnStrunk commented 5 years ago

@jarrpa The configurability seems to have exploded. :) If I back up a step here, I think we want 3 possible configurations for a gluster pod:

Usage of block PV (desired)
Usage of device in /dev (as stopgap until 1 is everywhere)
External nodes I don't think I want to expose directly specifying the PVCs that would be used in (1). Do you have a usecase in mind where that is important?

JohnStrunk commented 5 years ago

We technically don't really need the mode field at all. Speaking from Go, we could take a page from the k8s PersistentVolume spec and just have an ManagementType field that's anonymous in the parent struct...

If I'm understanding correctly this would be:

management:
  automatic:
    # stuff for auto

or

management:
  manual:
    # stuff for manual

but an error if

management:
  automatic:
    # stuff
  manual:
    # stuff

so the mode: auto|manual becomes superfluous. I like it. :+1:

JohnStrunk commented 5 years ago

I'm convinced that normal people think of their devices as /dev/sdX etc. But, that's not "safe" for long term use. But I think we're relatively safe if we are up front about saying these device nodes are what's used for initial discovery but not long term use. I think it could be sufficient if the operator only uses these devices to feed the next step in the management chain (eg. heketi or gd2) and then perhaps rely on those subsystems to come back to the operator with a stable identifier?

I think this is really dangerous. The CRs hold the authoritative, declarative state. The only (almost) safe thing I can think of would be for the operator to resolve to UUIDs, then overwrite the sdX w/ UUIDs in the CR. Putting it in the status: doesn't fix the problem, nor does trying the preserve a sdX==>UUID mapping someplace. It works if nothing will ever be changed in the future (add or remove), but breaks as soon as changes are allowed. Complicating this is that the operator modifying the admin's settings is frowned upon (and a race).

JohnStrunk commented 5 years ago

So we're talking about splitting the "template" functionality from the "failure domain" specification. Is this really necessary? I feel like it's bringing in added room for misconfiguration. Leaving them tied together means you can't put "dev"-backed pods in the same domain as "pv"-backed or external, but does that really matter?

@phlogistonjohn brought up different templates for different server configs... if we mandate UUID device specification, there aren't really any templates when using /dev, so they could all go in the same zone. Dev-based zones would require manual creation of Node CRs, so they need not be homogeneous, and the operator wouldn't overwrite them.

jarrpa commented 5 years ago

The configurability seems to have exploded. :)

It's a bad habit of mine. ;) See my openshift-ansible work.

If I back up a step here, I think we want 3 possible configurations for a gluster pod:

Usage of block PV (desired)

Usage of device in /dev (as stopgap until 1 is everywhere)

External nodes I don't think I want to expose directly specifying the PVCs that would be used in (1). Do you have a usecase in mind where that is important?

As always, I'm trying to think about on-prem edge cases, because that's where the challenge is going to lie. I'm thinking of situations where 1. We want to use local block PVs, 2. Dynamic provisioning of local block PVs doesn't exist, and 3. The admin wants to ensure that very specific drives are used. We should also consider the possibility that users will come up with their own reasons we can't think of right now for not wanting to use StorageClasses, in whatever management mode. I can kinda guess at the code path for this when we already have the SC-based path, and it doesn't seem like it wouldn't be bad to implement and maintain.

I think this is really dangerous. The CRs hold the authoritative, declarative state. The only (almost) safe thing I can think of would be for the operator to resolve to UUIDs, then overwrite the sdX w/ UUIDs in the CR. Putting it in the status: doesn't fix the problem, nor does trying the preserve a sdX==>UUID mapping someplace. It works if nothing will ever be changed in the future (add or remove), but breaks as soon as changes are allowed. Complicating this is that the operator modifying the admin's settings is frowned upon (and a race).

I imagine this is something the local block provisioner team has hit. How are they dealing with it? Personally I'm fine with letting the admin put in whatever device path they want, so long as we provide a stern warning somewhere in our spec docs that they REALLY SHOULD use UUIDs. Especially for throw-away environments, let 'em have /dev/sdX.

So we're talking about splitting the "template" functionality from the "failure domain" specification. Is this really necessary? I feel like it's bringing in added room for misconfiguration.

I think it makes things easier to understand. It also splits the functionality so that those who aren't interested in configuring failure domains don't have to do anything with them. If no zone is specified, just assume all nodeTemplates are part of a single default zone.

My default stance is always against artificially constraining configurations "for the user's benefit". It is a disservice to competent and power users who want to do potentially interesting things with our technology. We always have the authority to say that whatever they're trying is unsupported and/or won't be fixed. :)

Leaving them tied together means you can't put "dev"-backed pods in the same domain as "pv"-backed or external, but does that really matter?

I don't think that mandating they be split is of any meaningful benefit or importance.

@phlogistonjohn brought up different templates for different server configs... if we mandate UUID device specification, there aren't really any templates when using /dev, so they could all go in the same zone. Dev-based zones would require manual creation of Node CRs, so they need not be homogeneous, and the operator wouldn't overwrite them.

I'm not sure I quite understand the meaning of "server templates" here. Is that different from the concept I presented of just have different node templates, each with their (one and only one) management mode/type? Also, as above, I see no concrete reason to prevent users from using dev letter names so long as we make it clear in all our examples and documentation that UUIDs are safer.

JohnStrunk commented 5 years ago

I imagine this is something the local block provisioner team has hit. How are they dealing with it? Personally I'm fine with letting the admin put in whatever device path they want, so long as we provide a stern warning somewhere in our spec docs that they REALLY SHOULD use UUIDs. Especially for throw-away environments, let 'em have /dev/sdX.

They are going to permit specifying either way.

I went looking for a good explanation of the options and found this: https://wiki.archlinux.org/index.php/persistent_block_device_naming After reading that page, I'm not sure there's a good answer that would work in all cases. Further, the UUID I was hoping to use... I have now learned it's actually pulled from the file system, so it's useless for raw block devices. Even items in by-id such as wwn or ata* aren't guaranteed to be there.

I think our only option is to let them specify whatever path they want and warn them that it must be stable.

jarrpa commented 5 years ago

After reading that page, I'm not sure there's a good answer that would work in all cases. Further, the UUID I was hoping to use... I have now learned it's actually pulled from the file system, so it's useless for raw block devices. Even items in by-id such as wwn or ata* aren't guaranteed to be there.

I think our only option is to let them specify whatever path they want and warn them that it must be stable.

Oof... yeah, all right.

Any chance you can get put together an update to the design doc with the things you're willing to incorporate? Or any chance we can collaborate on a branch somewhere so I can propose commits?

JohnStrunk commented 5 years ago

Any chance you can get put together an update to the design doc with the things you're willing to incorporate? Or any chance we can collaborate on a branch somewhere so I can propose commits?

I'm aiming for tomorrow. I need to get this and the csi driver StorageClass definition done for GCS/0.2.

humblec commented 5 years ago

@JohnStrunk late to the party, but started my review. Will push more, however one high level question I have on our CRs are, how our own topology based CRs play with topology based volume provisioning in Kube/OCP ? Isnt it going to confuse admins?

phlogistonjohn commented 5 years ago

Catching up on this discussion after a while of not having much time to look at it, excuse me if I reiterate anything previously discussed. I tried to read every update but I may not have fully absorbed everything.

After reading that page, I'm not sure there's a good answer that would work in all cases. Further, the UUID I was hoping to use... I have now learned it's actually pulled from the file system, so it's useless for raw block devices. Even items in by-id such as wwn or ata* aren't guaranteed to be there.

Yup. I found the same thing while researching options for a request for similar behavior natively baked into Heketi.

I think our only option is to let them specify whatever path they want and warn them that it must be stable.

Since the operator is not the end of the management chain, and we're going to have things like gd2/heketi/etc in the mix my thought is to simply (strongly) recommend persistent devices, but otherwise treat the given devices as a filter. As in "ok gd2/heketi feel free to absorb these devices into your management scheme" and those subsystems should be expected to map whatever currently is /dev/sdX in a persistent manner.

phlogistonjohn commented 5 years ago

Since the operator needs to be able to manage upgrades (potentially a multi-step process walking across versions of server and csi pods), it needs to have a manifest knowing what transitions are legal (and tested). This precludes allowing the admin to specify arbitrary images & tags. The actual info will be a part of the manifest bundled w/ the operator. This is similar to how OLM works.

OK, I'm fairly ignorant ot OLM. I can buy that strategy for most production scenarios but it strikes me as fairly rigid for developing with. I'm fine with leaving it out of the CRD but when it gets time to coding/implementing I hope that there's some mechanism where I can slot in my custom test build and have that get deployed without having to compile the operator from scratch.

JohnStrunk commented 5 years ago

Since the operator is not the end of the management chain, and we're going to have things like gd2/heketi/etc in the mix my thought is to simply (strongly) recommend persistent devices, but otherwise treat the given devices as a filter. As in "ok gd2/heketi feel free to absorb these devices into your management scheme" and those subsystems should be expected to map whatever currently is /dev/sdX in a persistent manner.

I can't figure out how to make this work. Consider:

Scenario 1:
1. Set CR: use sda
2. low level: sda is currently D1... will use D1
3. Set CR: also use sdb
4. low level: sdb is currently D2... will add D2
5. reboot shuffles mapping: sda=D2, sdb=D1
6. Set CR: remove sdb
7. What does low level do? If they remember sdb was D2 when added, we have other problems...
Scenario 2: (with remembering original mapping)
1. Set CR: use sda
2. low level: sda is D1... will use D1
3. reboot shuffles. sda is now D2 and sdb is D1
4. Problem: Admin wants to add D2, but sda is already in the CR.
Scenario 3: Like 2, but w/o tracking, there is a similar problem on remove.

JohnStrunk commented 5 years ago

Since the operator needs to be able to manage upgrades (potentially a multi-step process walking across versions of server and csi pods), it needs to have a manifest knowing what transitions are legal (and tested). This precludes allowing the admin to specify arbitrary images & tags. The actual info will be a part of the manifest bundled w/ the operator. This is similar to how OLM works.

OK, I'm fairly ignorant ot OLM. I can buy that strategy for most production scenarios but it strikes me as fairly rigid for developing with. I'm fine with leaving it out of the CRD but when it gets time to coding/implementing I hope that there's some mechanism where I can slot in my custom test build and have that get deployed without having to compile the operator from scratch.

My plan here is that there would be a text manifest in the container image. For development, you are free to adjust these however you see fit. A release, however, contains a fixed version of the manifest which points to a very specific set of images that define "version X".

JohnStrunk commented 5 years ago

@JohnStrunk late to the party, but started my review. Will push more, however one high level question I have on our CRs are, how our own topology based CRs play with topology based volume provisioning in Kube/OCP ? Isnt it going to confuse admins?

Unifying would be very cool. Do you have suggestions here?

The topology stuff I've seen seems very cloud provider specific. The current link is that (AWS example)... If you have a multi-AZ cluster, you would want to create 3 "zones" for your Gluster cluster, specifying nodeAffinity such that it would put each zone into a specific AWS AZ, and the StorageClass for that zone's PVs would have their affinities set to grab EBS volumes from the given zone. From my pov, you'd want to name your gluster zone "east-1a" if it's going to be running in AWS AZ "east-1a". I think that would make it fairly straightforward.

JohnStrunk commented 5 years ago

Progress update: We discussed the CRD in the operator meeting this morning.

Current scratch pad: https://gist.github.com/JohnStrunk/63fd6141a25ca959e49b9c99758eeff7
- This has the current draft CRDs & no text. I was having trouble keeping up w/ both.
Changes:
- All secrets have an optional namespace. I personally think their use should be discouraged, but I don't think It'll hurt anything
- We now just have GlusterCluster and GlusterNode
- Zones have been replaced w/ nodeTemplate (list) in the GC, and zone groupings are now just a text label
- The template-based config that used to have manual and automatic has been combined. You can now just set a fixed number of nodes via nodes or you can use the other 4 thresholds to allow autosizing. maxStorage has been removed since it was redundant.
- GlusterNodes can be created either by the operator as a result of the templates or manually by the admin.
- templates must use PVs
- Template-based GNs will have an annotation so we can tell them apart from admin-created. Template-based will have config changes reverted by the operator.
- GN storage is now a list, where items are devices or PVCs. Yes, you can mix/match; you probably shouldn't
- GNs that use device paths must:
  - Ensure the path is stable
  - Apply a nodeAffinity that matches exactly 1 node

I'm going to make another pass through the comments, then try and update the real doc.

JohnStrunk commented 5 years ago

I know you thought it'd never happen. :skull: But I managed to update this PR.

JohnStrunk commented 5 years ago

Anyone else?

gluster / anthill

Add a proposal for the Gluster CRDs #40