operate-first / support

This repo should serve as a central source for users to raise issues/questions/requests for Operate First.

operate first a pulp+python_plugin instance (or more) #176

Closed goern closed 2 years ago

goern commented 3 years ago
Feature: op1st is operating one Pulp instance

Scenario: multi-index pulp
  Given we can deploy Pulp via an Operator and kustomize manifests 
  When pip installs a module from it
  And there are multiple variants of the same package version
  Then pip should separate the multiple variants by multiple index URLs

Scenario: use RBAC on multi-index pulp
  Given I have access to Pulp
  When I publish a module to an index URL
  And when I am not the 'owner' of that index
  Then I should be denied from publishing the module

Scenario: publish module
  Given I am a Tekton pipeline user
  When I publish a module to an index url
  Then I see the module on the simple index

@harshad16 @fridex @tumido this needs refinement

@fridex could you add the Pulp team?

tumido commented 3 years ago

I'm gonna move this around a bit.

fridex commented 3 years ago

> @fridex could you add the Pulp team?

CC @ipanova @fao89 @dralley

Feel free to add others you find relevant. As discussed in the meeting, we would like to deploy pulp with the pulp_python plugin.

fao89 commented 3 years ago

adding @mikedep333 as he is the SME on https://github.com/pulp/pulp-operator

fao89 commented 3 years ago

we currently provide 3 ways of installing pulp:

We have a brief explanation of them here: https://pulpproject.org/installation-introduction/

tumido commented 3 years ago

Ohh, long time no see, Pulp team! :slightly_smiling_face: How are you doing these days? :slightly_smiling_face: Welcome!

I think the pulp-operator serves our purpose the best. I think we'd like to be abstracted from the internals of Pulp as much as possible. If the experience of running an operator in active development in a prod-like environment would benefit the Pulp team, I see that as a plus as well.

mikedep333 commented 3 years ago

Hi @tumido,

We would love for you to adopt pulp-operator.

What internals do you see as important / remaining to be abstracted away?

4n4nd commented 3 years ago

Hey @mikedep333, we do have one other operator (https://github.com/observatorium/operator) deployed which is in active development. We have set up the crds/clusterroles/bindings in a central location here and other required resources in a separate directory like here. You should be able to follow the same structure for setting up the pulp-operator. If you have any suggestions or questions please lmk.

tumido commented 3 years ago

@mikedep333

> What internals do you see as important / remaining to be abstracted away?

I don't think there's anything remaining to be abstracted away in the case of the operator. That's why I prefer it as the solution here. :slightly_smiling_face: I think we may get an idea of what might be improved once we start using it. Right now my comment was directed mostly at a comparison of the 3 methods @fao89 outlined above - the operator abstracts away tons of complexity compared to the other installers and is declarative. And we can appreciate that.

I'm gonna go ahead and start creating a namespace for the operator to live at - and we will automate this as a custom deployment of the operator (custom meaning directly deploying the Deployment resource, creating service account and so on) into this new namespace - similar to the observatorium operator @4n4nd linked above.

I'm also gonna create a new user group for you with full access to this new namespace so you can manage and monitor the operator yourself if you want.

The deployment of the operator will be managed via ArgoCD using the manifests copied/referenced from here: https://github.com/pulp/pulp-operator/tree/main/deploy

Once the operator is available in the community operator hub we can either switch to a deployment from there or keep using a custom "manual" deployment for more rapid dev cycles on it if you want.

fao89 commented 3 years ago

@tumido I'm a noob in the k8s world, I never worked with ArgoCD. I have a "CI knowledge" of pulp-operator, meaning I only used pulp-operator in these cases: https://github.com/pulp/pulp-operator/actions/runs/728145612 But I see a great opportunity for us to improve our docs: https://pulp-operator.readthedocs.io/en/latest/ Let me know how I can help or at least what you are missing from the docs

fridex commented 3 years ago

Just a friendly ping here. What is the current state of this? We are monitoring this work in the package index meeting with the Pulp team. Thanks in advance.

CC @ipanova

tumido commented 3 years ago

Yeah, sorry we had no update on this so far, we've been hammered by a ton of work elsewhere. @fridex

I see the operator hasn't reached OperatorHub yet, but you have a CSV available. It also seems to me that the cluster role/role specified in the direct manifests is not yet prepared for an AllNamespaces role, and if we deploy the operator this way, the scope is limited to its current namespace only. Is that a correct observation?

@fridex do you want to have the operator scoped only to its own namespace or available to multiple namespaces? I assume you'd rather have the operator available globally, is that correct? If so, we either have to change the direct manifests a bit or create our own operator catalog source image and install via CSV.

fridex commented 3 years ago

> @fridex do you want to have the operator scoped only to its own namespace or available to multiple namespaces? I assume you'd rather have the operator available globally, is that correct? If so, we either have to change the direct manifests a bit or create our own operator catalog source image and install via CSV.

Ideally, the operator could be available globally. Short-term, it would be great for us to have just one instance of pulp in one namespace for a selected group of people, small steps could work here. The very first outcome for us is the fact we can run pulp on op1st and can experiment with features it provides to us. The cluster-scope operator can be done in parallel (low priority for us now).

tumido commented 3 years ago

I'm sorry for the constant delays on this. I'm prioritizing this now, and I hope I can get something in place in a few days.

tumido commented 3 years ago

Hey folks, so.. I can offer you 2 options. I think it's up to you to decide which way is more maintainable for you. Note - either of these solutions is temporary. Once you submit your operator to OperatorHub, this model changes - we would consume the operator manifest via subscription from community-operators.

Option 1 - Direct manifests

Implemented in https://github.com/operate-first/apps/pull/663

The Pulp team would need to track changes to all the CustomResourceDefinitions, ClusterRoles.. basically any resource defined in that PR.

  1. You would need to maintain and update all the manifests defined within the cluster-scope/ path in our repos. Basically copy and paste those resources back here if they change in your repos.
  2. The resources within pulp-operator/ path in that PR are transferable and can be deployed from any repo since they are namespace scoped and you already have full control over the pulp-operator namespace.

Option 2 - Install via OLM via a custom catalog

Implemented in https://github.com/operate-first/apps/pull/664

This PR is based on your ClusterServiceVersion and defines a CatalogSource. Right now it points to my image, but the intention is that you own this image and keep the content of the catalog updated: every time you change the CSV in your repos, you also update the catalog image. You can either use your own catalog or base it on the catalog I've created for this purpose.

My custom catalog is available for you, there's even an updater script that will keep the catalog up to date with pulp-operator repository master branch. Once you push an updated catalog image, the rest of the update in cluster happens automatically.

This option is much easier to migrate once you submit your operator to OperatorHub since we would just point the Subscription resource to a different catalog.
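
A minimal sketch of why that migration is small, assuming the usual OLM Subscription layout where the catalog is named by `spec.source` and `spec.sourceNamespace`. The Subscription name, channel, and catalog names below are hypothetical placeholders, not the values used in the PR.

```python
import copy

# Hypothetical Subscription; every name and the channel are placeholders.
subscription = {
    "apiVersion": "operators.coreos.com/v1alpha1",
    "kind": "Subscription",
    "metadata": {"name": "pulp-operator", "namespace": "pulp-operator"},
    "spec": {
        "name": "pulp-operator",
        "channel": "beta",
        "source": "custom-pulp-catalog",
        "sourceNamespace": "pulp-operator",
    },
}


def switch_catalog(sub, source, source_namespace):
    """Return a copy of the Subscription that points at another CatalogSource."""
    updated = copy.deepcopy(sub)
    updated["spec"]["source"] = source
    updated["spec"]["sourceNamespace"] = source_namespace
    return updated


# Assumed target: a community catalog living in a marketplace namespace.
migrated = switch_catalog(subscription, "community-operators", "openshift-marketplace")

# List the spec fields that differ between the two manifests.
changed = sorted(
    key for key in migrated["spec"] if migrated["spec"][key] != subscription["spec"][key]
)
print(changed)
```

Only the two catalog-related fields differ; the package name and channel would stay the same as long as the upstream catalog publishes them under the same names.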

Summary

The decision is up to you, both approaches are valid. Either you want to maintain an OLM catalog for your dev purposes (you already have CSV up to date, so the overhead is not that big) or you'd rather copy and paste the cluster-scoped resources into our repository via PRs. Either is fine with us I think. :slightly_smiling_face:

cc @fao89 @fridex @ipanova

fao89 commented 3 years ago

we are planning to submit our operator to OperatorHub, so I would vote option 2

tumido commented 3 years ago

cc @HumairAK @4n4nd are we also good on using the custom catalog/subscription for the time being (until the pulp operator reaches OperatorHub)?

4n4nd commented 3 years ago

yeah using the custom catalog/subscription sounds good to me :+1:

tumido commented 3 years ago

@fao89 @fridex

Pulp operator is available at cluster scope. It's operated from the pulp-operator namespace which is owned by the pulp user group:

image

Operator is up and running. @fridex can you please try deploying any Pulp* CR in any of your namespaces to see if it works?

Also.. a quick thought, @fridex do you want to have access to the pulp-operator namespace as well? (So you can access the operator logs or what not in case you need it..)

fridex commented 3 years ago

> @fao89 @fridex
>
> Pulp operator is available at cluster scope. It's operated from the pulp-operator namespace which is owned by the pulp user group:

Awesome, thanks for the work 👍🏻

> Operator is up and running. @fridex can you please try deploying any Pulp* CR in any of your namespaces to see if it works?

I tried to provision Pulp in thoth-test-core namespace. It looks like only postgres was provisioned:

Screenshot_2021-05-24_11-59-22

> Also.. a quick thought, @fridex do you want to have access to the pulp-operator namespace as well? (So you can access the operator logs or what not in case you need it..)

That might be good, but not essential as I do not have pulp expertise. Would it be possible to onboard Pulp team representatives @ipanova and/or @fao89?

tumido commented 3 years ago

@mikedep333 and @fao89, you already have access: https://console-openshift-console.apps.zero.massopen.cloud/k8s/cluster/projects/pulp-operator

@ipanova wanna be added as well? :slightly_smiling_face:

tumido commented 3 years ago

@fridex here's the operator log for you, it seems to be failing due to quota on that namespace:

{"ansible_loop_var": "item", "changed": false, "error": 403, "item": "redis", "msg": "Failed to create object: b'{\"kind\":\"Status\",\"apiVersion\":\"v1\",\"metadata\":{},\"status\":\"Failure\",\"message\":\"persistentvolumeclaims \\\\\"example-pulp-redis-data\\\\\" is forbidden: exceeded quota: thoth-test-core-custom, requested: requests.storage=1Gi, used: requests.storage=40Gi, limited: requests.storage=40Gi\",\"reason\":\"Forbidden\",\"details\":{\"name\":\"example-pulp-redis-data\",\"kind\":\"persistentvolumeclaims\"},\"code\":403}\\n'", "reason": "Forbidden", "status": 403}

PLAY RECAP

You probably want to request an increase by changing the manifest here: https://github.com/operate-first/apps/blob/master/cluster-scope/base/core/namespaces/thoth-test-core/resourcequota.yaml

Full operator log attached.

log.txt
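
To read such a quota error at a glance, the numbers can be pulled out of the message text. This is only a sketch: the message string is copied from the log above, `parse_storage_quota` is a hypothetical helper, and it assumes every value is reported in whole Gi.

```python
import re

# Message text copied from the operator log above.
message = (
    'persistentvolumeclaims "example-pulp-redis-data" is forbidden: '
    "exceeded quota: thoth-test-core-custom, requested: requests.storage=1Gi, "
    "used: requests.storage=40Gi, limited: requests.storage=40Gi"
)


def parse_storage_quota(msg):
    """Extract requested/used/limited storage (in Gi) from a quota error."""
    values = {}
    for key in ("requested", "used", "limited"):
        match = re.search(rf"{key}: requests\.storage=(\d+)Gi", msg)
        if match is None:
            raise ValueError(f"no '{key}' value in message")
        values[key] = int(match.group(1))
    return values


quota = parse_storage_quota(message)
headroom = quota["limited"] - quota["used"]
print(quota, "headroom:", headroom, "Gi")
```

With the namespace already at its 40Gi limit, there is no headroom left, so even the 1Gi redis claim is rejected.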

fridex commented 3 years ago

Thanks @tumido. Long-term, it would probably be better to separate Pulp. Would it be possible to create a separate namespace for Pulp experiments so that others involved also have access to it? Please do let me know if there is already a procedure for it.

tumido commented 3 years ago

@fridex sure, you can follow this doc here: https://www.operate-first.cloud/users/support/docs/onboarding_to_cluster.md

You can open an issue in this repo using an onboarding template or DIY via a PR

For a PR you can either use our onboarding script in the apps repo, as described in the doc linked above, or give our brand new CLI tool a try (if you feel adventurous).

  1. create namespace resource and assign owners
    cd operate-first/apps
    # via script
    scripts/onboarding.sh thoth-pulp-experiments thoth
    # or via cli
    opfcli create-project thoth-pulp-experiments thoth
  2. set quota (this is still manual right now), please follow https://github.com/operate-first/apps/blob/master/docs/cluster-scope/add_resource_quotas.md
  3. Import namespace to target cluster by editing cluster-scoped/overlays/moc/zero/kustomization.yaml and adding a new line to the resources list.

That should be all..

fridex commented 3 years ago

@tumido thanks for the manual - it looks like I would require gpg keys, @harshad16 was open to make this happen so I leave it to pros :) opened https://github.com/operate-first/support/issues/252 to track this

Thanks 👍🏻

tumido commented 3 years ago

@fridex your namespace should be available now at https://console-openshift-console.apps.zero.massopen.cloud/k8s/cluster/projects/thoth-pulp-experiments

LMK if it works properly now :slightly_smiling_face:

fridex commented 3 years ago

@tumido awesome, thanks! I was able to access the namespace and deploy pulp (partially). The operator started postgres and redis, but not pulp itself. Might be a resource limitation again? I see just 2 CPUs available in medium. Could you please check this? Thanks! 👍🏻

Screenshot_2021-05-26_13-26-29

tumido commented 3 years ago

Yeah, you're hitting quota again. You're creating a Pulp instance with 50Gi storage + redis allocates 1Gi + postgres allocates 8Gi -> 59Gi in total. Your quota is 40Gi now. I'll increase it from 40Gi to 60Gi for you on that namespace and we'll see where that goes. :slightly_smiling_face:

task path: /opt/ansible/roles/pulp-api/tasks/main.yml:15
skipping: [localhost] => {"changed": false, "skip_reason": "Conditional result was False"}

TASK [pulp-api : pulp-file-storage persistent volume claim] ********************
task path: /opt/ansible/roles/pulp-api/tasks/main.yml:20
[DEPRECATION WARNING]: evaluating 'file_storage' as a bare variable, this behaviour will go away and you might need to add |bool to the expression in the future. Also see CONDITIONAL_BARE_VARS configuration toggle. This feature will be removed in version 2.12. Deprecation warnings can be disabled by setting deprecation_warnings=False in ansible.cfg.
failed: [localhost] (item=pulp-file-storage) => {"ansible_loop_var": "item", "changed": false, "error": 403, "item": "pulp-file-storage", "msg": "Failed to create object: b'{\"kind\":\"Status\",\"apiVersion\":\"v1\",\"metadata\":{},\"status\":\"Failure\",\"message\":\"persistentvolumeclaims \\\\\"example-pulp-file-storage\\\\\" is forbidden: exceeded quota: medium, requested: requests.storage=50Gi, used: requests.storage=9Gi, limited: requests.storage=40Gi\",\"reason\":\"Forbidden\",\"details\":{\"name\":\"example-pulp-file-storage\",\"kind\":\"persistentvolumeclaims\"},\"code\":403}\\n'", "reason": "Forbidden", "status": 403}

PLAY RECAP
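
The storage arithmetic from this comment, as a small sketch. The per-component sizes and the 40Gi quota are the figures quoted above; the list of candidate quota sizes is a hypothetical illustration, not a list of tiers the cluster actually offers.

```python
# Storage figures (in Gi) as quoted in the comment above.
requests_gi = {"pulp-file-storage": 50, "redis": 1, "postgres": 8}
quota_gi = 40

total = sum(requests_gi.values())
print(f"total requested: {total}Gi, quota: {quota_gi}Gi, fits: {total <= quota_gi}")

# Hypothetical candidate quota sizes; pick the smallest one that fits everything.
candidate_quotas_gi = [40, 60, 80]
needed = next(q for q in candidate_quotas_gi if q >= total)
print(f"smallest candidate quota that fits: {needed}Gi")
```
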
tumido commented 3 years ago

Ok, now it seems to be complaining due to a different issue:

[pulp-api : Store admin password] *****************************************
task path: /opt/ansible/roles/pulp-api/tasks/admin_password_configuration.yml:45
ok: [localhost] => {"ansible_facts": {"admin_password": "Lh8NWJp2bUkqSHToZN28xw0Xzvm28ix4"}, "changed": false}

TASK [pulp-api service] ********************************************************
task path: /opt/ansible/roles/pulp-api/tasks/main.yml:77
failed: [localhost] (item=pulp-api) => {"ansible_loop_var": "item", "changed": false, "error": 422, "item": "pulp-api", "msg": "Failed to create object: b'{\"kind\":\"Status\",\"apiVersion\":\"v1\",\"metadata\":{},\"status\":\"Failure\",\"message\":\"Service \\\\\"example-pulp-api-svc\\\\\" is invalid: spec.ports[0].nodePort: Invalid value: 24817: provided port is not in the valid range. The range of valid ports is 30000-32767\",\"reason\":\"Invalid\",\"details\":{\"name\":\"example-pulp-api-svc\",\"kind\":\"Service\",\"causes\":[{\"reason\":\"FieldValueInvalid\",\"message\":\"Invalid value: 24817: provided port is not in the valid range. The range of valid ports is 30000-32767\",\"field\":\"spec.ports[0].nodePort\"}]},\"code\":422}\\n'", "reason": "Unprocessable Entity", "status": 422}

PLAY RECAP

Operator is creating example-pulp-api-svc service and complains that nodePort has invalid value of 24817 (valid range is 30000-32767). Any idea where that comes from?
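
For reference, the check the API server applies here boils down to a range test. A minimal sketch, using the 30000-32767 range reported in the error; `is_valid_node_port` is a hypothetical helper, not part of any Kubernetes client.

```python
# NodePort range reported in the error above (inclusive on both ends).
NODE_PORT_RANGE = range(30000, 32768)


def is_valid_node_port(port, allowed=NODE_PORT_RANGE):
    """Return True if the port falls inside the allowed NodePort range."""
    return port in allowed


print(is_valid_node_port(24817))  # the port the operator asked for
print(is_valid_node_port(30000))  # lowest port in the range
```

The requested 24817 falls below the range, which is why the Service is rejected with a 422 before anything is created.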

@fridex is deploying pulp manifest that looks like this:

apiVersion: pulp.pulpproject.org/v1beta1
kind: Pulp
metadata:
  name: example-pulp
  namespace: thoth-pulp-experiments
spec:
  route_tls_termination_mechanism: Edge
  loadbalancer_port: 80
  image_pull_policy: IfNotPresent
  image_web: pulp-web
  file_storage:
    access_mode: ReadWriteMany
    size: 50Gi
  project: pulp
  tag: latest
  image: pulp
  loadbalancer_protocol: http
  registry: quay.io
  storage_type: File

any idea what's going on? @fao89 @ipanova

operator log attached: log.txt

fao89 commented 3 years ago

@tumido I'm not familiar with OCP, but we expand the nodeport range on minikube

minikube start --vm-driver=docker --extra-config=apiserver.service-node-port-range=80-32000

https://github.com/pulp/pulp-operator/blob/main/.github/workflows/ci.yml#L37

tumido commented 3 years ago

Oh, sorry to hear that, I don't have good news for you, then. :disappointed:

In OCP this is more complicated. You need to ensure the port range is available and allowed in any underlying OCP provider (AWS, GCP, bare metal) and not blocked by any firewall on the infra level and above. Then you can go and change the range for the whole cluster network.

Hm, I don't think you want your operator to be that opinionated about how the cluster is set up and how the infrastructure beneath OCP behaves. In other words, you'd need to make this setting a prerequisite for Pulp, since this goes even beyond cluster-admin permissions.

tumido commented 3 years ago

I've noticed this if statement. Is there a way to work around this and make it so it doesn't bind to a node port? (or am I missing something there..)

https://github.com/pulp/pulp-operator/blob/221c7652118d6c1c6dcda785fe5d651f14e0b101/roles/pulp-api/templates/pulp-api.service.yaml.j2#L25

fao89 commented 3 years ago

> Oh, sorry to hear that, I don't have good news for you, then.
>
> In OCP this is more complicated. You need to ensure the port range is available and allowed in any underlying OCP provider (AWS, GCP, bare metal) and not blocked by any firewall on the infra level and above. Then you can go and change the range for the whole cluster network.
>
> Hm, I don't think you want your operator to be that opinionated about how the cluster is set up and how the infrastructure beneath OCP behaves. In other words, you'd need to make this setting a prerequisite for Pulp, since this goes even beyond cluster-admin permissions.

@mikedep333 @dkliban @mdellweg ^

> I've noticed this if statement. Is there a way to work around this and make it so it doesn't bind to a node port? (or am I missing something there..)
>
> https://github.com/pulp/pulp-operator/blob/221c7652118d6c1c6dcda785fe5d651f14e0b101/roles/pulp-api/templates/pulp-api.service.yaml.j2#L25

I think we can do it, wdyt @chambridge ?

@tumido could you please file an issue? https://github.com/pulp/pulp-operator#how-to-file-an-issue

tumido commented 3 years ago

@fao89 here's the ticket https://pulp.plan.io/issues/8833

(In the end I had to create the Plan account I was refusing to last time, lol.. :smile: )

Btw. I've also noticed one more issue caused by the nodePort usage in this service (also described in the issue) - even if we bind a node port from the allowed range, it will still make Pulp API service basically a singleton for the cluster. NodePort means a "physical" port on the nodes, so it can be bound only to one service on the cluster. This results in no other Pulp resource anywhere on the cluster being able to deploy successfully if there's already a Pulp api server present.
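
The singleton effect can be shown with a toy model. This is a sketch of the uniqueness rule only, not of how Kubernetes allocates ports internally; the `NodePortAllocator` class, the port 30080, and the namespace names are all hypothetical.

```python
class NodePortAllocator:
    """Toy model: each node port can back at most one Service cluster-wide."""

    def __init__(self):
        self.in_use = {}  # port -> "namespace/service"

    def bind(self, port, service):
        """Reserve the port for a Service, or fail if another Service holds it."""
        if port in self.in_use:
            raise ValueError(
                f"port {port} is already allocated to {self.in_use[port]}"
            )
        self.in_use[port] = service


cluster = NodePortAllocator()
cluster.bind(30080, "team-a/example-pulp-api-svc")

# A second Pulp instance using the same hard-coded port cannot deploy.
try:
    cluster.bind(30080, "team-b/example-pulp-api-svc")
except ValueError as err:
    second_error = str(err)
    print(second_error)
```

Because the port number is fixed in the operator's template, every Pulp CR asks for the same port, so the second bind fails regardless of namespace.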

chambridge commented 3 years ago

If you are deploying on OpenShift I'd suggest using

ingress_type: Route
route_tls_termination_mechanism: Edge

There was a recent PR that went in to fix the nodeport templating so it's only used when that ingress_type is specified: https://github.com/pulp/pulp-operator/pull/146

I think we could leverage https://pulp.plan.io/issues/8833 to move the "default" values into a more friendly range and update the CRD so it could be configurable (so multiple nodeport pulp instances can be deployed)

fao89 commented 3 years ago

@tumido with @chambridge's help I was able to successfully deploy pulp, I tried to document it here: https://github.com/pulp/pulp-operator/pull/149

fridex commented 3 years ago

I've just tried to provision pulp in thoth-pulp-experiments namespace, it looks like the issue reported above still persists (only postgres and redis are up). I see the referenced PR is merged, is there anything else blocking us from provisioning the instance? Thanks 👍🏻

tumido commented 3 years ago

@fridex I had to update the custom catalog with Pulp operator to get the update propagated to the cluster. The operator progressed and updated.

I think a better solution in the long run is to switch to the upstream catalog @fao89 mentioned in his docs. Once https://github.com/operate-first/apps/pull/703 merges we can reinstall the operator and you can try it again. And if the Pulp team updates their catalog with an update/replacement CSV, the operator should progress automatically.

Using the official catalog, you should be able to follow @fao89's guide here: https://github.com/pulp/pulp-operator/blob/main/docs/quickstart.md

tumido commented 3 years ago

@fridex I'll also add you to the pulp-operator namespace, so you can observe the operator logs for yourself as well. :slightly_smiling_face:

tumido commented 3 years ago

@fridex we've now switched to the "official" catalog image via https://github.com/operate-first/apps/pull/703

I've reinstalled the operator on the cluster, so now it should be up to date.

You also have access to pulp-operator namespace now.

fridex commented 3 years ago

Thanks 👍🏻

I tried to provision pulp once again, but still no success:

From the operator logs I can see:

task path: /opt/ansible/roles/pulp-api/tasks/main.yml:77
failed: [localhost] (item=pulp-api) => {"ansible_loop_var": "item", "changed": false, "error": 422, "item": "pulp-api", "msg": "Failed to create object: b'{\"kind\":\"Status\",\"apiVersion\":\"v1\",\"metadata\":{},\"status\":\"Failure\",\"message\":\"Service \\\\\"example-pulp-api-svc\\\\\" is invalid: spec.ports[0].nodePort: Invalid value: 24817: provided port is not in the valid range. The range of valid ports is 30000-32767\",\"reason\":\"Invalid\",\"details\":{\"name\":\"example-pulp-api-svc\",\"kind\":\"Service\",\"causes\":[{\"reason\":\"FieldValueInvalid\",\"message\":\"Invalid value: 24817: provided port is not in the valid range. The range of valid ports is 30000-32767\",\"field\":\"spec.ports[0].nodePort\"}]},\"code\":422}\\n'", "reason": "Unprocessable Entity", "status": 422}
ipanova commented 3 years ago

Isn't this related to this issue https://pulp.plan.io/issues/8833? @chambridge @mikedep333 can you please keep an eye on this thread while @fao89 is on PTO? What needs to be done to progress with 8833?

chambridge commented 3 years ago

I'm happy to jump on and help debug if you want to give me access. I'm interested in the CR being used for the deploy as well as the logs. I wouldn't think you would be deploying where nodeport would be in use, so I'm a bit confused how you're hitting the issue. 8833 could be worked on for this, but it really shouldn't be necessary for an OpenShift deployment.

chambridge commented 3 years ago

Can someone send me login information? I assume the above granted me login and RBAC privileges.

4n4nd commented 3 years ago

@chambridge you shooould be able to use your @redhat.com account to log in

tumido commented 3 years ago

@chambridge Use https://console-openshift-console.apps.zero.massopen.cloud/ and select MOC SSO - then use your RH account via Google auth provider.

chambridge commented 3 years ago

The postgres & redis pods are currently stuck in pending waiting for PVCs to get created. Is anyone able to help with that?

image

image

4n4nd commented 3 years ago

@chambridge looks like the current default storageclass isn't working for some reason, as a workaround could you please try using ocs-storagecluster-cephfs storageclass for now?

chambridge commented 3 years ago

Unfortunately, while postgres had a custom resource field for this, redis did not. I have the following PR out to add this capability into the operator.

4n4nd commented 3 years ago

@chambridge the issue with the storageclass should be resolved and your PVCs should be provisioned.

tumido commented 3 years ago

Pulp is hitting the quota again (this is expected, since we're not familiar with the requirements). After a chat with @chambridge we decided to bump the namespace quota to 8 CPUs for now and see where that leads us.