[BUG] Schema Definitions returns a 503 for certain resource types

MbolotSuse commented 3 months ago

Rancher Server Setup

Rancher version: v2.9-head (c9be13b09329bbee60a5f6419d500198f83c44d1)
Installation option (Docker install/Helm Chart): Docker install
Proxy/Cert Details: N/A

Information about the Cluster

Kubernetes version: v1.27.10+k3s2
Cluster Type (Local/Downstream): Local
- If downstream, what type of cluster? (Custom/Imported or specify provider for Hosted/Infrastructure Provider): N/A

User Information

What is the role of the user logged in? (Admin)
- If custom, define the set of permissions: N/A

Describe the bug When requesting schema definitions, certain resources always return a 503, even after a significant amount of time has passed. Below is the list of affected resources, provided by @Priyashetty17:

management.cattle.io.userattribute              
management.cattle.io.globaldnsprovider            
management.cattle.io.catalog                 
project.cattle.io.app                    
management.cattle.io.user                  
management.cattle.io.nodetemplate              
generateKubeconfigOutput                   
management.cattle.io.projectnetworkpolicy          
management.cattle.io.template                
management.cattle.io.rkeaddon                
management.cattle.io.podsecuritypolicytemplate        
management.cattle.io.clustertemplate             
management.cattle.io.templatecontent             
management.cattle.io.podsecuritypolicytemplateprojectbinding 
management.cattle.io.rancherusernotification         
management.cattle.io.templateversion             
management.cattle.io.composeconfig              
project.cattle.io.apprevision                
management.cattle.io.dynamicschema              
management.cattle.io.authconfig               
management.cattle.io.clustertemplaterevision         
management.cattle.io.nodepool                
management.cattle.io.multiclusterapprevision         
management.cattle.io.catalogtemplateversion         
management.cattle.io.token                  
management.cattle.io.rkek8ssystemimage            
management.cattle.io.kontainerdriver             
management.cattle.io.globaldns                
management.cattle.io.clustercatalog             
management.cattle.io.etcdbackup               
management.cattle.io.nodedriver               
management.cattle.io.rkek8sserviceoption           
management.cattle.io.group                  
management.cattle.io.projectcatalog             
management.cattle.io.groupmember               
management.cattle.io.catalogtemplate             
management.cattle.io.samltoken                
management.cattle.io.node                  
management.cattle.io.multiclusterapp

To Reproduce

Run rancher/rancher:v2.9-head
Retrieve the schemaDefinition for any of the types above.

Result A 503 response is returned.

Expected Result An object should be returned, with the same fields provided by the 2.8 schema object.

Screenshots N/A

Additional context It is possible that some of these objects are being seen as "arbitrary" objects since the definitions are minimal.

richard-cox commented 2 months ago

I see the same for monitoring.coreos.com.alertmanagerconfig, brought in via the monitoring app. Should it work for non-core Rancher CRDs?

MbolotSuse commented 2 months ago

@richard-cox Keep in mind that a temporary 503 is expected behavior. When a new CRD is added, there may be a delay between when the installation occurs and when the definition becomes available. However, this should resolve shortly (within a matter of seconds). This issue is about resources that return a constant 503 that never resolves; that part is a bug. Since the resource type you mentioned isn't part of core Rancher, it is subject to a possible temporary 503.

That being said, when I tried it locally, that resource appears to be one of the indefinite 503s, so I'll aim to fix it as part of this ticket.

This ticket will only aim to fix the permanent 503. LMK if you have any follow-up questions.
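
To tell a temporary 503 from a persistent one, a client can poll for a short while before giving up. The following is a minimal sketch, not part of Rancher or steve: `definition_becomes_available` and its parameters are hypothetical names, and `fetch_status` stands in for whatever HTTP call returns the status code of the schema definition request.

```python
import time
from typing import Callable


def definition_becomes_available(
    fetch_status: Callable[[], int],
    attempts: int = 10,
    delay_seconds: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll until the status is no longer 503, or give up after `attempts` tries.

    Returns True if the 503 resolved within the polling window (the expected,
    temporary case) and False if every attempt returned 503 (the case this
    issue treats as a bug).
    """
    for attempt in range(attempts):
        if fetch_status() != 503:
            return True
        # Only wait when another attempt will follow.
        if attempt < attempts - 1:
            sleep(delay_seconds)
    return False
```

The attempt count and delay are arbitrary illustration values; the thread only says that a temporary 503 should clear within a matter of seconds.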

richard-cox commented 2 months ago

Unfortunately the 503 isn't temporary and is always returned. I didn't try on the local cluster, but see it on a downstream one.

/k8s/clusters/<cluster id>/v1/schemas/monitoring.coreos.com.alertmanagerconfig returns ok
/k8s/clusters/<cluster id>/v1/schemadefinitions/monitoring.coreos.com.alertmanagerconfig always 503's
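
The comparison above can be scripted roughly as follows. This is a sketch under assumptions: the helper names are hypothetical, and `get_status` is a placeholder for an authenticated HTTP GET against the Rancher server that returns the response status code. Only the two path shapes come from this comment.

```python
from typing import Callable, Dict


def schema_endpoints(cluster_id: str, schema_id: str) -> Dict[str, str]:
    """Build the schema and schema definition paths for one resource type."""
    base = f"/k8s/clusters/{cluster_id}/v1"
    return {
        "schema": f"{base}/schemas/{schema_id}",
        "schemaDefinition": f"{base}/schemadefinitions/{schema_id}",
    }


def has_schema_but_no_definition(
    get_status: Callable[[str], int], cluster_id: str, schema_id: str
) -> bool:
    """True when the schema is served (200) but its definition returns 503."""
    paths = schema_endpoints(cluster_id, schema_id)
    return (
        get_status(paths["schema"]) == 200
        and get_status(paths["schemaDefinition"]) == 503
    )
```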

MbolotSuse commented 2 months ago

Issue Summary

The schema definitions feature has a few bugs that can cause the definition for a schema to be unavailable or inaccurate:

1. Resources which were in one version of the group (e.g. v1alpha1) but not in the preferred version of the group (e.g. v1) were not included in the model cache used by the definition handler. This caused 503 errors since there was a schema for these resources, but no accessible model (and so no definition). monitoring.coreos.com.alertmanagerconfig appears to be one of these resources: it's in the v1alpha1 version but missing from the v1 version, and v1 is the preferred version. In these cases, steve should serve the definition from the version that's present, even though it's not the preferred version.
2. Some resources are Rancher-defined and don't represent real Kubernetes resources. These resources should still have a definition, derived from what Rancher specifies during startup. generateKubeconfigOutput is one of these resources.
3. Some resources don't have any fields specified in the CRD definition. This causes the handler to view these objects as a proto.Map instead of a proto.Kind, and since we only process proto.Kind objects (which represent root/top-level objects), we don't produce a definition for them, resulting in the 503 error. This is likely the root cause of the 503 on many of the management.cattle.io types, including management.cattle.io.userattribute. This will mostly be addressed by the fix for #45157.
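
The intended behavior for the first item (prefer the preferred version, but fall back to a version where the resource actually exists) can be sketched as below. This is an illustration in Python, not steve's Go implementation; `pick_version` is a hypothetical name, and the input is assumed to be the list of versions in which the resource has a model, in the order they were discovered.

```python
from typing import List, Optional


def pick_version(available_versions: List[str], preferred: Optional[str]) -> Optional[str]:
    """Choose which version's model to use for a resource's definition.

    available_versions: versions of the group that contain this resource,
    in discovery order. preferred: the group's preferred version, if any.
    """
    if not available_versions:
        # No model in any version, so no definition can be produced.
        return None
    if preferred is not None and preferred in available_versions:
        return preferred
    # Either the group has no preferred version, or the resource is missing
    # from it: fall back to the first version that has the resource.
    return available_versions[0]
```

For monitoring.coreos.com.alertmanagerconfig, `pick_version(["v1alpha1"], "v1")` selects `v1alpha1` instead of producing no model.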

MbolotSuse commented 2 months ago

Validation Template

Root Cause

As explained in the above issue summary, there were a few issues with how schema definitions were formed.

1. For cases where a resource (e.g. monitoring.coreos.com.alertmanagerconfig) existed in one version of a group (e.g. v1alpha1) but not in the preferred version of the group (e.g. v1), we would return a 503 error and not produce a definition for the schema.
2. There are some resources which aren't real Kubernetes resources (e.g. counts and generateKubeconfigOutput). These would return a 503 error and not produce a definition when the definition was requested for that type.

Note that the 3rd issue mentioned in the summary was not fixed here.

What was fixed, or what changes have occurred

The logic for preferred versions was tweaked slightly. In the new state:
- If there is a model/schema for this resource in the preferred group/version, we use the version in the preferred group/version.
- If the group has no preferred version, we use the first version of this resource that we find.
- If the group has a preferred version, but the specific resource isn't in that version (for example, because it was removed), then we use the first version of this resource that we find.
We now add the "baseSchemas" to the schema definition handler. These contain types like generateKubeconfigOutput which aren't in the OpenAPI doc.
- When a schema is requested that is part of the baseSchemas, we parse the definitions from the resource fields on the schema.
- This approach means that these schemas will still have resource fields when viewed through /v1/schemas.
- Note that this is very basic, so recursively defined structs will not have complete definitions (e.g. counts has a type map[string]itemCount). This appears to function the same as 2.8 (I did not find a distinct schema for itemCount on 2.8).
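
As a rough illustration of the baseSchemas approach and its non-recursive limitation, the sketch below builds a definition from a schema's resource fields only. It is written in Python rather than steve's Go, and the function name, the dictionary shapes, and the `counts` field name are simplified assumptions for illustration, not the actual response format.

```python
from typing import Dict


def definition_from_resource_fields(schema_id: str, resource_fields: Dict[str, dict]) -> dict:
    """Build a single, non-recursive definition from a schema's resource fields.

    Field types are copied as opaque strings. Nested types named inside them
    (for example itemCount) do not get their own entry in "definitions".
    """
    fields = {
        name: {"type": field.get("type"), "description": field.get("description", "")}
        for name, field in resource_fields.items()
    }
    return {
        "definitionType": schema_id,
        "definitions": {schema_id: {"type": schema_id, "resourceFields": fields}},
    }


# Hypothetical input resembling the counts example from the note above.
counts_definition = definition_from_resource_fields(
    "counts", {"counts": {"type": "map[string]itemCount"}}
)
```

Here `counts_definition["definitions"]` contains only the `counts` entry; there is no separate definition for `itemCount`, which matches the limitation described in the note.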

Areas or cases that should be tested

Cases where a resource is not in the preferred version of the group. The monitoring.coreos.com.alertmanagerconfig resource (which is available in the rancher-monitoring chart) is a good example.
Cases where a resource is in the preferred version, but there's another version of the CRD. The definition should reflect the resource in the preferred version (ideally this version should also be "later" in the list).
Cases of Rancher CRDs migrated to the RK API (e.g. GlobalRoles). These should have the full definition as seen in the openapi/v2 schema, and should not take any information from the baseSchemas. Note that you will still see 503 errors for older resource types (like Users).

What areas could experience regressions

Schema definitions for the above cases, and for the previously working endpoints.

Are the repro steps accurate/minimal?

N/A - see the original issue for more details.

Priyashetty17 commented 1 week ago

Validated with v2.9-ddabf6b2266255276352beb1eeb740d9e4de802d-head
