Closed: james-w closed this pull request 1 month ago
Hey, I've taken a look at your PR. I went ahead with reviewing this as part of the handover effort that started this morning. I'll let Olu do a final review, as I don't know much about this code base.
> Log a more informative error message when giving up on uploading readings to the server. This would be the last message before the pod exits, so if the pod ends up in CrashLoopBackOff it's important to highlight this as the reason.
~I wasn't able to reproduce the agent giving up~ I was able to reproduce and test that the last log line is the reason why the agent stopped by using a dummy key and waiting for the 15 min backoff to hit:
$ export HTTPS_PROXY=foo
$ go run . agent -c config.yaml --client-id XXXXX -k /tmp/key --venafi-cloud -p 5s 2>&1 | grep api.venafi.cloud
2024/05/14 14:26:57 Posting data to: https://api.venafi.cloud/
2024/05/14 14:26:58 retrying in 41.388139367s after error: post to server failed: failed to execute http request to VaaS. Request https://api.venafi.cloud/v1/oauth/token/serviceaccount, status code: 400, body: [{"error":"invalid_grant","error_description":"token_signature_verification_error"}
2024/05/14 14:27:39 Posting data to: https://api.venafi.cloud/
2024/05/14 14:27:40 retrying in 35.360750201s after error: post to server failed: failed to execute http request to VaaS. Request https://api.venafi.cloud/v1/oauth/token/serviceaccount, status code: 400, body: [{"error":"invalid_grant","error_description":"token_signature_verification_error"}
2024/05/14 14:28:15 Posting data to: https://api.venafi.cloud/
2024/05/14 14:28:15 retrying in 1m38.081853176s after error: post to server failed: failed to execute http request to VaaS. Request https://api.venafi.cloud/v1/oauth/token/serviceaccount, status code: 400, body: [{"error":"invalid_grant","error_description":"token_signature_verification_error"}
2024/05/14 14:29:53 Posting data to: https://api.venafi.cloud/
2024/05/14 14:29:54 retrying in 1m37.808502147s after error: post to server failed: failed to execute http request to VaaS. Request https://api.venafi.cloud/v1/oauth/token/serviceaccount, status code: 400, body: [{"error":"invalid_grant","error_description":"token_signature_verification_error"}
2024/05/14 14:31:31 Posting data to: https://api.venafi.cloud/
2024/05/14 14:31:32 retrying in 3m17.774884905s after error: post to server failed: failed to execute http request to VaaS. Request https://api.venafi.cloud/v1/oauth/token/serviceaccount, status code: 400, body: [{"error":"invalid_grant","error_description":"token_signature_verification_error"}
2024/05/14 14:34:50 Posting data to: https://api.venafi.cloud/
2024/05/14 14:34:50 retrying in 3m34.920269392s after error: post to server failed: failed to execute http request to VaaS. Request https://api.venafi.cloud/v1/oauth/token/serviceaccount, status code: 400, body: [{"error":"invalid_grant","error_description":"token_signature_verification_error"}
2024/05/14 14:38:25 Posting data to: https://api.venafi.cloud/
2024/05/14 14:38:25 Exiting due to fatal error uploading: post to server failed: failed to execute http request to VaaS. Request https://api.venafi.cloud/v1/oauth/token/serviceaccount, status code: 400, body: [{"error":"invalid_grant","error_description":"token_signature_verification_error"}
(I've hidden all the useless messages about missing resources)
About the REST config bug: well spotted. I wasn't able to reproduce the panic (if that's how you spotted it).
Anyway, I think all the changes you made are sensible.
> Hey, I've taken a look at your PR. I went ahead with reviewing this as part of the handover effort that started this morning. I'll let Olu do a final review, as I don't know much about this code base.
Thanks!
> Log a more informative error message when giving up on uploading readings to the server. This would be the last message before the pod exits, so if the pod ends up in CrashLoopBackOff it's important to highlight this as the reason.
> I wasn't able to reproduce the agent giving up, or I haven't waited long enough; I guess the backoff threshold is high.
I believe the default timeout is 15m before it gives up.
> About the REST config bug: well spotted. I wasn't able to reproduce the panic (if that's how you spotted it).
It was. Running locally without a kubeconfig file should reproduce.
> Anyway, I think all the changes you made are sensible.
Thanks.
Thanks all, I pushed an update with the suggested changes.
Thank you :+1: LGTM, I'll merge the changes to master
The current logs are somewhat confusing and are missing some very useful information. This is an attempt to clean them up in a few ways:
- Log the number of items from each datagatherer. Each datagatherer can return the number of items it collected, so the logs tell us a bit more about what was found in the cluster and can help find where any items have been missed.
- Log a more informative error message when giving up on uploading readings to the server. This would be the last message before the pod exits, so if the pod ends up in CrashLoopBackOff it's important to highlight this as the reason.
- Remove the "Running Agent" log message; it's not clear what it means.
- Put [] around the body, because if the body is empty the log message can read as though the following message is the body.
- Remove "using parent stop channel" from a log message, as it's not clear what it means. It used to be conditional and now it isn't, so it provides no information, and it's confusing as to what it might be trying to tell you if you aren't familiar with that code.
There are a couple of code cleanups included as well:
- Fix a crash bug when the REST config can't be loaded: `discoveryClient` is uninitialized, so the pointer can't be dereferenced; return an empty struct if there's an error.
- Remove the informer context in the dynamic dg. The informer context is created but not passed to anything, so this code doesn't do anything.
The changes are split into a few logical commits in case we want to break them up and review them independently, as I know these might not be the best approaches for all of these things.