tektoncd / dashboard

A dashboard for Tekton!

Logs Persistence Fallback not Loading #2196

Closed dasxx17 closed 3 years ago

dasxx17 commented 3 years ago

Describe the bug

Not sure if this is related to the lack of the open/download links feature for external logs in v0.19.0, but when configuring the fallback I am not seeing the logs appear again in my dashboard. I have the external-logs flag configured as per the tutorial:
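
(Roughly how I verified the flag is set, assuming the default deployment name from the release manifests; the value should be whatever URL the logs service is exposed at:)

kubectl -n tekton-pipelines get deployment tekton-dashboard \
  -o jsonpath='{.spec.template.spec.containers[0].args}'
# the output should include --external-logs=<logs service URL> among the other args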

But I am still seeing the "Unable to fetch logs" message when browsing tasks. I have verified that the logs are indeed appearing in my Minio server's S3 bucket. All the pods look healthy (operator, server, fluentbit, fluentd), with the exception of some restarts on the server pod:

  Normal  Pulled  61s  kubelet  Container image "node:14" already present on machine

I do not believe this would impact it, but I did deploy an ALB Ingress Controller to expose the logs-server deployment. Thanks for your help!

Steps to reproduce the bug

Follow https://github.com/tektoncd/dashboard/blob/main/docs/walkthrough/walkthrough-logs.md

Environment details

- Kubernetes version: 1.18
- Cloud-provider/provisioner: EKS
- Versions:
  - Tekton Dashboard: 0.18.1
  - Tekton Pipelines: 0.26.0
  - Tekton Triggers: 0.15.0
- Install namespaces:
  - Tekton Dashboard: tekton-pipelines
  - Tekton Pipelines: tekton-pipelines
  - Tekton Triggers: tekton-pipelines

dasxx17 commented 3 years ago

CC @AlanGreene (have not tested with the latest PR; not sure of the quickest way to get the API updates from HEAD without building it)

AlanGreene commented 3 years ago

@dasxx17 there have been a few changes to external logs since v0.18. You could try the latest nightly build which includes the fix for the open/download links: https://storage.googleapis.com/tekton-releases-nightly/dashboard/latest/tekton-dashboard-release.yaml
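
Assuming you installed from the release manifests, applying the nightly over your existing install should be enough:

kubectl apply -f https://storage.googleapis.com/tekton-releases-nightly/dashboard/latest/tekton-dashboard-release.yaml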

Let me know if that resolves the issue for you. I'll most likely be releasing v0.20 on Monday so that build should be pretty close to what will be included.

AlanGreene commented 3 years ago

There was an issue, since fixed, where in at least one scenario v0.19 wasn't correctly proxying requests to the external logs service.

v0.20 is out now. I re-tested the external logs walkthrough in a clean environment with Dashboard v0.20 and it's working as expected. 🤞

If the new release doesn't resolve the problem for you we can reopen this issue.

dasxx17 commented 3 years ago

@AlanGreene thanks for the reply. I applied v0.20 and deleted my entire tools namespace. I then redid the entire walkthrough verbatim (with the exception of modifying the ACCESSKEY and SECRETKEY for the S3 bucket) and am still unable to either download the log files or have the logs fall back successfully. All my log pods are still healthy, as are my tekton pods, and I am able to view the log files (just not their content when I download them) in the minio UI. Please let me know if there is any other info I can provide to help with this (my k8s cluster version is 1.18, not sure which version you are testing on). Thank you!

AlanGreene commented 3 years ago

I've been testing on IKS (1.20), OpenShift CodeReady Containers (OpenShift 4.7), and kind (Kubernetes 1.20 + 1.21).

I'll run through the walkthrough again on a new kind cluster with 1.18 and see how it goes

/reopen

tekton-robot commented 3 years ago

@AlanGreene: Reopened this issue.

In response to [this](https://github.com/tektoncd/dashboard/issues/2196#issuecomment-915095037):

> I've been testing on IKS (1.20), OpenShift CodeReady Containers (OpenShift 4.7), and [kind](https://kind.sigs.k8s.io/) (Kubernetes 1.20 + 1.21)
>
> I'll run through the walkthrough again on a new kind cluster with 1.18 and see how it goes
>
> /reopen

Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes/test-infra](https://github.com/kubernetes/test-infra/issues/new?title=Prow%20issue:) repository.

AlanGreene commented 3 years ago

I've just run through the walkthrough with Kubernetes 1.18 and I had to make a few changes to the ingress definitions to work with that version. I'll update the docs to mention the minimum/recommended version for the walkthroughs.

Here are the steps I modified to match your versions:

With these changes it worked as expected, so I would start by checking that your ingresses are healthy and connected correctly.
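
For example (the ingress name is taken from your config below, and the namespace from the walkthrough; adjust as needed):

# confirm each ingress has an address assigned
kubectl get ingress --all-namespaces
# check events and backends for the logs ingress
kubectl describe ingress logging-ingress -n tools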

Next, open your browser dev tools, switch to the network tab, then navigate to the logs view in the Dashboard.

You should see a request for the pod logs, e.g.:

GET http://tekton-dashboard.127.0.0.1.nip.io/api/v1/namespaces/tekton-pipelines/pods/sample-tgq2c-gen-log-gdffk-pod-n9hmr/log?container=step-gen-log&follow=true

If you have deleted the TaskRun Pod this should fail with a 404 response, and you should see a subsequent request to the Dashboard external logs proxy:

GET http://tekton-dashboard.127.0.0.1.nip.io/v1/logs-proxy/tekton-pipelines/sample-tgq2c-gen-log-gdffk-pod-n9hmr/step-gen-log

This request is backed by the logs service created in the walkthrough; you should be able to access it directly, e.g. for the request above:

http://logs.127.0.0.1.nip.io/logs/tekton-pipelines/sample-tgq2c-gen-log-gdffk-pod-n9hmr/step-gen-log
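
A quick way to check each hop from your machine is with curl, using the example hostnames and pod name above (substitute your own):

# 1. pod logs via the Dashboard API (404 expected once the Pod is deleted)
curl -i "http://tekton-dashboard.127.0.0.1.nip.io/api/v1/namespaces/tekton-pipelines/pods/sample-tgq2c-gen-log-gdffk-pod-n9hmr/log?container=step-gen-log&follow=true"
# 2. external logs via the Dashboard proxy
curl -i http://tekton-dashboard.127.0.0.1.nip.io/v1/logs-proxy/tekton-pipelines/sample-tgq2c-gen-log-gdffk-pod-n9hmr/step-gen-log
# 3. the logs service directly
curl -i http://logs.127.0.0.1.nip.io/logs/tekton-pipelines/sample-tgq2c-gen-log-gdffk-pod-n9hmr/step-gen-log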

dasxx17 commented 3 years ago

Thanks for the very detailed reply. I am suspecting that there is indeed some issue with my Ingress, as I am not able to reach the Logs Server with my host name. Following what you said, after deleting the pod I received a 502 and a 404 for the two requests:

502: https://my-host-name/v1/logs-proxy/tekton-pipelines/sample-rqgvk-gen-log-fls7q-pod-svh7r/step-gen-log
404: https://my-host-name/api/v1/namespaces/tekton-pipelines/pods/sample-rqgvk-gen-log-fls7q-pod-svh7r/log?container=step-gen-log-2&follow=true

The only difference is that I am not able to access the endpoint directly.

I applied an ALB Ingress Controller for my Ingress; it is defined below:

apiVersion: extensions/v1beta1
kind: Ingress
metadata:
  name: logging-ingress
  annotations:
    alb.ingress.kubernetes.io/scheme: internal
    alb.ingress.kubernetes.io/listen-ports: '[{"HTTP": 80}, {"HTTPS": 443}]'
    alb.ingress.kubernetes.io/load-balancer-attributes: routing.http2.enabled=true
    alb.ingress.kubernetes.io/success-codes: 404,200
    alb.ingress.kubernetes.io/ssl-policy: ELBSecurityPolicy-2016-08
    alb.ingress.kubernetes.io/certificate-arn: $CERT
    alb.ingress.kubernetes.io/subnets: $SUBNETS
    service.beta.kubernetes.io/aws-load-balancer-internal: "true"
    kubernetes.io/ingress.class: alb
spec:
  rules:
    - host: my-host-name
      http:
        paths:
          - path: /*
            backend:
              serviceName: logs-server
              servicePort: 3000

My service definition is the same as yours, except I also tried changing the type from ClusterIP to NodePort. Also, I am noticing that my logs-server pod keeps restarting with:

  Normal   Pulled   28m (x177 over 26h)    kubelet  Container image "node:14" already present on machine
  Warning  BackOff  4m15s (x861 over 26h)  kubelet  Back-off restarting failed container

However, the thing I am confused about is why I am unable to view the downloaded logs. The only deviation I have is that my minio server is just running locally, but I should still be able to download the logs and view them even if they're running on localhost (I think).

TL;DR: It seems like I am running into some pain points with my Ingress mapping due to my corporate environment. I don't think it should matter that you are running IKS and I am running EKS, as the rest is the same.

Thank you so much for all your help. I will try to get the logs server up, but please let me know if you see an issue with my using an ALB Ingress Controller or some other issue with my Ingress definition. I need to internally request a CNAME mapping with our corporate domain to get the host name mapping for the LB endpoint... however, I am getting a 404 when I curl the Load Balancer that is mapped to my logs-server service, so perhaps there is something wrong there (or with the frequent restarts). Any other information would be super helpful. Thanks!

AlanGreene commented 3 years ago

The restarting logs-server is definitely unexpected; check the logs for that pod to see if there's any clue as to why it's continually failing. The 502 error response makes sense if the Dashboard is unable to proxy requests to the logs-server.
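
For example, assuming the logs server is in the tools namespace from the walkthrough:

# events and container status for the restarting pod (substitute your pod name)
kubectl -n tools describe pod <logs-server-pod-name>
# output from the previous (crashed) container instance
kubectl -n tools logs <logs-server-pod-name> --previous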

When you say your minio server is running on localhost, do you mean on the host machine? How are you exposing it to services running in the cluster? Can you make a request directly from a pod in the cluster to your minio server and get a successful response, e.g. using curl?
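
For example, something like this from a throwaway pod (the image is just a convenient curl image, and the address should be whatever your minio server is reachable at from inside the cluster):

kubectl run curl-test --rm -it --restart=Never --image=curlimages/curl --command -- \
  curl -v http://<minio-address>:9000
# any HTTP response at all (even a 403) shows the cluster can reach the server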

I wouldn't expect the ALB to make any difference here, although it's definitely worth checking each of your ingresses to ensure you can successfully connect to their respective services.

AlanGreene commented 3 years ago

Closing as this doesn't seem to be related to the Dashboard itself or the walkthrough, but is more of a config/environment issue with a custom setup. If you've managed to resolve your issue and think it might be helpful to others, a PR to update the walkthrough with a short note describing the problem/fix would be welcome.