If there's help from people who have access to the current cluster, I can help with moving it and actually own the ticket. /assign
Adding scalability folks. /cc @wojtek-t /cc @mm4tt
I'm not sure about the k8s-mungegithub cluster being maintained or not, but we definitely maintain and use perf-dash on a regular basis.
We do have access to the current cluster and can help with providing all the info you need. The only requirement we have is for the sig-scalability folks to have the ability to view perf-dash logs and be able to deploy new versions of perf-dash whenever we need it.
@mm4tt that would be greatly appreciated! ideally we can move things over to a CNCF cluster where the sig-scale community at large can be granted access as needed and we can spin down the google.com cluster/project.
Sure thing.
It looks like deploying perf-dash is as simple as deploying this deployment and service: https://github.com/kubernetes/perf-tests/blob/master/perfdash/deployment.yaml https://github.com/kubernetes/perf-tests/blob/master/perfdash/perfdash-service.yaml
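For the record, a minimal sketch of that deploy step, assuming kubectl already points at the target cluster and the namespace exists (commands are illustrative, not a tested runbook):

```shell
# Sketch: apply the two manifests linked above from a local checkout
# of kubernetes/perf-tests (repo path and kubectl context assumed).
git clone https://github.com/kubernetes/perf-tests
kubectl apply -f perf-tests/perfdash/deployment.yaml
kubectl apply -f perf-tests/perfdash/perfdash-service.yaml
```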
On top of that we have the perf-dash.k8s.io domain configured to point to the external IP address of the perf-dash service. I have no knowledge on how the domain is configured though.
Let me know if there is anything else you need from us.
The FQDN is managed through OctoDNS: https://github.com/kubernetes/k8s.io/blob/master/dns/zone-configs/k8s.io._0_base.yaml#L170.
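For anyone unfamiliar with that file, a new subdomain is just another record entry. A hedged sketch of what such an OctoDNS entry could look like (record name, TTL, and IP below are placeholders, not the real values):

```yaml
# Illustrative OctoDNS record for k8s.io._0_base.yaml; the address is
# a placeholder for the load balancer's static IP, not a real value.
perf-dash:
  type: A
  ttl: 300
  value: 203.0.113.10
```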
I'm gonna start some work in that area today. :-) If I need anything, I'll ping you, @mm4tt.
I have started moving perf-dash to aaa, but I hit a problem: aaa currently doesn't have enough resources to satisfy these requests: https://github.com/kubernetes/perf-tests/blob/master/perfdash/deployment.yaml#L39-L44. Once I confirm these resources are necessary to run the tool, I'll start a conversation about providing nodes that can accommodate them.
I discussed this with @mm4tt, and it looks like these are the necessary resources for the project; it's currently running on a single n1-highmem-2 node.
@dims @thockin, what is the current process for requesting a new node to be added to our current aaa pool?
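For context, the linked deployment lines are the pod's resource block. The values below are illustrative only (the linked deployment.yaml is authoritative); they show the rough shape of a request that won't schedule on a small default pool:

```yaml
# Illustrative shape of the perfdash resource block; the exact numbers
# live in the linked deployment.yaml, not here.
resources:
  requests:
    cpu: "1"
    memory: 8Gi
  limits:
    memory: 8Gi
```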
/assign @dims @thockin
Also @BenTheElder, I'm digging a little into how our DNS works, trying to figure out how to add another subdomain pointing to aaa. Should it be added like gcsweb? I mean, create another ingress which will create a load balancer, and then point the subdomain to the IP of that load balancer? It looks a little unnecessary and maybe a bit too "static"?
/assign @BenTheElder
Wait - a DASHBOARD app needs 8 GB of memory?
Can someone explain why?
I am 100% in favor of moving this, but we need to understand the resource footprint.
@bartsmykla our DNS supports standard records, just checked into config here, we have CNAME, A, etc.
Should it be added like gcsweb? I mean create another ingress which will create loadbalancer and then point subdomain to the IP of that loadbalancer? It looks a little bit unnecessary and maybe to bit "static"?
I don't know how prescriptive we want to be about this exactly. Static Loadbalancer IPs have worked fine for our existing infra though. Why would we expect it to be more "dynamic"? (Especially note that the DNS pointing to that IP is a yaml config PR away so if we switch it to something else later no big deal...)
Yes, a static IP is totally fine. Prefer Ingress to Service type=LB because of certs
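To make the Ingress-over-LB preference concrete, a hedged sketch of what such an Ingress could look like (all names, the issuer annotation, and the service port are hypothetical; the real gcsweb setup may differ):

```yaml
# Hypothetical Ingress exposing perf-dash with TLS; resource names,
# namespace, cert issuer, and port are illustrative assumptions.
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: perfdash
  namespace: perfdash
  annotations:
    cert-manager.io/cluster-issuer: letsencrypt-prod
spec:
  tls:
  - hosts:
    - perf-dash.k8s.io
    secretName: perfdash-tls
  rules:
  - host: perf-dash.k8s.io
    http:
      paths:
      - path: /
        pathType: Prefix
        backend:
          service:
            name: perfdash
            port:
              number: 80
```

The upside over a bare Service of type=LoadBalancer is that certificate provisioning and renewal hang off the Ingress rather than being managed by hand.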
@BenTheElder @thockin got it, thank you for the info, and I'm gonna proceed with that approach! :-)
Also @mm4tt, can you give your opinion about why it needs 8GB of RAM?
/assign @mm4tt /unassign @BenTheElder
I know that the memory footprint is a bit unexpected, but that's because it stores all the data it serves in memory. And all the metrics that we have, across all jobs, sum up to a lot of data.
Thank you @wojtek-t for your voice here.
My suggestion would be to add the proper node to our cluster for now, and then discuss whether we can and/or should improve the application itself.
@thockin @dims wdyt?
+1 to adding an appropriate node to our cluster. If I remember right, we used Terraform to stand up this cluster, so the changes would have to be made there.
Thank you for your opinion, @dims. I'll wait for @thockin and then create a proper PR. :-)
Sorry to be a pain in the rear.
@wojtek-t what sort of traffic is this serving to justify being entirely in memory? If we added n1-standard-4 VMs, that's $30/mo each. We could add n1-highmem-2 for a bit less, but it may be less useful.
In other words, if that had to serve from disk, what bad things would happen?
I am fine to add a second pool of n1-standard-4, I just don't want to be wasteful
@thockin @wojtek-t I also think it could be a great opportunity to improve the memory footprint if there isn't enough benefit from keeping it all in memory. Let's move it to the new infra without modification for now, but let's start a discussion about improvements.
@wojtek-t I can help with doing some research and maybe improving it if you don't currently have the time/resources to do so.
@thockin @bartsmykla - I agree that it should be possible to visibly reduce resource usage, but I'm a bit reluctant to do this, because:
Those two reasons, in my opinion, justify not investing in optimizations at this point.
Sorry for not replying earlier, I was on sick leave last week.
Perf-dash uses about 1 GB of memory once it's initialized. The current usage:
kubectl top pods perfdash-7c8746dc-m8xn4
NAME                      CPU(cores)   MEMORY(bytes)
perfdash-7c8746dc-m8xn4   358m         1019Mi
But it requires >4 GB during the initialization phase. It used to be <4 GB, but it started crashlooping and we increased the limit to 8 GB in https://github.com/kubernetes/perf-tests/pull/825.
The reason perfdash uses so much memory during initialization is that it scrapes GCS buckets with CI/CD test results, looking for files it knows how to interpret. Technically, we could reduce the memory footprint of that stage by limiting the parallelism (instead of starting a new goroutine per test, we could have a fixed-size worker pool). But that would increase the init time, which is already long, about 20-30 min. Given that, it's not obvious whether such an "optimization" is a good idea.
In general, I second what Wojtek wrote. Long term we'd like to get rid of perf-dash and replace it with something like Mako. Short term, I don't think optimizing perf-dash to save $XX/month is worth spending a lot of time on, especially given that our work on speeding up the scale tests should soon yield savings in the tens of thousands of dollars per month.
@mm4tt @wojtek-t thank you guys for the comments! I agree that if there are plans to move to something else, it's not worth taking the time to improve it now. :-)
We are waiting for someone with permissions to run Terraform, and once the new pool is provisioned I should be able to move it to the new infra the same day. :-)
Hi everyone. I'm doing final tests before moving everything and have created an issue: https://github.com/kubernetes/k8s.io/issues/697. Once I know how to proceed, we can do the final test.
I have created a PR with the next steps in its description, so feel free to review it :-)
As a follow-up for people who don't follow PR #721: we have perf-dash-canary.k8s.io running, and there are some issues with people from the k8s-infra-rbac-perfdash@kubernetes.io group accessing the cluster. When that's solved, and we confirm the data in perf-dash-canary.k8s.io and perf-dash.k8s.io are equivalent, we'll point the perf-dash.k8s.io subdomain to the new cluster and will be able to consider this task done. :-)
/area cluster-infra
/unassign @dims
@thockin will work with @mm4tt to resolve why they're unable to access the perfdash namespace in the aaa cluster
/sig scalability
I think I found what was causing the issue with accessing the namespaces! [#758]
@mm4tt, as https://github.com/kubernetes/k8s.io/pull/770 is merged and I think @thockin has reconciled the groups, can you confirm you have access to the namespace now?
I confirm, I have access and everything seems to be working as it should. We can proceed with pointing perf-dash.k8s.io to the new cluster. Thanks!
My plan right now is:
- ~~Switch perf-dash.k8s.io to point to the aaa cluster + add the new subdomain to the perf-dash ingress~~
- ~~Remove the perf-dash-canary.k8s.io record~~
- ~~Change the service type from LoadBalancer to NodePort~~
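The service-type change in the last step can be done with a one-line patch once DNS no longer points at the old external IP. The service name and namespace below are assumptions, not confirmed values:

```shell
# Hypothetical: flip the service from LoadBalancer to NodePort
# (service name "perfdash" and namespace "perfdash" are assumed).
kubectl patch service perfdash -n perfdash \
  -p '{"spec": {"type": "NodePort"}}'
```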
I think it's the time for celebration as we managed to get all steps done! :-)
/close
@bartsmykla: Closing this issue.
🎉
Thanks, @bartsmykla !!!
http://perf-dash.k8s.io/ is running on the "k8s-mungegithub" cluster in an old Google-internal GCP project. This cluster is still using kube-lego and is not actively maintained as far as I can tell; we should move it to community-managed infra (or turn it down if nobody is going to maintain it).
cc @krzysied