kubernetes / k8s.io

Code and configuration to manage Kubernetes project infrastructure, including various *.k8s.io sites
https://git.k8s.io/community/sig-k8s-infra
Apache License 2.0

Move perf-dash to community owned infra #549

Closed · BenTheElder closed this issue 4 years ago

BenTheElder commented 4 years ago

http://perf-dash.k8s.io/ is running on the "k8s-mungegithub" cluster in an old Google-internal GCP project. That cluster is still using kube-lego and, as far as I can tell, is not actively maintained. We should move perf-dash to community-managed infra (or turn it down if nobody is going to maintain it).

cc @krzysied

bartsmykla commented 4 years ago

If people who have access to the current cluster can help, I'm happy to do the move and actually own the ticket.

/assign

krzysied commented 4 years ago

Adding scalability folks. /cc @wojtek-t /cc @mm4tt

mm4tt commented 4 years ago

I'm not sure whether the k8s-mungegithub cluster is maintained or not, but we definitely maintain and use perf-dash on a regular basis.

We do have access to the current cluster and can help with providing all the info you need. The only requirement we have is for the sig-scalability folks to have the ability to view perf-dash logs and be able to deploy new versions of perf-dash whenever we need it.

BenTheElder commented 4 years ago

@mm4tt that would be greatly appreciated! Ideally we can move things over to a CNCF cluster where the SIG Scalability community at large can be granted access as needed, and we can spin down the google.com cluster/project.

mm4tt commented 4 years ago

Sure thing.

It looks like deploying perf-dash is as simple as deploying this deployment and service:
https://github.com/kubernetes/perf-tests/blob/master/perfdash/deployment.yaml
https://github.com/kubernetes/perf-tests/blob/master/perfdash/perfdash-service.yaml
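
For readers without access to those files, here is a rough sketch of their general shape (the image, labels, and ports below are placeholders, not the actual manifest contents; the real resource requests come up later in this thread):

```yaml
# Rough sketch only; see the linked files in kubernetes/perf-tests for the
# real manifests. Image, labels, and ports here are placeholders.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: perfdash
spec:
  replicas: 1
  selector:
    matchLabels:
      app: perfdash
  template:
    metadata:
      labels:
        app: perfdash
    spec:
      containers:
        - name: perfdash
          image: gcr.io/example/perfdash:latest   # placeholder image
          ports:
            - containerPort: 8080                 # placeholder port
---
apiVersion: v1
kind: Service
metadata:
  name: perfdash
spec:
  type: LoadBalancer   # exposes an external IP that the DNS record points at
  selector:
    app: perfdash
  ports:
    - port: 80
      targetPort: 8080
```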

On top of that, we have the perf-dash.k8s.io domain configured to point to the external IP address of the perf-dash service. I don't know how the domain itself is configured, though.

Let me know if there is anything else you need from us.

ameukam commented 4 years ago

The FQDN is managed through OctoDNS: https://github.com/kubernetes/k8s.io/blob/master/dns/zone-configs/k8s.io._0_base.yaml#L170.
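
For illustration, an OctoDNS record in that zone config looks roughly like this (the IP below is a documentation placeholder, not the real perf-dash address):

```yaml
# Illustrative entry in dns/zone-configs/k8s.io._0_base.yaml; the IP is a
# placeholder (TEST-NET), not the actual perf-dash load balancer address.
perf-dash:
  type: A
  ttl: 300
  value: 203.0.113.10
```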

bartsmykla commented 4 years ago

I'm going to start some work in that area today. :-) If I need anything I'll ping you, @mm4tt.

bartsmykla commented 4 years ago

I have started moving perf-dash to aaa, but I ran into a problem: aaa currently doesn't have enough resources to satisfy these requests: https://github.com/kubernetes/perf-tests/blob/master/perfdash/deployment.yaml#L39-L44. Once I confirm they are really necessary to run the tool, I'll start a conversation about adding nodes that can provide enough resources.
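
For context, the linked lines are the container's resource requests/limits. A hedged sketch of their shape, using the 8 GB figure discussed below (the CPU value is a placeholder):

```yaml
# Fragment of the perfdash container spec; exact values live in
# perf-tests/perfdash/deployment.yaml. The CPU value here is a placeholder.
resources:
  requests:
    cpu: "2"
    memory: 8Gi   # the whole dataset is held in memory; see the discussion below
  limits:
    memory: 8Gi
```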

bartsmykla commented 4 years ago

I discussed it with @mm4tt, and it looks like these resources are necessary for the project; it's currently running on a single n1-highmem-2 node. @dims @thockin, what is the current process for requesting a new node to be added to our current aaa pool?

/assign @dims @thockin

bartsmykla commented 4 years ago

Also, @BenTheElder, I'm digging into how our DNS works, trying to figure out how to add another subdomain that will point to aaa. Should it be added like gcsweb? I mean, create another Ingress (which will create a load balancer) and then point the subdomain at that load balancer's IP? It feels a little unnecessary and maybe a bit too "static"?

/assign @BenTheElder

thockin commented 4 years ago

Wait - a DASHBOARD app needs 8 GB of memory?

Can someone explain why?

thockin commented 4 years ago

I am 100% in favor of moving this, but we need to understand the resource footprint.

BenTheElder commented 4 years ago

@bartsmykla our DNS supports standard records, checked into config right here in this repo; we have CNAME, A, etc.

> Should it be added like gcsweb? I mean, create another Ingress (which will create a load balancer) and then point the subdomain at that load balancer's IP? It feels a little unnecessary and maybe a bit too "static"?

I don't know exactly how prescriptive we want to be about this. Static load balancer IPs have worked fine for our existing infra, though. Why would we expect it to be more "dynamic"? (Note especially that the DNS pointing to that IP is a YAML config PR away, so if we switch to something else later it's no big deal...)

thockin commented 4 years ago

Yes, a static IP is totally fine. Prefer Ingress to Service type=LoadBalancer because of certs.
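
For illustration, one common GKE pattern that matches this advice looks like the sketch below. The resource names, the reserved static-IP name, and the managed-certificate setup are assumptions for the example, not the actual perf-dash configuration:

```yaml
# Illustrative only: front perfdash with an Ingress on a pre-reserved global
# static IP and a Google-managed certificate. Names are placeholders.
apiVersion: networking.gke.io/v1
kind: ManagedCertificate
metadata:
  name: perfdash-cert
spec:
  domains:
    - perf-dash.k8s.io
---
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: perfdash
  annotations:
    kubernetes.io/ingress.global-static-ip-name: perfdash-ip   # reserved ahead of time
    networking.gke.io/managed-certificates: perfdash-cert
spec:
  defaultBackend:
    service:
      name: perfdash
      port:
        number: 80
```

The static IP behind that Ingress is then what the OctoDNS A record above would point at.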

bartsmykla commented 4 years ago

@BenTheElder @thockin got it, thank you for the info, I'm going to proceed with that approach! :-)

bartsmykla commented 4 years ago

Also @mm4tt, can you give your opinion about why it needs 8GB of RAM?

/assign @mm4tt
/unassign @BenTheElder

wojtek-t commented 4 years ago

I know that the memory footprint is a bit unexpected, but that's because it stores in memory all the data that it serves. So basically, how it works is:

And all the metrics that we have, across all jobs, sum up to a lot of data.

bartsmykla commented 4 years ago

Thank you for weighing in here, @wojtek-t.

My suggestion would be to add the appropriate node to our cluster for now, and then move on to discussing whether we can and/or should improve the application itself.

@thockin @dims wdyt?

dims commented 4 years ago

+1 to adding an appropriate node to our cluster. If I remember right, we used Terraform to stand up this cluster, so the changes would have to be made there.

bartsmykla commented 4 years ago

Thank you for the opinion, @dims. I'll wait for @thockin and then create a proper PR. :-)

thockin commented 4 years ago

Sorry to be a pain in the rear.

@wojtek-t what sort of traffic is this serving to justify keeping everything in memory? If we added n1-standard-4 VMs, that's $30/mo each. We could add n1-highmem-2 for a bit less, but it may be less useful.

In other words, if it had to serve from disk, what bad things would happen?

thockin commented 4 years ago

I am fine with adding a second pool of n1-standard-4; I just don't want to be wasteful.

thockin commented 4 years ago

https://github.com/kubernetes/k8s.io/pull/659

bartsmykla commented 4 years ago

@thockin @wojtek-t I also think this could be a great opportunity to improve the memory footprint, if there isn't enough benefit from keeping everything in memory all the time. Let's move it to the new infra without modification for now, but let's start a discussion about improvements.

@wojtek-t I can help with some research and maybe with improving it, if you don't currently have the time/resources to do so.

wojtek-t commented 4 years ago

@thockin @bartsmykla - I agree that it should be possible to visibly reduce resource usage, but I'm a bit reluctant to do this, because:

In my opinion, those two reasons justify not investing in optimizations at this point.

mm4tt commented 4 years ago

Sorry for not replying earlier; I was on sick leave last week.

Perf-dash uses about 1 GB of memory once it's initialized. The current usage:

kubectl top pods perfdash-7c8746dc-m8xn4
NAME                      CPU(cores)   MEMORY(bytes)
perfdash-7c8746dc-m8xn4   358m         1019Mi

But it requires >4 GB during the initialization phase. It used to be <4 GB, but it started crashlooping, so we increased the limit to 8 GB in https://github.com/kubernetes/perf-tests/pull/825.

The reason perfdash uses so much memory during initialization is that it scrapes GCS buckets containing CI/CD test results, looking for files that it knows how to interpret. Technically, we could reduce the memory footprint of that stage by limiting the parallelism (instead of starting a new goroutine per test, we could use a fixed-size worker pool). But that would increase the init time, which is already long (about 20-30 min). Given that, it's not obvious whether such an "optimization" is a good idea.

In general, I second what Wojtek wrote. Long term, we'd like to get rid of perf-dash and replace it with something like Mako. Short term, I don't think optimizing perf-dash to save $XX/month is worth a lot of time, especially given that our work on speeding up the scale tests should soon yield savings of tens of thousands of dollars per month.

bartsmykla commented 4 years ago

@mm4tt @wojtek-t thank you both for the comments! I agree that if there are plans to move to something else, it's not worth spending time improving it now. :-)

We are waiting for someone with permissions to run Terraform; once the new pool is provisioned, I should be able to move it to the new infra the same day. :-)

bartsmykla commented 4 years ago

Hi everyone. I'm doing final tests before moving everything and have created an issue: https://github.com/kubernetes/k8s.io/issues/697. Once I know how to proceed, we can do the final test.

bartsmykla commented 4 years ago

I have created a PR with the next steps in its description, so feel free to review it. :-)

bartsmykla commented 4 years ago

As a follow-up for people who aren't following PR #721: we have perf-dash-canary.k8s.io running, but there are some issues with people from the k8s-infra-rbac-perfdash@kubernetes.io group accessing the cluster. Once that is solved and we confirm the data in perf-dash-canary.k8s.io and perf-dash.k8s.io are equivalent, we'll point the perf-dash.k8s.io subdomain at the new cluster and will be able to consider this task done. :-)
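
For readers unfamiliar with how that group-based access is typically granted, here is a rough RBAC sketch. The binding name and role are assumptions for illustration; the real configuration for the aaa cluster is managed in this repo and may differ:

```yaml
# Illustrative only: grant the Google Group access to the perfdash namespace.
# The binding name and the role ("edit") are placeholders.
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: perfdash-maintainers
  namespace: perfdash
subjects:
  - kind: Group
    apiGroup: rbac.authorization.k8s.io
    name: k8s-infra-rbac-perfdash@kubernetes.io
roleRef:
  kind: ClusterRole
  name: edit
  apiGroup: rbac.authorization.k8s.io
```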

spiffxp commented 4 years ago

/area cluster-infra

spiffxp commented 4 years ago

/unassign @dims @thockin

Will work with @mm4tt to resolve why they're unable to access the perfdash namespace in the aaa cluster.

spiffxp commented 4 years ago

/sig scalability

bartsmykla commented 4 years ago

I think I found what was causing the issue with accessing the namespaces! [#758]

bartsmykla commented 4 years ago

@mm4tt now that https://github.com/kubernetes/k8s.io/pull/770 is merged and, I believe, @thockin has reconciled the groups, can you confirm that you have access to the namespace now?

mm4tt commented 4 years ago

I confirm, I have access and everything seems to be working as it should. We can proceed with pointing perf-dash.k8s.io to the new cluster. Thanks!

bartsmykla commented 4 years ago

My plan right now is:

  1. ~Switching perf-dash.k8s.io to point to the aaa cluster + adding the new subdomain to the perf-dash Ingress~
    • ~#772~
    • ~#773~
  2. ~Waiting for confirmation that everything works fine after the change~
  3. ~Removing the no-longer-needed perf-dash-canary.k8s.io record~
    • ~#776~
  4. ~Creating a PR in the perf-tests repository to change the Service type from LoadBalancer to NodePort (see the sketch after this list)~
    • ~kubernetes/perf-tests#1184~
  5. ~Removing the code that was doing this manually from the deployment instructions in our repository~
    • ~#778~
  6. ~Celebrate a small success 🎉~
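
For context on step 4: once the Ingress fronts perf-dash, the Service no longer needs its own external load balancer. A minimal sketch of that change (the selector and ports are assumptions, not the exact perf-tests manifest):

```yaml
# Illustrative sketch: expose perfdash as a NodePort Service so the Ingress can
# route to it, instead of provisioning a separate external LoadBalancer.
apiVersion: v1
kind: Service
metadata:
  name: perfdash
spec:
  type: NodePort        # was: LoadBalancer
  selector:
    app: perfdash       # placeholder label
  ports:
    - port: 80
      targetPort: 8080  # placeholder container port
```
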
bartsmykla commented 4 years ago

I think it's time for a celebration, as we managed to get all the steps done! :-)

/close

k8s-ci-robot commented 4 years ago

@bartsmykla: Closing this issue.

In response to [this](https://github.com/kubernetes/k8s.io/issues/549#issuecomment-617645525):

> I think it's time for a celebration, as we managed to get all the steps done! :-)
>
> /close

Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes/test-infra](https://github.com/kubernetes/test-infra/issues/new?title=Prow%20issue:) repository.

BenTheElder commented 4 years ago

🎉

mm4tt commented 4 years ago

Thanks, @bartsmykla!!!