operate-first / support

This repo should serve as a central source for users to raise issues/questions/requests for Operate First.

NEW PROJECT: PrometheusAI #454

Closed Ofir-Shechtman closed 2 years ago

Ofir-Shechtman commented 2 years ago

Target cluster

No response

Team name

Technion_Library

Desired project names

PrometheusAI

Project description

We are a group of Technion students collaborating with Ilya Kolchinsky, PhD, from Red Hat (ikolchin@redhat.com) to enhance the capabilities of Prometheus by adding AI prediction capabilities to this monitoring system.

We'll need access to the Prometheus server on the Operate First clusters, as well as access to build our project on the Operate First system. We haven't chosen a cluster yet because we're not sure which one will fit our needs and goals.

Users needing access

moradnir, Ofir-Shechtman, guyelf, benlugasi

Namespace Quota

Small

Custom quota

No response

Your GPG key or contact

No response

HumairAK commented 2 years ago

Hello!

We'll need access to the Prometheus server

What sort of access are you looking for?

Ofir-Shechtman commented 2 years ago

Hi, our task is to implement a predictive component for Prometheus metrics. This component will incorporate a variety of well-known time series forecasting (TSF) algorithms, based on statistical methods, deep neural networks, or a combination of the two. It will receive as input a stream of Prometheus updates (i.e., files containing the last recorded values for the monitored metrics) and generate a stream of predicted future values for all involved metrics.
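As a rough illustration of the statistical end of that spectrum, here is a minimal sketch of simple exponential smoothing applied to one metric's recent values. The function name, the `alpha` and `horizon` parameters, and the sample values are hypothetical and are not taken from the project; the sketch only shows the input/output shape of such a component (observed values in, predicted values out).

```python
from typing import List


def exponential_smoothing_forecast(
    values: List[float], alpha: float = 0.5, horizon: int = 3
) -> List[float]:
    """Forecast `horizon` future points with simple exponential smoothing.

    Hypothetical helper for illustration only; not part of any Operate First
    or Prometheus API.
    """
    if not values:
        raise ValueError("need at least one observed value")
    if not 0.0 < alpha <= 1.0:
        raise ValueError("alpha must be in (0, 1]")

    # Start from the first observation and blend in each newer one.
    level = values[0]
    for value in values[1:]:
        level = alpha * value + (1.0 - alpha) * level

    # Simple exponential smoothing produces a flat forecast at the final level.
    return [level] * horizon


if __name__ == "__main__":
    # Made-up sample values standing in for one metric's recent history.
    print(exponential_smoothing_forecast([0.0, 2.0], alpha=0.5, horizon=2))
```

A real predictive component would likely need trend and seasonality handling, which this flat forecast deliberately leaves out.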

In order to complete our project, we will probably need access to all statistical data and metrics of all the applications monitored by Prometheus in operate-first.

Thanks for helping us

durandom commented 2 years ago

+1 from me. @4n4nd @Shreyanand do we have notebooks that access the Prometheus data in our environment? I don't think we need namespaces just yet :)

4n4nd commented 2 years ago

@Ofir-Shechtman if I recall correctly, @Shreyanand and our team have worked on a repo just for time-series metric prediction (here).

For access to cluster metrics, we will need to onboard you as a group/team and then give cluster metrics access to this team.

Could you please try to use the opfcli tool that we have here to create a group/team?

durandom commented 2 years ago

@4n4nd can you also point to some hitchhiker guide on how to use the toolbox to get opfcli installed?

Shreyanand commented 2 years ago

@Ofir-Shechtman The time series repository and book have learning content for applying analysis and forecasting to cloud metrics. Your group may also be interested in the Operate First Jupyterhub Analysis project, which collects and analyzes metrics corresponding to the infrastructure usage of Jupyterhub on Operate First. This notebook will be a starter for accessing such data from Prometheus (logs, metrics, and events). There are also notebooks corresponding to a resource allocation problem we defined based on CPU and memory usage time series. Feel free to play around with the notebooks, and I'll be happy to answer any questions you may have.

durandom commented 2 years ago

@Shreyanand are those notebooks available as a Notebook Image in our operate first jupyter hub?

Shreyanand commented 2 years ago

@durandom The time series project has it, and the instructions can be found here. The Opf Jupyterhub analysis project is relatively new, so we don't have an image yet; I'll add an issue in the repository.

benlugasi commented 2 years ago

Hey, we've installed 'opfcli' on my local Linux machine and created a group named PrometheusAI. Then we updated the 'group.yaml' file. I couldn't find any further explanation in the 'opfcli' repository. Is this enough? [image]

durandom commented 2 years ago

@4n4nd can point you to clearer docs or create them ;)

4n4nd commented 2 years ago

@Ofir-Shechtman while I work on the docs for this, you can make the following changes and create a PR for us to review it.

  1. All GitHub usernames should be in lowercase.
  2. You will need to add this group in this kustomization so that this group is available in all of our clusters.

benlugasi commented 2 years ago

Hey Anand, I've cloned the 'apps' repo and committed the changes you described to a new branch, but I can't push it. I tried both with and without an SSH key, and both attempts were denied. Can you give me (benlugasi) permissions for pushing into 'apps'?

Thanks, Ben

4n4nd commented 2 years ago

Can you give me (benlugasi) permissions for pushing into 'apps'?

@benlugasi we don't push directly to the apps repo. You will need to fork the https://github.com/operate-first/apps repo first.

  1. Click on the Fork button on the top right of this page.
  2. Once this creates a copy of the repo in your account, you can make changes to this new repo and push them.
  3. After pushing changes to your fork, you should be able to create a Pull Request to get your changes merged in the operate-first/apps repo (more instructions on how to do this).

benlugasi commented 2 years ago

Hey, I've created a PR for your review; you can find it here. Please tell us if there is anything else that needs to be done.

Thanks, Ben

moradnir commented 2 years ago

Hey, can you please provide us with a guide on:

  1. How to import data from Prometheus?
  2. How to run programs on Red Hat servers?

benlugasi commented 2 years ago

Hey, any update?

durandom commented 2 years ago

@benlugasi what exactly are you missing? I see you closed your PR.

4n4nd commented 2 years ago

@moradnir

  1. How to import data from Prometheus?

We have some docs available here for API access.

  2. How to run programs on Red Hat servers?

Can you explain what your workload is and how you are planning to deploy it? If you just need a namespace on one of our clusters, you can follow this guide. The smaug cluster should be the right one for you.

benlugasi commented 2 years ago

Hey @4n4nd, your docs were very helpful and really helped us make progress. However, we still couldn't access the servers. I saw that our group doesn't have a project yet, so I opened one and assigned it to the smaug cluster. Here is a PR for this project; I hope it will solve our access problems. Thanks, Ben

benlugasi commented 2 years ago

Now that we have an active project and group assigned to smaug, I'm trying to connect to Thanos using the 'operate-first' button and I'm getting a 403 Permission Denied error.

Can you tell me what I am doing wrong? Or maybe give us some examples to start with. Our first main task is to deploy a recurring job that collects data from Prometheus to a file. We already have a project and a group on operate-first/apps.
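To make the "collect data from Prometheus to a file" task concrete, here is a minimal sketch of the two pure steps such a job would need: building a range-query URL for the standard Prometheus HTTP API endpoint `/api/v1/query_range`, and flattening the JSON response into CSV rows. The base URL, the query, the CSV layout, and all function names are hypothetical. Fetching the data and scheduling the job (for example with cron or a Kubernetes CronJob) are left out, and authentication is assumed to be handled separately.

```python
import csv
import json
from typing import Dict, List
from urllib.parse import urlencode


def build_range_query_url(base_url: str, query: str, start: float, end: float, step: str) -> str:
    """Build a URL for the Prometheus HTTP API range-query endpoint."""
    params = urlencode({"query": query, "start": start, "end": end, "step": step})
    return f"{base_url.rstrip('/')}/api/v1/query_range?{params}"


def matrix_to_rows(payload: Dict) -> List[List[str]]:
    """Flatten a query_range response into [labels, timestamp, value] rows."""
    if payload.get("status") != "success":
        raise ValueError(f"query failed: {payload.get('error', 'unknown error')}")
    rows = []
    for series in payload["data"]["result"]:
        # Serialize the label set so each series is identifiable in the file.
        labels = json.dumps(series["metric"], sort_keys=True)
        for timestamp, value in series["values"]:
            rows.append([labels, str(timestamp), value])
    return rows


def write_rows(path: str, rows: List[List[str]]) -> None:
    """Write the flattened samples to a CSV file (hypothetical layout)."""
    with open(path, "w", newline="") as handle:
        writer = csv.writer(handle)
        writer.writerow(["labels", "timestamp", "value"])
        writer.writerows(rows)


if __name__ == "__main__":
    # Placeholder host; the real Thanos/Prometheus URL is not given in this thread.
    print(build_range_query_url("https://thanos.example", "up", 0, 60, "15s"))
```

An empty `result` list yields no rows, so a job built this way would write a header-only file rather than fail when a query matches nothing.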

benlugasi commented 2 years ago

@4n4nd I saw that this PR was merged. Should it work now? I still see this error on Thanos: [image]

4n4nd commented 2 years ago

@benlugasi the changes in that PR weren't applied yet, can you please try again?

benlugasi commented 2 years ago

@4n4nd Now it works, thanks! We'll try to follow the docs and get some data :)

guyelf commented 2 years ago

Hi Team,

We're using the templates provided in this repo, but the examples use the Prometheus demo server and we're trying to access the Thanos server (provided here). Our querying attempts via the provided Python syntax are being blocked by a 403 error caused by the authentication requirements of the Thanos server, which is protected by the OpenShift oauth-proxy (version 2.3.0).

None of our attempts to generate a valid authentication token were successful. Do you have any guides on how to access this server via Python, or directly through the HTTP API, with authentication?
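For reference, this is the general shape of an authenticated request in Python, as a minimal sketch using only the standard library. It assumes the proxy in front of Thanos accepts a bearer token in the `Authorization` header, which this thread does not confirm at this point; the URL and the token value are placeholders, and obtaining a valid token is not covered here.

```python
import urllib.request


def build_authenticated_request(url: str, token: str) -> urllib.request.Request:
    """Attach a bearer token to a request (assumes the proxy accepts this scheme)."""
    if not token:
        raise ValueError("an access token is required")
    return urllib.request.Request(url, headers={"Authorization": f"Bearer {token}"})


if __name__ == "__main__":
    # Placeholder URL and token; neither comes from this thread.
    request = build_authenticated_request(
        "https://thanos.example/api/v1/query?query=up", "PLACEHOLDER_TOKEN"
    )
    print(request.full_url)
    print(request.get_header("Authorization"))
    # Sending it would be: urllib.request.urlopen(request); a 403 raises HTTPError.
```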

4n4nd commented 2 years ago

@guyelf I added some instructions for programmatic access to thanos metrics here. Lmk if these instructions don't work for you.

benlugasi commented 2 years ago

We've managed to access Thanos from our code using your instructions, thanks! Unfortunately, all the query results we receive are empty, although last week we could see the data in the server UI.

Has something changed?

benlugasi commented 2 years ago

Hey @4n4nd, it has been a week and we still don't see anything on the Thanos server. Any idea?

guyelf commented 2 years ago

Hi Team,

Just to add more context here, it appears we now also have issues with the authentication server.

The authentication server is returning a 503 error, which prevents us from accessing it. This also prevents us from creating tokens or interacting with the Thanos server in general.

The error message is as follows: [image]

4n4nd commented 2 years ago

hey the auth issue should be resolved now. Can you please check again if you can query for data?

guyelf commented 2 years ago

Hi Team,

Me again. We're working with the auth server again, and it looks like it's down again:

[image]

Same issue as before. Can anyone help us recover the faulty server so we can continue our project? Thanks in advance.

Additionally, if you have any documentation on how to run jobs on the smaug cluster, it would be very helpful.

Best Regards,

durandom commented 2 years ago

@guyelf can you open a new issue for this problem with a description of how to reproduce it? I just see a Keycloak URL.

Also, is this issue here (#454) good to be closed?

HumairAK commented 2 years ago

Apologies, just got back from the holidays. Auth seems up, is this issue resolved?

benlugasi commented 2 years ago

Hey, auth seems to be OK now, and the issue is good to be closed. We would like to know whether there are any other communication channels where we can ask more questions in the future. Also, we still don't know how to run jobs on the smaug cluster. Can you share the relevant documentation with us?

durandom commented 2 years ago

Closing, since the namespace is up. Any further problems should go into new issues.