project-codeflare / multi-cluster-app-dispatcher

Holistic job manager on Kubernetes
Apache License 2.0

Reduce Logging noise #678

Open Bobbins228 opened 11 months ago

Bobbins228 commented 11 months ago

Issue link

Closes #612

What changes have been made

- Removed occurrences where an entire appwrapper is logged.
- Bumped large informational logs up to level 4 and kept warning logs level-independent.
- Added a new log that reports the additional resources required for an appwrapper to be dispatched.

Verification steps

Set the logging level for MCAD to be level 3 and check that level 3 related logs are accurate. Repeat with level 4 logs.

To verify the required resources log

Create an AppWrapper that requests a large amount of resources, e.g. 8 GPUs. You should receive a log like this:

I1102 16:51:22.195583  119917 queuejob_controller_ex.go:1030] [ScheduleNext] Appwrapper 'default/raytest' requires additional resources CPU: 0.000000, Memory: 0.000000, GPU: 8.000000
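For reviewers who want to reason about what this line reports, here is a minimal Go sketch. It assumes "additional resources" means the per-resource gap between what the AppWrapper requests and what is currently available, clamped at zero. The `resources` type, `shortfall`, `formatShortfall`, and the request/availability numbers are all hypothetical and are not taken from MCAD's code; only the message text mirrors the log above.

```go
package main

import "fmt"

// resources is a hypothetical, simplified resource triple (not MCAD's type).
type resources struct {
	CPU, Memory, GPU float64
}

// shortfall returns, per resource, how much the request exceeds what is
// available. A resource that already fits contributes 0, never a negative value.
func shortfall(requested, available resources) resources {
	gap := func(req, avail float64) float64 {
		if req > avail {
			return req - avail
		}
		return 0
	}
	return resources{
		CPU:    gap(requested.CPU, available.CPU),
		Memory: gap(requested.Memory, available.Memory),
		GPU:    gap(requested.GPU, available.GPU),
	}
}

// formatShortfall renders the message body in the same shape as the log line
// quoted above (the klog header with timestamp and file is omitted).
func formatShortfall(name string, r resources) string {
	return fmt.Sprintf(
		"[ScheduleNext] Appwrapper '%s' requires additional resources CPU: %f, Memory: %f, GPU: %f",
		name, r.CPU, r.Memory, r.GPU)
}

func main() {
	// Illustrative numbers only: CPU and memory fit, all 8 GPUs are missing.
	requested := resources{CPU: 2, Memory: 4, GPU: 8}
	available := resources{CPU: 4, Memory: 8, GPU: 0}
	fmt.Println(formatShortfall("default/raytest", shortfall(requested, available)))
}
```

Under these assumptions, a zero for CPU and Memory means those requests already fit, and only the GPU request blocks dispatch.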

Checks

Bobbins228 commented 11 months ago

@ChristianZaccaria I tested this out, and it seems that whatever the log level is set to, logs below that level are always printed as well. For example, if you set the log level to 6 you also get all logs below level 6.

I have opted to give warning logs such as "Failed to dispatch app wrapper due to insufficient resources" no log level, so they are always printed.

Noisier informational logs have been bumped to log level 4. For example, "available resource successful check for appwrapper" now has a higher log level, because we would know anyway if we did not have enough resources.

ChristianZaccaria commented 11 months ago

/retest

Seems the CI is acting up. I'm pretty sure it has nothing to do with your changes.

ChristianZaccaria commented 11 months ago

Hey Mark, could you try making a trivial change, e.g. editing or adding a comment, and pushing it, just to trigger the CI again and see if it works? The /retest option doesn't retrigger it for some reason.

Bobbins228 commented 11 months ago

/retest

dimakis commented 11 months ago

/retest

ChristianZaccaria commented 10 months ago

Hi @Bobbins228, I've run MCAD locally with the debugger, and when I attempt to bring up a cluster object (cluster.up), it initially says there are insufficient resources to dispatch the appwrapper. Then I get the Raw of a GenericTemplate, which is a long run of numbers, as in the image below:

[image]

Do you know what this could be? I'll run MCAD on the cluster rather than via the debugger, but I assume the result will be the same.

Bobbins228 commented 10 months ago

@ChristianZaccaria What log level do you have your MCAD set to?

ChristianZaccaria commented 10 months ago

> @ChristianZaccaria What log level do you have your MCAD set to?

On the debugger it's set to 15 by default. I thought that could be it, but here is an update on my previous comment: with this PR I don't see the raw GenericTemplate in the MCAD logs when it runs in the cluster. However, when I run MCAD (main branch) via the debugger, I don't get the raw GenericTemplate either. Something in this PR is causing the debugger to display it for some reason.

openshift-ci[bot] commented 10 months ago

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: ChristianZaccaria

Once this PR has been reviewed and has the lgtm label, please assign anishasthana for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

- **[OWNERS](https://github.com/project-codeflare/multi-cluster-app-dispatcher/blob/main/OWNERS)**

Approvers can indicate their approval by writing `/approve` in a comment.
Approvers can cancel approval by writing `/approve cancel` in a comment.