unable to see System Logs

jason-berk-k1x commented 2 months ago

Please provide us with the following information:

This issue is a: (mark with an x)

[x] bug report -> please search issues before submitting
[ ] documentation issue or request
[ ] regression (a behavior that used to work and stopped in a new release)

Issue description

Where are my System Logs? I have run hundreds of ACA Jobs over the last few hours and a number of them have failed. When I go to look at the system logs I see this:

Screenshot 2024-04-04 at 4 55 02 PM

Steps to reproduce

..
..

Expected behavior [What you expected to happen.]

I can see my system logs

Actual behavior [What actually happened.]

I can't see my system logs

Screenshots

Screenshot 2024-04-04 at 4 55 02 PM

Screenshot 2024-04-04 at 5 17 26 PM

Additional context

this issue shows up both on the job execution pages in the portal and when querying the Log Analytics Workspace (LAW) directly

jason-berk-k1x commented 2 months ago

just for context, my job ran multiple times today 4/5/2024 and I see the system logs for all those runs. I still can't see system logs for 4/4/2024

also wanted to mention my production jobs are running in US East, in case that has something to do with it

lihaMSFT commented 2 months ago

Hello Jason,

I looked in our logs. Since you are using dedicated, the log processor was restarting between 2024-04-04T04:44:00Z and 2024-04-04T04:56:00Z due to high load, so some logs were lost.

Your container worker was constantly OOMKilled with exit code 137 at that time.
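For readers unfamiliar with the number: exit code 137 follows the common POSIX convention of 128 plus the signal number, and signal 9 is SIGKILL, which is what the kernel OOM killer sends. A minimal sketch of decoding such codes is below. The helper name `describe_exit_code` is hypothetical and not part of any Azure tooling, and the signal names assume a POSIX system.

```python
import signal


def describe_exit_code(code: int) -> str:
    """Explain a container exit code using the 128 + N signal convention."""
    if code == 0:
        return "exited normally"
    if code > 128:
        signum = code - 128
        try:
            # Resolves to e.g. SIGKILL for 9 on POSIX systems.
            name = signal.Signals(signum).name
        except ValueError:
            # Signal number not defined on this platform.
            name = f"signal {signum}"
        return f"terminated by {name} (128 + {signum})"
    return f"exited with application error code {code}"


if __name__ == "__main__":
    print(describe_exit_code(137))
```

A 137 on its own only says the process received SIGKILL. The OOMKilled reason reported alongside it is what ties the kill to memory pressure.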

lihaMSFT commented 2 months ago

@jason-berk-k1x you could try using a VM with more memory ~~or you can use V2 consumption-based jobs~~.

Edit: Please ignore the part about v2.

jason-berk-k1x commented 2 months ago

@lihaMSFT I'm confused.

> you could try using a VM with more memory

I don't think I can...I'm using a consumption only environment.

> or you can use V2 consumption-based jobs

My container app environment (CAE) is Environment type: Consumption only. This CAE only has my single job (for now).

I'm not surprised that the worker container had died from an OOM.....but I still don't understand where my logs went. How are those logs not in the Log Analytics Workspace? I suspect I would have been able to figure out the issue was OOM had I been able to see the system logs.

is it expected that when a container dies during a job, you might not get any system logs?

lihaMSFT commented 2 months ago

Hi Jason, your environment used too much memory running the reader-prod job, and our internal log-processor pod unfortunately crashed and lost logs. (This is pretty rare and happens to 0.03% of all environments.) You can see that in the "Diagnose and solve problems" blade in your Container App Environment.

My suggestion is to create a new environment that uses a memory-optimized workload profile.

jason-berk-k1x commented 2 months ago

@lihaMSFT

where is our disconnect?

Screenshot 2024-04-17 at 9 37 10 AM

Note, this detector does not work with Container App Job pods and replicas.

how does your suggestion help me WRT identifying the issue?

lihaMSFT commented 2 months ago

@jason-berk-k1x we are rolling out some new features for jobs. Sorry, you won't be able to diagnose this issue for now. We are rolling out detailed execution statuses by container so you can tell which container ran out of memory. Memory metrics for jobs are also being worked on, which would be useful in this scenario.

jason-berk-k1x commented 2 months ago

> @jason-berk-k1x we are rolling out some new features for jobs. Sorry, you won't be able to diagnose this issue for now. We are rolling out detailed execution statuses by container so you can tell which container ran out of memory. Memory metrics for jobs are also being worked on, which would be useful in this scenario.

is there an issue someplace I can/should be tracking?

lihaMSFT commented 2 months ago

@jason-berk-k1x Here's the issue for metrics: #1027

jason-berk-k1x commented 1 month ago

it's happening again right now..... my ContainerAppSystemLogs_CL table is completely empty for all my failed jobs.....

I have a number of failed jobs and absolutely no insight as to why they failed!!!!

Screenshot 2024-05-13 at 4 38 14 PM
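
One way to narrow down when the system logs stopped arriving is to count rows in `ContainerAppSystemLogs_CL` per time bucket and look for missing buckets. The sketch below only builds the KQL text, which can then be pasted into the Log Analytics query editor. It relies on nothing beyond the table name mentioned above and the standard `TimeGenerated` column. The function name `build_gap_query`, the 5-minute bucket, and the example window are assumptions for illustration.

```python
from datetime import datetime, timezone


def build_gap_query(start: datetime, end: datetime, bucket: str = "5m") -> str:
    """Return a KQL query that counts system log rows per time bucket."""
    fmt = "%Y-%m-%dT%H:%M:%SZ"
    # Normalize both bounds to UTC so the datetime() literals are unambiguous.
    start_utc = start.astimezone(timezone.utc).strftime(fmt)
    end_utc = end.astimezone(timezone.utc).strftime(fmt)
    return (
        "ContainerAppSystemLogs_CL\n"
        f"| where TimeGenerated between (datetime({start_utc}) .. datetime({end_utc}))\n"
        f"| summarize Rows = count() by bin(TimeGenerated, {bucket})\n"
        "| order by TimeGenerated asc"
    )


if __name__ == "__main__":
    # Example window: the day earlier in this thread with missing logs.
    query = build_gap_query(
        datetime(2024, 4, 4, tzinfo=timezone.utc),
        datetime(2024, 4, 5, tzinfo=timezone.utc),
    )
    print(query)
```

`summarize` emits no row for an empty bucket, so a gap appears as missing timestamps in the result rather than as zero counts.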

microsoft / azure-container-apps