jason-berk-k1x opened 2 months ago
Just for context: my job ran multiple times today (4/5/2024) and I can see the system logs for all of those runs. I still can't see the system logs for 4/4/2024.
Also wanted to mention that my production jobs run in US East, in case that has something to do with it.
Hello Jason,
I looked in our logs. Since you are on dedicated, the log processor restarted between 2024-04-04T04:44:00Z and 2024-04-04T04:56:00Z due to high load, so some logs were lost. Your container `worker` was constantly OOMKilled with exit code 137 during that window.
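For readers unfamiliar with the convention: exit code 137 is 128 plus signal 9 (SIGKILL), which is what the kernel's OOM killer sends. A minimal sketch of that decoding (the general container exit-code convention, not anything ACA-specific):

```python
import signal

def decode_exit_code(code: int) -> str:
    """Decode a container exit code: values above 128 mean the
    process was terminated by signal (code - 128)."""
    if code > 128:
        sig = signal.Signals(code - 128)
        return f"killed by {sig.name}"
    return f"exited normally with status {code}"

# 137 = 128 + 9 (SIGKILL) -- the signature of an OOM kill.
print(decode_exit_code(137))  # → killed by SIGKILL
```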
@jason-berk-k1x you could try using a VM with more memory, or you can use V2 consumption-based jobs.
Edit: Please ignore the part about v2.
@lihaMSFT I'm confused.
> you could try using a VM with more memory

I don't think I can... I'm using a consumption-only environment.

> or you can use V2 consumption-based jobs

My container app environment (CAE) is `Environment type: Consumption only`. This CAE only has my single job (for now).
I'm not surprised that the `worker` container died from an OOM... but I still don't understand where my logs went. How are those logs not in the Log Analytics Workspace? I suspect I would have been able to figure out that the issue was OOM had I been able to see the system logs.
Is it expected that when a container dies during a job, you might not get any system logs?
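For anyone attempting the same diagnosis, a query along these lines is one way to check whether any system logs survived for a given window. This is a hedged sketch: it assumes the default `ContainerAppSystemLogs_CL` custom-table schema, and column names such as `JobName_s`, `Reason_s`, and `Log_s` are assumptions that may differ in your workspace, so verify them against your table's schema first.

```kusto
ContainerAppSystemLogs_CL
| where TimeGenerated between (datetime(2024-04-04) .. datetime(2024-04-05))
| where JobName_s == "reader-prod"          // assumed column name
| project TimeGenerated, Reason_s, Log_s    // assumed column names
| order by TimeGenerated asc
```

An empty result over a window where executions clearly ran would confirm the logs were dropped before ingestion, rather than being a query or portal issue.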
Hi Jason, your environment used too much memory running the `reader-prod` job, and our internal `log-processor` pod unfortunately crashed and lost the logs. (This is pretty rare; it happens to 0.03% of all environments.) You can see that in the "Diagnose and solve problems" blade in your Container App Environment:
My suggestion is to create a new environment that uses a memory-optimized workload profile.
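For reference, workload profiles are declared on the managed environment itself. A hedged Bicep sketch of what that might look like (the environment name, profile name, and region are illustrative; the E-series `workloadProfileType` values are the memory-optimized ones, but check which sizes are available in your region before relying on `E4`):

```bicep
resource env 'Microsoft.App/managedEnvironments@2023-05-01' = {
  name: 'my-environment'          // illustrative name
  location: 'eastus'
  properties: {
    workloadProfiles: [
      {
        name: 'memory-optimized'  // illustrative profile name
        workloadProfileType: 'E4' // E-series = memory optimized
        minimumCount: 0
        maximumCount: 3
      }
    ]
  }
}
```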
@lihaMSFT where is our disconnect?

> Note, this detector does not work with Container App Job pods and replicas.

How does your suggestion help me WRT identifying the issue?
@jason-berk-k1x we are rolling out some new features for jobs. Sorry you won't be able to diagnose this issue yet. We are rolling out detailed execution statuses per container, so you can tell which container went out of memory. Also, memory metrics for jobs are being worked on, which would be useful in this scenario.
Is there an issue someplace I can/should be tracking?
@jason-berk-k1x Here's the issue for metrics: #1027
It's happening again right now... my `ContainerAppSystemLogs_CL` table is completely empty for all my failed jobs. I have a number of failed jobs and absolutely no insight as to why they failed!
Issue description
Where are my system logs? I have run hundreds of ACA Jobs over the last few hours, and a number of them have failed. When I go to look at the system logs I see this:
Steps to reproduce
Expected behavior [What you expected to happen.]
I can see my system logs
Actual behavior [What actually happened.]
I can't see my system logs
Screenshots
Additional context
This issue exists in the portal on both the job execution pages and when querying the LAW directly.