microsoft / azure-container-apps

Roadmap and issues for Azure Container Apps
MIT License

Container App revision stuck in provisioning state #477

Closed Masahigo closed 2 years ago

Masahigo commented 2 years ago

Issue description

I have a simple NodeJS background worker app that should always be "on". It is written in TypeScript and runs as a Node process.

Here's the source file

index.ts

import dotenv from 'dotenv';
import http from 'http';
import Parser from 'rss-parser';
import type { CreateBlogAttrs } from '../../../common/src/services/types';
import { saveBlogs } from './queries/blogs';
import { DateTime } from "luxon";
import nodeSchedule from 'node-schedule';

dotenv.config();

const hostname = 'localhost';
const port = 3002;

const server = http.createServer((req, res) => {
  res.statusCode = 200;
  res.setHeader('Content-Type', 'text/plain');
  res.end('Background worker\n');
});

server.listen(port, hostname, () => {
  console.log(`Server running at http://${hostname}:${port}/`);
});

const parser = new Parser({
    customFields: {
      item: [
        ['content:encoded', 'contentEncoded'],
      ]
    }
  });

const SaveBlogs = async () => {

  const feed = await parser.parseURL(xxx);

  const blogAttrs: CreateBlogAttrs[] = [];

  // parseItem is synchronous, so the callback does not need to be async
  feed.items.forEach(item => {
    const attrs = parseItem(item);
    blogAttrs.push(attrs);
  });

  console.log('Saving items..');
  await saveBlogs(blogAttrs);
};

const parseItem = (item): CreateBlogAttrs => {
  // match() returns null when the content has no <img> tags, so fall back to an empty list
  const imgSources = (item.contentEncoded.match(/<img [^>]*src="[^"]*"[^>]*>/gm) ?? [])
    .map(x => x.replace(/.*src="([^"]*)".*/, '$1'));
  const imageUrl = imgSources[0];

  return {
    title: item.title,
    author: item.creator,
    publishDate: item.pubDate,
    imageUrl: imageUrl,
    guidUrl: item.guid,
    tags: item.categories
  };
}

const job: nodeSchedule.Job = nodeSchedule.scheduleJob(process.env.CRON_JOB_SCHEDULE, async () => {
  try {
    console.log(`-- Scheduled job begins at ${DateTime.now().toISOTime()} --`);
    await SaveBlogs();
  } catch (error) {
    console.log("Error occurred while saving the blogs - ", error);
  } finally {
    console.log(`-- Scheduled job ends at ${DateTime.now().toISOTime()} --`);
  }
});

The app is packaged as a container image.

When I deploy it to ACA as an internal service, the revision is stuck in the provisioning state and never becomes fully active. The scheduled job seems to execute OK, but the service never reaches a healthy state. Azure Container Registry is used as the registry, and a user-assigned managed identity is used to pull images from it.

Here are the main commands related to the deployment


# Create resources for ACA environment and Log Analytics workspace

az containerapp env create \
  --name "aca-mydemoapp-test"  \
  --resource-group $RESOURCE_GROUP \
  --logs-workspace-id $LAW_ID \
  --logs-workspace-key $LAW_KEY \
  --location "$LOCATION"

az identity create \
  --name "identity-test" \
  --resource-group $RESOURCE_GROUP

IDENTITY_ID=$(az identity show --name "identity-test" --resource-group $RESOURCE_GROUP --query principalId -o tsv)

az role assignment create \
    --assignee-object-id $IDENTITY_ID \
    --assignee-principal-type ServicePrincipal \
    --role AcrPull \
    --scope <id-of-acr-instance>

IDENTITY_RESOURCE_ID=`az identity show \
  --name "identity-test" \
  --resource-group $RESOURCE_GROUP \
  --query id -o tsv`

az containerapp create \
  --name "aca-mydemoapp-worker" \
  --resource-group $RESOURCE_GROUP \
  --environment $ENVIRONMENT \
  --image $ACR_NAME.azurecr.io/background-worker:latest \
  --target-port 3002 \
  --ingress 'internal' \
  --cpu 0.25 \
  --memory 0.5Gi \
  --min-replicas 1 \
  --max-replicas 4 \
  --user-assigned $IDENTITY_RESOURCE_ID \
  --registry-identity $IDENTITY_RESOURCE_ID \
  --secrets redispwd=$REDIS_PWD \
  --env-vars CRON_JOB_SCHEDULE="*/5 * * * *" \
  --registry-server $ACR_NAME.azurecr.io

Steps to reproduce

  1. Deploy the service to a vanilla ACA environment (see the commands above)
  2. Navigate to Container App's initial revision via Azure portal (Application > Revision management)

Expected behavior

The service should run without issues, as it does in the local dev environment.

Actual behavior

From system logs I'm able to see the following error

"Startup probe failed: dial tcp 10.250.0.110:3002: connect: connection refused"

But it's confusing as there is no startup probe defined for the service.

Also the following errors are present in system logs

"Back-off restarting failed container"
"ScaledObject doesn't have correct scaleTargetRef specification"

Screenshots

[screenshot of the Azure portal view]

Additional context

In the Azure portal view as shown in the screenshot above.

When run in a local environment, the root URL of the service responds with HTTP 200:


% curl -I http://localhost:3002
HTTP/1.1 200 OK
Content-Type: text/plain
Date: Thu, 03 Nov 2022 14:09:22 GMT
Connection: keep-alive
Keep-Alive: timeout=5

% curl http://localhost:3002   
Background worker
kendallroden commented 2 years ago

Regarding the startup probe errors: there is a default health probe for apps so that we as a service can ensure your app is up and running before we forward traffic to it. That is the probe you see failing. You can override the default probe by defining your own health probe.
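
To make the failing check concrete: the "dial tcp ... connect: connection refused" wording in the log suggests the probe attempts a plain TCP connection to the target port. The TypeScript sketch below mimics that kind of check so it can be run from any machine that can reach the app. It is an illustration only, not the platform's actual probe implementation; `tcpProbe` and its default timeout are hypothetical names and values chosen here.

```typescript
import * as net from 'net';

// Hypothetical helper: resolves true if a TCP connection to host:port succeeds
// within timeoutMs, and false on refusal, any other socket error, or timeout.
function tcpProbe(host: string, port: number, timeoutMs = 1000): Promise<boolean> {
  return new Promise((resolve) => {
    const socket = net.connect({ host, port });

    // Close the socket and report the outcome; extra calls to resolve are ignored.
    const finish = (ok: boolean): void => {
      socket.destroy();
      resolve(ok);
    };

    socket.setTimeout(timeoutMs);
    socket.once('connect', () => finish(true));
    socket.once('timeout', () => finish(false));
    socket.on('error', () => finish(false));
  });
}

// Example call (not executed here):
// tcpProbe('127.0.0.1', 3002).then((ok) =>
//   console.log(ok ? 'port accepts connections' : 'refused or timed out'));
```

A `false` result against the address shown in the log would correspond to the "connection refused" message above.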

As an aside, we don't support the concept of "Jobs" natively in ACA today. Apps are considered always running, so I have a feeling this could be related to your use of cron within the app itself, and I am unsure whether this workload will run properly. Jobs are currently being designed and will support triggering via cron.

It's hard to troubleshoot this without access to your env. Do you mind opening a support ticket so we can properly address this? Otherwise, can you send your env details and this GitHub issue to acasupport@microsoft.com?

Masahigo commented 2 years ago

Can you explain how to override the default health probe via the Azure CLI? I couldn't find an example here.

I noticed the NodeJS HTTP server implementation had some race condition issues; I was able to reproduce the same ".. connection refused" error locally.

kendallroden commented 2 years ago

Via the CLI you can pass in a YAML file to override the default: https://learn.microsoft.com/en-us/azure/container-apps/health-probes?tabs=yaml#http-probes. It is not required to override the probes; I mainly wanted to point out that this is where the error message is coming from :)
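
If you do define your own HTTP probe, the app needs a path for the probe to call. Below is a minimal TypeScript sketch of such an endpoint. The `/healthz` path, the `ready` flag, and all function names are assumptions made for illustration; none of them come from this thread, and the path has to match whatever the probe definition in your YAML file uses.

```typescript
import * as http from 'http';

// Hypothetical probe path; use the path configured in your own probe definition.
const HEALTH_PATH = '/healthz';

// Flipped to true once startup work (for example, registering the scheduled job) is done.
let ready = false;

interface ProbeReply {
  status: number;
  body: string;
}

// Pure routing function so the probe behaviour can be checked without a socket.
function replyFor(method: string | undefined, url: string | undefined, isReady: boolean): ProbeReply {
  if (method === 'GET' && url === HEALTH_PATH) {
    return isReady
      ? { status: 200, body: 'ok\n' }
      : { status: 503, body: 'starting\n' };
  }
  // Every other request keeps the behaviour of the original server.
  return { status: 200, body: 'Background worker\n' };
}

function markReady(): void {
  ready = true;
}

function createProbeAwareServer(): http.Server {
  return http.createServer((req, res) => {
    const { status, body } = replyFor(req.method, req.url, ready);
    res.writeHead(status, { 'Content-Type': 'text/plain' });
    res.end(body);
  });
}
```

Returning 503 until `markReady()` has been called lets an HTTP probe distinguish "process started" from "worker initialised", which a bare port check cannot do.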

Masahigo commented 2 years ago

The race condition issue I mentioned was fixed by the following change: adding return so that the HTTP connections from the startup probe are closed.

const server = http.createServer((req, res) => {
  res.statusCode = 200;
  res.setHeader('Content-Type', 'text/plain');
  return res.end('Background worker\n');
});

But the service itself is still stuck in the provisioning state, at least in the Revision management UI view. It seems the way I've implemented the cron job within the service is the root cause.

Nevertheless, the service is working as intended, so I'm closing this issue for now.
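
A note for readers who hit the same symptom. One possible cause, not confirmed by anyone in this thread, is the bind address: the original index.ts listens on `localhost`, which accepts connections only on the loopback interface, while the probe in the log dials the replica's own IP (10.250.0.110:3002). A server bound only to loopback refuses such a connection. The sketch below binds to all interfaces instead; the `HOST` and `PORT` environment variables and the function names are hypothetical and chosen only for this example.

```typescript
import * as http from 'http';

type Env = Record<string, string | undefined>;

// Hypothetical helper: picks the bind address from the environment and
// defaults to all interfaces (0.0.0.0) on the port used in the issue.
function resolveBindAddress(env: Env): { host: string; port: number } {
  return {
    host: env.HOST ?? '0.0.0.0',
    port: Number(env.PORT ?? '3002'),
  };
}

// Creates the same plain-text server as in the issue, but listening on the
// resolved address instead of the hard-coded 'localhost'.
function startWorker(env: Env = process.env): http.Server {
  const { host, port } = resolveBindAddress(env);
  const server = http.createServer((_req, res) => {
    res.statusCode = 200;
    res.setHeader('Content-Type', 'text/plain');
    res.end('Background worker\n');
  });
  server.listen(port, host, () => {
    console.log(`Server listening on ${host}:${port}`);
  });
  return server;
}
```

Calling `startWorker()` from the entry file starts the server. Setting `HOST=127.0.0.1` restores loopback-only behaviour for local development.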