nodejs / docker-node

Official Docker Image for Node.js :whale: :turtle: :rocket:
https://hub.docker.com/_/node/
MIT License

[BUG - CrashLoopBackOff] Node 16.15.0 -> 16.15.1 AWS Kubernetes pod startup #1737

Open rufreakde opened 2 years ago

rufreakde commented 2 years ago

Environment

Expected Behavior

Successful startup of the pod using this image as the base image. (The cached 16.15.0 worked without problems.)

Current Behavior

Pod crashes with a CrashLoopBackOff and no messages! We only have the following:

    State:          Waiting
      Reason:       CrashLoopBackOff
    Last State:     Terminated
      Reason:       Error
      Exit Code:    254
      Started:      Tue, 14 Jun 2022 11:07:23 +0200
      Finished:     Tue, 14 Jun 2022 11:07:24 +0200
    Ready:          False

Possible Solution

Rollback that change?

Steps to Reproduce

Additional Information

We build the Docker image on macOS, Linux, and Windows, with the same result: the old version runs, the new version fails.

EDIT: Where all began: https://github.com/nodejs/docker-node/commit/194a775693fd40598a1bafd4858e063c24efeb42

jasonleakey commented 2 years ago

Same here.

There was no issue with Node v16.15.0 and npm 8.5.5. With v16.15.1 and npm 8.11.0, the deployment's Pods crash with a CrashLoopBackOff error without any logs, only the exit code 254.

LaurentGoderre commented 2 years ago

It seems like an issue with running npm. Are you sure the command is running from the directory you expect it to? If you run npm start but there is no package lock in that folder, it could cause such an error.
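For example, something along these lines can quickly rule that out (my-app:16.15.1 is just a placeholder image name):

# print the working directory, check for a package manifest, then try npm start
docker run --rm --entrypoint sh my-app:16.15.1 -c 'pwd && ls package*.json && npm start'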

rufreakde commented 2 years ago

It seems like an issue with running npm. Are you sure the command is running from the directory you expect it to? If you run npm start but there is no package lock in that folder, it could cause such an error.

Yes, we are 100% sure the problem appeared only because of the version change. Other than that, nothing changed.

We found a post on the AWS forums where it was suggested to use ENTRYPOINT instead of CMD in the Dockerfile.

But neither option worked. It seems this issue has appeared several times in the past, and the AWS issue creator's solution was to roll back image versions…

One example: https://repost.aws/questions/QUtlb2BYIEQjyirCUWspC-CQ/exit-254-lambda-error-with-no-extra-explanations

LaurentGoderre commented 2 years ago

If you can play a bit with your k8s deployment, I would try overriding the container definition to use a custom command, and run pwd or a few different commands to try to debug.
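A minimal sketch of that kind of debugging, assuming you can schedule a throwaway pod with the same image (names here are illustrative):

# run the image with a harmless command so the pod stays up instead of crashing
kubectl run npm-debug --image=node:16.15.1-alpine3.16 --restart=Never --command -- sleep 3600
# then poke around in the same environment the app would start in
kubectl exec -it npm-debug -- sh -c 'pwd && node -v && npm -v'

If the crash only happens with your deployment's securityContext, overriding the deployment's own command with something like ["sleep", "3600"] and exec'ing into that pod gets you closer to the real conditions.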

rufreakde commented 2 years ago

If you can play a bit with your k8s deployment, I would try overriding the container definition to use a custom command, and run pwd or a few different commands to try to debug.

As mentioned before, we played around with a lot of things: CMD or ENTRYPOINT, using only hello-world-like, really minimalistic scenarios. But it came down to: the previous version worked, the one after did not. (And it seems we are not the only ones with this issue; @jasonleakey seems to have the same problem.)

morty29 commented 2 years ago

@LaurentGoderre I have the same issue; npm is not functional at all in the latest version. It is not related to the working directory, since even npm -v returns a newline and a non-zero return code (I believe it was 248 for me, though). I found this issue while searching for a solution, and downgrading helped. It is not related to the command or entrypoint either, since it is reproducible from the command line when you 'terminal into' the pod. Node works fine, and everything else seems to be working fine. It also does not seem to be a permission issue, since I made sure the project directory and /tmp have the same owner as the user I was running commands from. The same exact image works just fine under the same non-root user on my local Docker (the node user with uid and gid changed to 999, and under the default 1000 uid/gid).
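For anyone who wants to compare locally, a rough harness along these lines prints npm's exit code per image version and uid (per the reports above, the failure may only reproduce inside the cluster):

# compare npm across the working and failing tags, under the uids mentioned above
for img in node:16.15.0-alpine node:16.15.1-alpine; do
  for uid in 999 1000; do
    echo "== $img as uid $uid =="
    docker run --rm --user "$uid:$uid" "$img" sh -c 'npm -v; echo "exit code: $?"'
  done
done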

jasonleakey commented 2 years ago

I can confirm it's not a CMD issue either. I downloaded the image locally and npm start runs normally. I suspect npm v8.11.0 causes the issue. We noticed ERESOLVE issues similar to this thread, "Node v16.15.1 (npm v8.11.0) breaks some builds", and to npm/cli#4998. Although we solved the peerDependencies issues and rebuilt the image, the image still exits with the empty 254 error.
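One way to test that suspicion, as a sketch rather than a recommended fix, is to keep the newer base image but roll npm back to the version 16.15.0 shipped (8.5.5, per the comment above):

FROM node:16.15.1-alpine3.16
# pin npm back to the 16.15.0-era release to isolate npm as the variable
RUN npm install -g npm@8.5.5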

derekahn commented 2 years ago

Just ran into this issue...

Ended up implementing a work-around where I leveraged a multi-stage 🐳 build and ENTRYPOINT.

################
#  Build Stage
################
FROM node:16.15.1 AS build

WORKDIR /app

COPY . .

RUN npm install --production

################
#  Final Stage
################
FROM node:16.15.1-alpine3.16 AS final

WORKDIR /app

COPY --chown=nobody --from=build /app /app

# 🍒 FIX CVE-2022-29244
RUN rm -rf /usr/local/bin/npm \
  && rm -rf /root/.npm

USER nobody:nobody

EXPOSE 8080

ENTRYPOINT ["./bin/start.js"]

Obviously this isn't one-size-fits-all, due to different project structures and requirements, but hopefully it helps someone 🤞🏽

rufreakde commented 2 years ago

Just ran into this issue...

  • failed to run locally in 🐳 with error 243 under limited permissions
  • then, with elevated permissions, it failed on ☸️ AWS due to strict securityContext settings


We already use a multi-stage build, and ENTRYPOINT did not help our situation :( But thanks for sharing!

jujaga commented 2 years ago

Looks like this problem is still occurring on the latest 16.16.0-alpine tag.

gzelek8 commented 2 years ago

Is there any update?

phumaster commented 2 years ago

any update guys?

rufreakde commented 2 years ago

Are there any updates, please? It seems to be clearly related to an image change.

@PeterDaveHello @nschonni @chorrell @LaurentGoderre @SimenB

ankon commented 2 years ago

We have seen a similar issue as well, and ultimately tracked it down to a native dependency triggering the crash when we updated the build/runtime environment versions.

FROM node:16.15.1 AS build
...
FROM node:16.15.1-alpine3.16 AS final

The underlying cause, however, was that we used to build on node (non-Alpine) and run on node:alpine, just like in this snippet from @derekahn. Alpine uses a different C library (musl instead of glibc) than the non-Alpine variant, and you cannot simply "switch them out" -- so if you do build a native dependency, you need to make sure to build it in the same environment you will ultimately run it on.

This problem might stay hidden for a long time, as not all native dependencies get used all the time: in our case it was a crypto library, which was only exercised by a small part of the application's functionality.
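A minimal sketch of the fix described here, with illustrative paths: build and run on the same Alpine (musl) base so any compiled addons match the runtime libc.

FROM node:16.15.1-alpine3.16 AS build
WORKDIR /app
COPY package*.json ./
# toolchain for node-gyp, so native addons are compiled against musl like the runtime
RUN apk add --no-cache python3 make g++ \
  && npm ci --production
COPY . .

FROM node:16.15.1-alpine3.16 AS final
WORKDIR /app
COPY --from=build /app /app
USER node
CMD ["node", "src/start.js"]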

rufreakde commented 2 years ago

@ankon this does not seem to be the case for us. We use the same Alpine base image for all of our multi-stage Docker build steps.

# ---- Base Image ----
FROM node:lts-alpine3.15 AS base

ENV DEBIAN_FRONTEND=noninteractive
ENV IMAGE_USER=defaultUser
ENV IMAGE_USER_GROUP=defaultGroup
ENV APP_DIR_IN_USER_DIR=App

RUN \
set -eux \
\
## Update Alpine base \
&& apk update \
&& apk upgrade \
--no-cache \
--progress \
--force-refresh

... base preparation

# ---- NPM Dependencies multistage tests----
FROM base AS build
LABEL env=build
COPY . .

# user
USER root
RUN apk add sqlite
RUN chown -R $IMAGE_USER:$IMAGE_USER_GROUP .
USER $IMAGE_USER

# create and copy production node_modules aside for last layer
RUN set -euxo pipefail                          \
    && npm audit fix --only=production || true  \
    && cp -R node_modules prod_node_modules     \
    && rm -rf node_modules/

# ---- Test ----
# no need for audit on dev dependencies since we remove them from final image
... run tests

# ---- Release ----
FROM base AS run
LABEL env=run

COPY . .
# copy production node_modules
COPY  --from=build /home/$IMAGE_USER/$APP_DIR_IN_USER_DIR/prod_node_modules ./node_modules

# this will not work with headless images we plan to use in the future.
USER root
RUN chown -R $IMAGE_USER:$IMAGE_USER_GROUP . 
USER $IMAGE_USER

EXPOSE 4004
CMD ["npm", "run" , "start"]

Since it is the same base image, it should not have this issue, right?

ankon commented 2 years ago

At least not in the "trivial" way we could see it in hindsight; the setup looks sane to me in that regard.

It might still be good to check what exactly is crashing, and unfortunately it is quite likely that there are different underlying causes that manifest as a similar crash.

RikuVan commented 2 years ago

We seem to have run into a similar issue with the pod crashing. Some time ago we tried to go to 16.15.1 and failed, and now the same thing happens with 16.16.0. We are stuck at FROM node:16.15.0-bullseye-slim.

propattern commented 2 years ago

Due to security vulnerabilities, we have to update our base image as well from 16.14-alpine3.15. Unfortunately, any version above 16.15.1 has this issue.

The Docker image works fine when we run it locally and in any other environment. But when rolled out using Kubernetes, the application state is: CrashLoopBackOff

Can this issue be prioritised, please?

propattern commented 2 years ago

In our case, we managed to fix this issue. We were using a multi-stage Docker build (install, builder, distribution) with node:16.14-alpine3.15. In order to address the security vulnerabilities (CVE-2022-2097, CVE-2022-29458), we had to update to node:16.16-alpine3.15.

In our case the fix was to explicitly install and downgrade npm to 8.5.0 in the distribution image in our Dockerfile. We tried every version of npm above 8.5.0; it didn't work, and the issue was reproduced again or other issues surfaced.

Therefore we had to install npm and pin it to exactly version 8.5.0:

RUN npm install -g npm@8.5.0 --save-exact

Our Dockerfile previously looked like this:

# INSTALL CONTAINER
FROM node:16.16-alpine3.15 as install
...

# BUILDER CONTAINER

FROM node:16.16-alpine3.15 as builder
....

# RUNTIME CONTAINER

FROM node:16.16-alpine3.15 AS distribution
....

We then changed it to explicitly set the npm version to 8.5.0 in each stage; our Dockerfile now looks like this, and the issue is fixed:

# INSTALL CONTAINER
FROM node:16.16-alpine3.15 as install
RUN npm install -g npm@8.5.0 --save-exact

...

# BUILDER CONTAINER

FROM node:16.16-alpine3.15 as builder
RUN npm install -g npm@8.5.0 --save-exact

....

# RUNTIME CONTAINER

FROM node:16.16-alpine3.15 AS distribution
RUN npm install -g npm@8.5.0 --save-exact

....

This approach resolved the problem for now. I hope it helps others fix their problem, but we understand that it is only a workaround, and we hope the Node image will ship a properly working version of npm.

rufreakde commented 2 years ago

RUN npm install -g npm@8.5.0 --save-exact

This does the trick!

@propattern It seems the issue is related to the npm version that ships with newer image versions. Thanks for sharing the workaround! 👍

carlosingles commented 2 years ago

Thanks to @propattern for the workaround, this has also worked for our environments.

However, upon further investigation we have managed to fix the problem without downgrading npm and have done so by changing our Dockerfile to run as the node user provided by the Docker image. You may find more documentation around this for other use cases here: https://github.com/nodejs/docker-node/blob/main/docs/BestPractices.md#non-root-user

TL;DR: Use the provided node user

FROM node:16-alpine
# ...
# ...
# At the end, set the user to use when running this image
USER node
CMD ["node", "src/start.js"]