neuro-inc / neuro-extras

Curated set of MLOps tools to work with the Neu.ro MLOps platform
https://neu-ro.gitbook.io/neuro-extras-reference/cli

[B] Building large Docker images in Kaniko leads to OOMKilled job #283

Open YevheniiSemendiak opened 3 years ago

YevheniiSemendiak commented 3 years ago

Summary

Building a relatively large Docker image with Kaniko fails: K8s kills the Kaniko pod for high memory usage, even on fairly large presets (for example, one with 10 GB of RAM).

Steps to reproduce

  1. Create a Dockerfile with the following content:
    FROM neuromation/neuro-extras:21.3.19
    RUN wget https://raw.githubusercontent.com/neuro-inc/platform-client-python/master/build-tools/garbage-files-generator.py && \
        python3 garbage-files-generator.py 1 7Gb
  2. Launch the build with:
    neuro-extras image build -s cpu-large . image:test-build-failure
  3. Observe that the job was OOMKilled.
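To reproduce the large layer without depending on the remote script, something like the following stand-in could be used. It is only a sketch: it is not the actual garbage-files-generator.py, and it assumes the two arguments (1 and 7Gb) mean a file count and a total size, which this issue does not confirm. The names parse_size and generate_garbage are hypothetical.

```python
import os
import re
from pathlib import Path

# Binary multipliers; the unit spelling ("Gb") follows the command in the issue.
_UNITS = {"b": 1, "kb": 1024, "mb": 1024**2, "gb": 1024**3}


def parse_size(text):
    """Convert a string such as '7Gb' into a number of bytes."""
    match = re.fullmatch(r"(\d+)\s*([A-Za-z]+)", text.strip())
    if match is None:
        raise ValueError(f"cannot parse size: {text!r}")
    number, unit = match.groups()
    if unit.lower() not in _UNITS:
        raise ValueError(f"unknown unit: {unit!r}")
    return int(number) * _UNITS[unit.lower()]


def generate_garbage(directory, count, total_size, chunk_size=1024 * 1024):
    """Write `count` files of random bytes totalling roughly `total_size` bytes.

    Assumption: the total is split evenly across files. Data is written in
    chunks so the generator itself never holds a whole file in memory.
    """
    directory = Path(directory)
    directory.mkdir(parents=True, exist_ok=True)
    per_file = total_size // count
    paths = []
    for index in range(count):
        path = directory / f"garbage-{index}.bin"
        remaining = per_file
        with open(path, "wb") as handle:
            while remaining > 0:
                step = min(chunk_size, remaining)
                handle.write(os.urandom(step))
                remaining -= step
        paths.append(path)
    return paths
```

Calling generate_garbage("garbage", 1, parse_size("7Gb")) inside a RUN step should produce a single layer of about 7 GiB, comparable in size to the one in the Dockerfile above.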

Expected result

The build finishes properly.

Environment

Mandatory:

Additional information (optional)

Example job ID: job-98cf4efa-4128-49d6-9f8f-01937011ed67

A manual rerun with Kaniko caching disabled (job-c26e3874-fcfa-478f-a862-72a28394853c) leads to the same error.

YevheniiSemendiak commented 3 years ago

Some relevant reports in Kaniko repo:

YevheniiSemendiak commented 3 years ago

  - v1.6.0 with --cache=false: OOMKilled (metrics)
  - v1.6.0 with --cache=false and default --snapshotMode: OOMKilled (job-699d81aa-0e61-4eef-8287-e7a611906765)
  - v1.5.0: OOMKilled
  - v1.3.0: OK (yet, no --cache-copy-layers flag usage), metrics
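The results above can be kept as a small lookup table, for example to warn before pinning a Kaniko version. This is a minimal sketch: the OBSERVED table only records what was tested in this thread, and the helper names parse_version and observed_result are hypothetical. Versions that were not tested are reported as such rather than guessed.

```python
# Outcomes reported in this issue, keyed by (major, minor, patch).
OBSERVED = {
    (1, 3, 0): "OK",
    (1, 5, 0): "OOMKilled",
    (1, 6, 0): "OOMKilled",
}


def parse_version(tag):
    """Turn 'v1.6.0' or '1.6.0' into a tuple of ints, e.g. (1, 6, 0)."""
    if tag.startswith("v"):
        tag = tag[1:]
    return tuple(int(part) for part in tag.split("."))


def observed_result(tag):
    """Return the outcome recorded in this thread, or 'untested'."""
    return OBSERVED.get(parse_version(tag), "untested")


print(observed_result("v1.6.0"))  # OOMKilled
print(observed_result("v1.4.0"))  # untested
```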

YevheniiSemendiak commented 3 years ago

Created an issue in the Kaniko repo: https://github.com/GoogleContainerTools/kaniko/issues/1680

YevheniiSemendiak commented 3 years ago

Downgraded Kaniko to v1.3.0 in #287. We need to bump it again once the issue is fixed in the Kaniko repo.

YevheniiSemendiak commented 3 years ago

Kaniko release 1.7.0 resolves the problem. We need to expose that flag for users and bump Kaniko.