nix-community / poetry2nix

Convert poetry projects to nix automagically [maintainer=@adisbladis,@cpcloud]
MIT License

Built poetry apps / envs contain `__pycache__` and `*.pyc` #1580

Open declension opened 6 months ago

declension commented 6 months ago

First of all, thanks for a great project! Really helped us and got me further into the Nix rabbit-hole (a long way to go still).

Describe the issue

I know this might not belong to poetry2nix, but thought this might still be a good place to talk about addressing it...

We're Dockerising various Poetry apps using poetry2nix and dockerTools.streamLayeredImage. While debugging why the image grew so much compared with the [minimalist, multi-stage] Dockerfile version, I noticed that Python packages in the Nix store seem to come with __pycache__ and *.pyc files, which add a lot of weight never present in the traditional images' layers (in fact, people often add RUN steps to remove any such straggling files).

Is there a best practice here that I'm missing? Can poetry2nix clean these files out, at least in the locally built package(s)?

takeda commented 6 months ago

The __pycache__ files are Python source code compiled to bytecode. If you remove those files, Python will regenerate the bytecode every time it starts. You're essentially trading startup speed for the size of the Docker image. I think actually removing the .py files and leaving the bytecode could make more sense :) although then debugging might be hard.

As for removing them, I don't have an answer, as I haven't done this myself, but perhaps you could use the postFixup phase and inject an rm command to remove them?
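
A minimal sketch of what such an rm step might look like as plain shell, assuming it is run against a directory such as the `$out` of a package during a fixup phase. The function name `strip_pycache` is made up for illustration; it is not a poetry2nix or nixpkgs helper.

```shell
# Hypothetical helper: delete bytecode caches below a directory.
strip_pycache() {
  root="$1"
  # -prune keeps find from descending into a directory that is about to be deleted.
  find "$root" -type d -name __pycache__ -prune -exec rm -rf {} +
  # Older layouts may keep stray .pyc files next to the sources.
  find "$root" -type f -name '*.pyc' -exec rm -f {} +
}
```

Inside a postFixup script this would presumably be called as `strip_pycache "$out"`. Whether the saved space is worth the slower startup is the tradeoff described above.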

jDmacD commented 4 months ago

@declension did you make any headway on this? I'm seeing a ~100 MB difference in size due to caches.

declension commented 4 months ago

@jDmacD not really :disappointed:

At one point I doubted myself that it was even happening, but pretty sure it is.

I then tried hacking (badly) various combinations of extraCommands / fakeRootCommands etc. (from https://ryantm.github.io/nixpkgs/builders/images/dockertools/#ssec-pkgs-dockerTools-buildLayeredImage) but couldn't even see the cache files at that point. I can't remember what my theory was, but it was guesswork anyway.

On re-testing I'm now seeing 83 MB of __pycache__ in the image in question, which is an API project with a medium-sized set of dependencies, i.e. nothing massive.
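
For anyone wanting to reproduce a figure like this, here is a rough shell sketch that sums the disk usage of all __pycache__ directories under a given root, for example an unpacked image filesystem or a single store path. The helper name `pycache_kib` is hypothetical, and the result is in KiB as reported by `du -sk`.

```shell
# Hypothetical helper: print the total size (KiB) of all __pycache__ directories under a root.
pycache_kib() {
  root="$1"
  # du prints one "<size> <path>" line per cache directory; awk adds up the sizes.
  # "total + 0" makes awk print 0 when no cache directory was found.
  find "$root" -type d -name __pycache__ -prune -exec du -sk {} + |
    awk '{ total += $1 } END { print total + 0 }'
}
```

Running it on both images gives a like-for-like number for how much of the size difference comes from bytecode caches.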

jDmacD commented 4 months ago

It's definitely happening. This Dockerfile produces a 161 MB image:

# syntax=docker/dockerfile:latest
# https://medium.com/@albertazzir/blazing-fast-python-docker-builds-with-poetry-a78a66f5aed0
FROM python:3.11-buster as builder

RUN pip install poetry==1.8.3

ENV POETRY_NO_INTERACTION=1 \
    POETRY_VIRTUALENVS_IN_PROJECT=1 \
    POETRY_VIRTUALENVS_CREATE=1 \
    POETRY_CACHE_DIR=/tmp/poetry_cache

WORKDIR /app

COPY pyproject.toml poetry.lock ./
RUN touch README.md

RUN --mount=type=cache,target=$POETRY_CACHE_DIR poetry install --without dev --no-root

FROM python:3.11-slim-buster as runtime

ENV VIRTUAL_ENV=/app/.venv \
    PATH="/app/.venv/bin:$PATH" \
    PYTHONPATH=.

COPY --from=builder ${VIRTUAL_ENV} ${VIRTUAL_ENV}

COPY bgp_operator ./bgp_operator

ENTRYPOINT ["kopf", "run" , "--liveness=http://0.0.0.0:8080/healthz", "-A", "-m", "bgp_operator.main"]

This flake produces a 245 MB image:

{
  description = "Application packaged using poetry2nix";

  inputs = {
    flake-utils.url = "github:numtide/flake-utils";
    nixpkgs.url = "github:NixOS/nixpkgs/nixos-unstable-small";
    poetry2nix = {
      url = "github:nix-community/poetry2nix";
      inputs.nixpkgs.follows = "nixpkgs";
    };
  };

  outputs = { self, nixpkgs, flake-utils, poetry2nix }:
    flake-utils.lib.eachDefaultSystem (system:
      let
        pkgs = nixpkgs.legacyPackages.${system};
        inherit (poetry2nix.lib.mkPoetry2Nix { inherit pkgs; }) mkPoetryApplication cleanPythonSources;
      in
      {
        packages = {
          bgpOperator = mkPoetryApplication {
            projectDir = self;
            checkGroups = [];
          };

          # nix build .#packages.x86_64-linux.dockerImage
          # docker load < result
          dockerImage = pkgs.dockerTools.buildImage {
            name = "bgp-operator-nix";
            tag = "latest";

            copyToRoot = pkgs.buildEnv {
              name = "bgp-operator-env";
              paths = [ self.packages.${system}.bgpOperator ];
            };

            runAsRoot = ''
              mkdir -p /etc/
              echo "root:x:0:0:root:/root:/bin/bash" > /etc/passwd
            '';

            config.Cmd = [ "/bin/bgp-operator" ];
            config.WorkingDir = "/";
          };
          default = self.packages.${system}.bgpOperator;
        };

        devShells.default = pkgs.mkShell {
          packages = [ pkgs.poetry pkgs.dive ];
          shellHook = ''
          export NSX_USERNAME=$(${pkgs.gum}/bin/gum input --placeholder "NSXT username")
          export NSX_PASSWORD=$(${pkgs.gum}/bin/gum input --placeholder "NSXT password" --password)
          '';
        };
      });
}

Digging around in the two images using dive (the Nix image first, then the Dockerfile image):

Permission     UID:GID       Size  Filetree                                                                       
dr-xr-xr-x         0:0     1.9 MB  │       ├─⊕ y6hmqbmbwq0rmx1fzix5c5jszla2pzmp-tzdata-2024a                      
dr-xr-xr-x         0:0      41 MB  │       ├── y7y3yvzlk2001hgqlzqxhz8aszxffdrx-python3.11-kubernetes-29.0.0      
dr-xr-xr-x         0:0      41 MB  │       │   ├── lib                                                            
dr-xr-xr-x         0:0      41 MB  │       │   │   └── python3.11                                                 
dr-xr-xr-x         0:0      41 MB  │       │   │       └── site-packages                                          
dr-xr-xr-x         0:0      41 MB  │       │   │           ├── kubernetes                                         
-r--r--r--         0:0      844 B  │       │   │           │   ├── __init__.py                                    
dr-xr-xr-x         0:0     1.2 kB  │       │   │           │   ├─⊕ __pycache__                                    
dr-xr-xr-x         0:0      40 MB  │       │   │           │   ├── client                                         
-r--r--r--         0:0      52 kB  │       │   │           │   │   ├── __init__.py                                
dr-xr-xr-x         0:0     266 kB  │       │   │           │   │   ├─⊕ __pycache__                                
dr-xr-xr-x         0:0      24 MB  │       │   │           │   │   ├── api                                        
-r--r--r--         0:0     4.2 kB  │       │   │           │   │   │   ├── __init__.py                            
dr-xr-xr-x         0:0      15 MB  │       │   │           │   │   │   ├─⊕ __pycache__                            
-r--r--r--         0:0     5.2 kB  │       │   │           │   │   │   ├── admissionregistration_api.py           
-r--r--r--         0:0     182 kB  │       │   │           │   │   │   ├── admissionregistration_v1_api.py        
-r--r--r--         0:0     210 kB  │       │   │           │   │   │   ├── admissionregistration_v1alpha1_api.py  
-r--r--r--         0:0     210 kB  │       │   │           │   │   │   ├── admissionregistration_v1beta1_api.py   

Permission     UID:GID       Size  Filetree
drwxr-xr-x         0:0      22 kB  │       │           ├─⊕ google_auth-2.29.0.dist-info                                 
drwxr-xr-x         0:0     304 kB  │       │           ├─⊕ idna                                                         
drwxr-xr-x         0:0      12 kB  │       │           ├─⊕ idna-3.7.dist-info                                           
drwxr-xr-x         0:0      16 kB  │       │           ├─⊕ iso8601                                                      
drwxr-xr-x         0:0     5.5 kB  │       │           ├─⊕ iso8601-2.1.0.dist-info                                      
drwxr-xr-x         0:0     644 kB  │       │           ├─⊕ kopf                                                         
drwxr-xr-x         0:0      19 kB  │       │           ├─⊕ kopf-1.37.2.dist-info                                        
drwxr-xr-x         0:0      13 MB  │       │           ├── kubernetes                                                   
-rw-r--r--         0:0      844 B  │       │           │   ├── __init__.py                                              
drwxr-xr-x         0:0      13 MB  │       │           │   ├── client                                                   
-rw-r--r--         0:0      52 kB  │       │           │   │   ├── __init__.py                                          
drwxr-xr-x         0:0     8.6 MB  │       │           │   │   ├── api                                                  
-rw-r--r--         0:0     4.2 kB  │       │           │   │   │   ├── __init__.py                                      
-rw-r--r--         0:0     5.2 kB  │       │           │   │   │   ├── admissionregistration_api.py                     
-rw-r--r--         0:0     182 kB  │       │           │   │   │   ├── admissionregistration_v1_api.py                  
-rw-r--r--         0:0     210 kB  │       │           │   │   │   ├── admissionregistration_v1alpha1_api.py            
-rw-r--r--         0:0     210 kB  │       │           │   │   │   ├── admissionregistration_v1beta1_api.py             
-rw-r--r--         0:0     5.2 kB  │       │           │   │   │   ├── apiextensions_api.py                             
-rw-r--r--         0:0     121 kB  │       │           │   │   │   ├── apiextensions_v1_api.py                          
-rw-r--r--         0:0     5.2 kB  │       │           │   │   │   ├── apiregistration_api.py                           
-rw-r--r--         0:0     118 kB  │       │           │   │   │   ├── apiregistration_v1_api.py                        
-rw-r--r--         0:0     5.2 kB  │       │           │   │   │   ├── apis_api.py                                      

I suspect this is doing the damage, but I don't know how to avoid it:

            copyToRoot = pkgs.buildEnv {
              name = "bgp-operator-env";
              paths = [ self.packages.${system}.bgpOperator ];
            };

laurentS commented 1 week ago

I came here trying to understand the same problem (I'm very new to nix). I managed to remove __pycache__ from a single dependency with:

        pkgOverrides = pkgs.poetry2nix.overrides.withDefaults (
          final: prev: {
            somedependency = prev.somedependency.overridePythonAttrs (old: {
              postFixup = (old.postFixup or "") + ''
                # Remove every bytecode cache directory from the package output.
                find "$out" -type d -name __pycache__ -prune -exec rm -rf {} +
              '';
            });
          }
        );
        myEnv =
          (pkgs.poetry2nix.mkPoetryEnv {
            overrides = pkgOverrides;
            projectDir = ./.;
            ...
          });

I still have to figure out how to apply this to all dependencies without explicitly listing them. The joys of learning a new language :)

Ultimately, it looks like removing the cached files is a tradeoff between startup time of your container and size, as was mentioned above.

Three small bits of info I learnt while going down this rabbit hole: