scribe-org / Scribe-Data

Wikidata, Wiktionary and Wikipedia language data extraction
GNU General Public License v3.0
23 stars 25 forks source link

Add Docker image file for CLI #168

Open mhmohona opened 2 months ago

mhmohona commented 2 months ago

Contributor checklist


Description

Related issue

Fixes - #150

github-actions[bot] commented 2 months ago

Thank you for the pull request!

The Scribe team will do our best to address your contribution as soon as we can. The following is a checklist for maintainers to make sure this process goes as well as possible. Feel free to address the points below yourself in further commits if you realize that actions are needed :)

If you're not already a member of our public Matrix community, please consider joining! We'd suggest using Element as your Matrix client, and definitely join the General and Data rooms once you're in. Also consider joining our bi-weekly Saturday dev syncs. It'd be great to have you!

Maintainer checklist

mhmohona commented 2 months ago

So here the next step is pushing docker image to the registry. From which docker id shall I push?

andrewtavis commented 2 months ago

I'll send this one over to @wkyoshida as he's the expert here :) :)

wkyoshida commented 2 months ago

So here the next step is pushing docker image to the registry. From which docker id shall I push?

hmm I guess we have some options. We could create a repository in Docker Hub for us or we could use GitHub's registry.

The latter could be cool since I believe it'd prominently show up under Packages in the repo - though this option does have a limit of 500MB for free use. If you build the image from the Dockerfile, @mhmohona , about how large is the resulting image?

The former would likely allow us to save several versions of the Docker image though, since we can go beyond 500MB

andrewtavis commented 1 month ago

What's the plan for this one here? @mhmohona, are we ready to merge? Just checking in as you were saying it's hard for you to test this, so do you need support finishing this up?

mhmohona commented 1 month ago

So I was able to successfully test it in Github codespace. I think I can do it till the end.

wkyoshida commented 1 month ago

Nice!! Are you able to see how large it ended up in GH Codespace? Also, curious if that was still including the below command?

RUN apt-get update && apt-get install -y \
    build-essential \
    pkg-config \
    libicu-dev \
    && rm -rf /var/lib/apt/lists/*

For context @andrewtavis , building with the Dockerfile locally was bloating up. @mhmohona mentioned that the command above takes some time to complete. We discussed the idea of trying to install PyICU stuff with the Brewfile that we already have, i.e. brew bundle install --file=Brewfile

@mhmohona might have to run something like the below actually - now that I look at it again. Didn't explicitly try it myself though

COPY Brewfile /app/
RUN brew bundle install --file=Brewfile && \
   export PATH="/opt/homebrew/opt/icu4c/bin:/opt/homebrew/opt/icu4c/sbin:$PATH" && \
   export PKG_CONFIG_PATH="$PKG_CONFIG_PATH:/opt/homebrew/opt/icu4c/lib/pkgconfig"
mhmohona commented 1 month ago

Nice!! Are you able to see how large it ended up in GH Codespace? Also, curious if that was still including the below command?

RUN apt-get update && apt-get install -y \
    build-essential \
    pkg-config \
    libicu-dev \
    && rm -rf /var/lib/apt/lists/*

For context @andrewtavis , building with the Dockerfile locally was bloating up. @mhmohona mentioned that the command above takes some time to complete. We discussed the idea of trying to install PyICU stuff with the Brewfile that we already have, i.e. brew bundle install --file=Brewfile

@mhmohona might have to run something like the below actually - now that I look at it again. Didn't explicitly try it myself though

COPY Brewfile /app/
RUN brew bundle install --file=Brewfile && \
   export PATH="/opt/homebrew/opt/icu4c/bin:/opt/homebrew/opt/icu4c/sbin:$PATH" && \
   export PKG_CONFIG_PATH="$PKG_CONFIG_PATH:/opt/homebrew/opt/icu4c/lib/pkgconfig"

So its the storage status - image

I got error if i add Brewfile, keeping original file gave me following output -

image

and its the image in docker hub - https://hub.docker.com/repository/docker/mhmohona/scribe-data-cli I created it as test, will delete once everything is okay.

image trying to pull, and then will run it locally.

@wkyoshida, what do you think about it?

mhmohona commented 1 month ago

image

and its the CLI in action -

image

I am not sure yet how to make only scribe-data list work instead of docker run mhmohona/scribe-data-cli list

wkyoshida commented 1 month ago

Hmm - it seems that the image ends up pretty big.. What is the error with trying to use the Brewfile? I'm mostly curious if we can get that working, as I suspect using it alternatively could help with image size perhaps.

wkyoshida commented 1 month ago

With the docker cli, you could also try inspecting details of the image and looking into the layer details to find out which layers are the large ones, i.e. responsible for the enlarged image size.

With that information, we can go from there and modify the Dockerfile to address the responsible ones.

mhmohona commented 1 month ago

Hmm - it seems that the image ends up pretty big.. What is the error with trying to use the Brewfile? I'm mostly curious if we can get that working, as I suspect using it alternatively could help with image size perhaps.

got following error - image

Its the output of docker history -


@mhmohona ➜ /workspaces/Scribe-Data (docker) $ docker images
REPOSITORY                 TAG       IMAGE ID       CREATED      SIZE
scribe-data-cli            latest    39ac5a96734c   2 days ago   8.59GB
mhmohona/scribe-data-cli   latest    39ac5a96734c   2 days ago   8.59GB

@mhmohona ➜ /workspaces/Scribe-Data (docker) $ docker history  39ac5a96734c
IMAGE          CREATED       CREATED BY                                      SIZE      COMMENT
39ac5a96734c   2 days ago    ENTRYPOINT ["python" "src/scribe_data/cli/ma…   0B        buildkit.dockerfile.v0
<missing>      2 days ago    ENV PYTHONPATH=/app/src                         0B        buildkit.dockerfile.v0
<missing>      2 days ago    COPY /app /app # buildkit                       440MB     buildkit.dockerfile.v0
<missing>      2 days ago    COPY /usr/local/bin /usr/local/bin # buildkit   28.8MB    buildkit.dockerfile.v0
<missing>      2 days ago    COPY /usr/local/lib/python3.9 /usr/local/lib…   7.99GB    buildkit.dockerfile.v0
<missing>      2 days ago    WORKDIR /app                                    0B        buildkit.dockerfile.v0
<missing>      4 weeks ago   CMD ["python3"]                                 0B        buildkit.dockerfile.v0
<missing>      4 weeks ago   RUN /bin/sh -c set -eux;   savedAptMark="$(a…   10MB      buildkit.dockerfile.v0
<missing>      4 weeks ago   ENV PYTHON_GET_PIP_SHA256=6fb7b781206356f45a…   0B        buildkit.dockerfile.v0
<missing>      4 weeks ago   ENV PYTHON_GET_PIP_URL=https://github.com/py…   0B        buildkit.dockerfile.v0
<missing>      4 weeks ago   ENV PYTHON_SETUPTOOLS_VERSION=58.1.0            0B        buildkit.dockerfile.v0
<missing>      4 weeks ago   ENV PYTHON_PIP_VERSION=23.0.1                   0B        buildkit.dockerfile.v0
<missing>      4 weeks ago   RUN /bin/sh -c set -eux;  for src in idle3 p…   0B        buildkit.dockerfile.v0
<missing>      4 weeks ago   RUN /bin/sh -c set -eux;   savedAptMark="$(a…   31.3MB    buildkit.dockerfile.v0
<missing>      4 weeks ago   ENV PYTHON_VERSION=3.9.19                       0B        buildkit.dockerfile.v0
<missing>      4 weeks ago   ENV GPG_KEY=E3FF2839C048B25C084DEBE9B26995E3…   0B        buildkit.dockerfile.v0
<missing>      4 weeks ago   RUN /bin/sh -c set -eux;  apt-get update;  a…   9.21MB    buildkit.dockerfile.v0
<missing>      4 weeks ago   ENV LANG=C.UTF-8                                0B        buildkit.dockerfile.v0
<missing>      4 weeks ago   ENV PATH=/usr/local/bin:/usr/local/sbin:/usr…   0B        buildkit.dockerfile.v0
<missing>      4 weeks ago   /bin/sh -c #(nop)  CMD ["bash"]                 0B        
<missing>      4 weeks ago   /bin/sh -c #(nop) ADD file:3d9897cfe027ecc7c…   74.8MB    ```
wkyoshida commented 4 weeks ago

<missing> 2 days ago COPY /usr/local/lib/python3.9 /usr/local/lib… 7.99GB buildkit.dockerfile.v0 hmm.. yeah this seems to be the big culprit (7.99GB). This would be the layer on L25 of the Dockerfile, and could be due to what is getting pip installed on L15. It would be interesting to see how large the layer for the pip install is. Notice though that we can't really see easily the layers from before L21, which is when the second stage of the Dockerfile starts. The deepest layer that we see in this output is of the layer on L22: <missing> 2 days ago WORKDIR /app 0B buildkit.dockerfile.v0

One thing we could try for now is to actually remove the multi-stage aspect of the Dockerfile and just use a single stage. We'll be able to more easily analyze which layers are heavy in the builder stage.


got following error -

Yeah - what I had first sent you in Matrix is slightly off. Notice that it's using things like GITHUB_ENV AND GITHUB_PATH. Those are from when we need to install PyICU in the GitHub workflow (here).

The below might be closer to what we need, but I haven't tested it myself:

COPY Brewfile /app/
RUN brew bundle install --file=Brewfile && \
   export PATH="/opt/homebrew/opt/icu4c/bin:/opt/homebrew/opt/icu4c/sbin:$PATH" && \
   export PKG_CONFIG_PATH="$PKG_CONFIG_PATH:/opt/homebrew/opt/icu4c/lib/pkgconfig"

NOTE: This will likely need a step before to install brew though

wkyoshida commented 4 weeks ago

Just a quick note for clarification -

All the layers further than <missing> 2 days ago WORKDIR /app 0B buildkit.dockerfile.v0 are most likely those of the python:3.9-slim image itself that the Dockerfile is using as the base image.

wkyoshida commented 5 days ago

Alrighty -

So in tweaking the Dockerfile and building the image myself, I'm seeing that the big culprit here really appears to be the pip install step in fact (this layer alone ends up at over 8GB in size :scream:). More specifically - in particular the tensorflow and torch requirements as they bring in pretty large dependencies. As I am understanding though..


CC: @andrewtavis

mhmohona commented 5 days ago

Thank you for looking into it @wkyoshida.

wkyoshida commented 5 days ago

I did also decide to actually go with removing the multi-stage aspect of the Dockerfile @mhmohona :grin: The second stage seemed that it was already pretty much copying over most of what was being done in the builder stage anyways, so it made sense to me to simplify the Dockerfile in this way

andrewtavis commented 5 days ago

Points on the above:

  • tensorflow: we're probably good to just scratch it off the requirements.txt. We are no longer making use of it, is this correct?
  • torch: we are still using or at least did still use it for the translations that were generated recently. However, as the idea anyways is to transition to Wiktionary-based translations, we are perhaps good with removing it from the requirements.txt as well soon?
  • bonus: as we're already having to look at the requirements.txt, are there any other dependencies listed that we no longer make use of that we could scratch off too?
wkyoshida commented 3 days ago

Ya we can remove it :)

:rocket: - removed with cb29d7b

.. and then from there I can remove the old translation process ..

Ya that sounds good!

For references, I tested out building the image after removing the soon-to-scrapped dependencies. The pip install layer is currently at ~8GB. After removing..

So yea - I think once we're able to scrap torch (in particular), we'll be in better shape

How are we making that table PNG, anyway 🤔

I was thinking of leaving it entirely to Scribe-Server actually; so eventually we could even scrap tabulate too. I'd say there's no rush for it though. We can continue to use tabulate for this and keep it around until then. Its footprint is fairly small; so no concern with Docker if it's still around.