Open mhmohona opened 2 months ago
The Scribe team will do our best to address your contribution as soon as we can. The following is a checklist for maintainers to make sure this process goes as well as possible. Feel free to address the points below yourself in further commits if you realize that actions are needed :)
If you're not already a member of our public Matrix community, please consider joining! We'd suggest using Element as your Matrix client, and definitely join the General and Data rooms once you're in. Also consider joining our bi-weekly Saturday dev syncs. It'd be great to have you!
- [x] The commit messages for the remote branch should be checked to make sure the contributor's email is set up correctly so that they receive credit for their contribution
  - One can check the email with `git config user.email` in their local Scribe-Data repo
- [x] The linting and formatting workflow within the PR checks do not indicate new errors in the files changed
- [ ] The CHANGELOG has been updated with a description of the changes for the upcoming release and the corresponding issue (if necessary)
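For reference, checking and fixing the commit email looks something like the below (the address is a placeholder; the scratch repo is just for demonstration, in practice you'd run the two `git config` commands inside your local Scribe-Data clone):

```shell
# Demonstration in a scratch repo; in practice run the `git config`
# commands inside your local Scribe-Data clone.
cd "$(mktemp -d)" && git init -q .

# Set the email for this repo only (use the address on your GitHub account):
git config user.email "you@example.com"

# Show the email that will be attached to new commits:
git config user.email   # -> you@example.com
```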
So here the next step is pushing the Docker image to the registry. From which Docker ID should I push?
I'll send this one over to @wkyoshida as he's the expert here :)
> So here the next step is pushing the Docker image to the registry. From which Docker ID should I push?
Hmm, I guess we have some options. We could create a repository for us on Docker Hub, or we could use GitHub's registry.

The latter could be cool since I believe it'd prominently show up under Packages in the repo - though this option does have a limit of 500MB for free use. If you build the image from the Dockerfile, @mhmohona, about how large is the resulting image?

The former would likely allow us to save several versions of the Docker image though, since we can go beyond 500MB.
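Either way, the push itself is just a tag plus a push - a sketch only, where the `mhmohona/scribe-data-cli` and `scribe-org` names are placeholders, and GHCR would additionally need a `docker login ghcr.io` with a personal access token first:

```shell
# Docker Hub
docker tag scribe-data-cli mhmohona/scribe-data-cli:latest
docker push mhmohona/scribe-data-cli:latest

# GitHub Container Registry (would show up under Packages in the repo)
docker tag scribe-data-cli ghcr.io/scribe-org/scribe-data-cli:latest
docker push ghcr.io/scribe-org/scribe-data-cli:latest
```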
What's the plan for this one here? @mhmohona, are we ready to merge? Just checking in as you were saying it's hard for you to test this, so do you need support finishing this up?
So I was able to successfully test it in GitHub Codespaces. I think I can see it through to the end.
Nice!! Are you able to see how large it ended up in GH Codespaces? Also, curious if that was still including the below command?

```dockerfile
RUN apt-get update && apt-get install -y \
    build-essential \
    pkg-config \
    libicu-dev \
    && rm -rf /var/lib/apt/lists/*
```
For context @andrewtavis, building with the `Dockerfile` locally was bloating up. @mhmohona mentioned that the command above takes some time to complete. We discussed the idea of trying to install the PyICU stuff with the `Brewfile` that we already have, i.e. `brew bundle install --file=Brewfile`.
@mhmohona might have to run something like the below actually - now that I look at it again. Didn't explicitly try it myself though:

```dockerfile
COPY Brewfile /app/
RUN brew bundle install --file=Brewfile && \
    export PATH="/opt/homebrew/opt/icu4c/bin:/opt/homebrew/opt/icu4c/sbin:$PATH" && \
    export PKG_CONFIG_PATH="$PKG_CONFIG_PATH:/opt/homebrew/opt/icu4c/lib/pkgconfig"
```
So it's the storage status -

I got an error if I add the `Brewfile`; keeping the original file gave me the following output -

And it's the image on Docker Hub - https://hub.docker.com/repository/docker/mhmohona/scribe-data-cli. I created it as a test and will delete it once everything is okay.

Trying to pull, and then will run it locally.
@wkyoshida, what do you think about it?
And it's the CLI in action -

I am not sure yet how to make just `scribe-data list` work instead of `docker run mhmohona/scribe-data-cli list`.
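One common trick - just a sketch, not something wired up in this PR - is a shell alias so the `docker run` boilerplate is hidden behind the CLI name:

```shell
# Make `scribe-data …` forward to the container; --rm removes the
# container after each run so they don't pile up.
alias scribe-data='docker run --rm mhmohona/scribe-data-cli'

# Now this behaves like the native CLI:
scribe-data list
```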
Hmm - it seems that the image ends up pretty big..

What is the error when trying to use the `Brewfile`? I'm mostly curious if we can get that working, as I suspect using it alternatively could perhaps help with image size.

With the `docker` CLI, you could also try inspecting details of the image and looking into the layer details to find out which layers are the large ones, i.e. responsible for the enlarged image size. With that information, we can go from there and modify the `Dockerfile` to address the responsible ones.
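For example, something along these lines with the `docker` CLI (the image name assumes the `scribe-data-cli` tag used above):

```shell
# Per-layer sizes for the built image:
docker history scribe-data-cli

# Same, without truncating the commands that created each layer:
docker history --no-trunc --format "{{.Size}}\t{{.CreatedBy}}" scribe-data-cli
```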
Got the following error -
It's the output of `docker history` -
```
@mhmohona ➜ /workspaces/Scribe-Data (docker) $ docker images
REPOSITORY                 TAG       IMAGE ID       CREATED      SIZE
scribe-data-cli            latest    39ac5a96734c   2 days ago   8.59GB
mhmohona/scribe-data-cli   latest    39ac5a96734c   2 days ago   8.59GB
@mhmohona ➜ /workspaces/Scribe-Data (docker) $ docker history 39ac5a96734c
IMAGE          CREATED       CREATED BY                                      SIZE      COMMENT
39ac5a96734c   2 days ago    ENTRYPOINT ["python" "src/scribe_data/cli/ma…   0B        buildkit.dockerfile.v0
<missing>      2 days ago    ENV PYTHONPATH=/app/src                         0B        buildkit.dockerfile.v0
<missing>      2 days ago    COPY /app /app # buildkit                       440MB     buildkit.dockerfile.v0
<missing>      2 days ago    COPY /usr/local/bin /usr/local/bin # buildkit   28.8MB    buildkit.dockerfile.v0
<missing>      2 days ago    COPY /usr/local/lib/python3.9 /usr/local/lib…   7.99GB    buildkit.dockerfile.v0
<missing>      2 days ago    WORKDIR /app                                    0B        buildkit.dockerfile.v0
<missing>      4 weeks ago   CMD ["python3"]                                 0B        buildkit.dockerfile.v0
<missing>      4 weeks ago   RUN /bin/sh -c set -eux;  savedAptMark="$(a…    10MB      buildkit.dockerfile.v0
<missing>      4 weeks ago   ENV PYTHON_GET_PIP_SHA256=6fb7b781206356f45a…   0B        buildkit.dockerfile.v0
<missing>      4 weeks ago   ENV PYTHON_GET_PIP_URL=https://github.com/py…   0B        buildkit.dockerfile.v0
<missing>      4 weeks ago   ENV PYTHON_SETUPTOOLS_VERSION=58.1.0            0B        buildkit.dockerfile.v0
<missing>      4 weeks ago   ENV PYTHON_PIP_VERSION=23.0.1                   0B        buildkit.dockerfile.v0
<missing>      4 weeks ago   RUN /bin/sh -c set -eux;  for src in idle3 p…   0B        buildkit.dockerfile.v0
<missing>      4 weeks ago   RUN /bin/sh -c set -eux;  savedAptMark="$(a…    31.3MB    buildkit.dockerfile.v0
<missing>      4 weeks ago   ENV PYTHON_VERSION=3.9.19                       0B        buildkit.dockerfile.v0
<missing>      4 weeks ago   ENV GPG_KEY=E3FF2839C048B25C084DEBE9B26995E3…   0B        buildkit.dockerfile.v0
<missing>      4 weeks ago   RUN /bin/sh -c set -eux;  apt-get update;  a…   9.21MB    buildkit.dockerfile.v0
<missing>      4 weeks ago   ENV LANG=C.UTF-8                                0B        buildkit.dockerfile.v0
<missing>      4 weeks ago   ENV PATH=/usr/local/bin:/usr/local/sbin:/usr…   0B        buildkit.dockerfile.v0
<missing>      4 weeks ago   /bin/sh -c #(nop)  CMD ["bash"]                 0B
<missing>      4 weeks ago   /bin/sh -c #(nop) ADD file:3d9897cfe027ecc7c…   74.8MB
```
```
<missing>      2 days ago    COPY /usr/local/lib/python3.9 /usr/local/lib…   7.99GB    buildkit.dockerfile.v0
```
Hmm.. yeah, this seems to be the big culprit (7.99GB). This would be the layer on L25 of the `Dockerfile`, and could be due to what is getting `pip install`ed on L15. It would be interesting to see how large the layer for the `pip install` is.

Notice though that we can't easily see the layers from before L21, which is when the second stage of the `Dockerfile` starts. The deepest layer that we see in this output is the layer on L22:
```
<missing>      2 days ago    WORKDIR /app                                    0B        buildkit.dockerfile.v0
```
One thing we could try for now is to actually remove the multi-stage aspect of the `Dockerfile` and just use a single stage. We'll then be able to more easily analyze which layers are heavy in the builder stage.
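As a side note, totaling the sizes from a `docker history` listing is easy to script; a small helper sketch (not part of the repo, and the size-string format is just what the output above shows):

```python
import re

# Convert docker-style size strings ("7.99GB", "28.8MB", "0B") into bytes.
UNITS = {"B": 1, "KB": 10**3, "MB": 10**6, "GB": 10**9}

def size_to_bytes(size: str) -> float:
    match = re.fullmatch(r"([\d.]+)(B|KB|MB|GB)", size.strip())
    if not match:
        raise ValueError(f"unrecognized size: {size!r}")
    value, unit = match.groups()
    return float(value) * UNITS[unit]

# The largest layers from the `docker history` output above:
layers = ["7.99GB", "440MB", "28.8MB", "74.8MB", "31.3MB"]
total_gb = sum(size_to_bytes(s) for s in layers) / 10**9
print(f"total: {total_gb:.2f} GB")  # -> total: 8.56 GB
```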
> Got the following error -
Yeah - what I had first sent you in Matrix is slightly off. Notice that it's using things like `GITHUB_ENV` and `GITHUB_PATH`. Those are from when we need to install PyICU in the GitHub workflow (here).

The below might be closer to what we need, but I haven't tested it myself:
```dockerfile
COPY Brewfile /app/
RUN brew bundle install --file=Brewfile && \
    export PATH="/opt/homebrew/opt/icu4c/bin:/opt/homebrew/opt/icu4c/sbin:$PATH" && \
    export PKG_CONFIG_PATH="$PKG_CONFIG_PATH:/opt/homebrew/opt/icu4c/lib/pkgconfig"
```
NOTE: This will likely need a step before it to install `brew` though.
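Something like the below might do it - totally untested though, and note that on Linux, Homebrew installs under `/home/linuxbrew/.linuxbrew` rather than the `/opt/homebrew` paths above, so those would need adjusting too:

```dockerfile
# Untested sketch: install Homebrew's prerequisites and then Homebrew itself
# in a Debian-based image, before the `brew bundle` step.
# Caveat: Homebrew refuses to run as root, so a non-root build user is needed.
RUN apt-get update && apt-get install -y curl git build-essential \
    && rm -rf /var/lib/apt/lists/*
RUN NONINTERACTIVE=1 /bin/bash -c \
    "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/HEAD/install.sh)"
ENV PATH="/home/linuxbrew/.linuxbrew/bin:/home/linuxbrew/.linuxbrew/sbin:${PATH}"
```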
Just a quick note for clarification -

All the layers deeper than

```
<missing>      2 days ago    WORKDIR /app                                    0B        buildkit.dockerfile.v0
```

are most likely those of the `python:3.9-slim` image itself that the `Dockerfile` is using as the base image.
Alrighty -
So in tweaking the `Dockerfile` and building the image myself, I'm seeing that the big culprit here really does appear to be the `pip install` step (this layer alone ends up at over 8GB in size :scream:). More specifically, the `tensorflow` and `torch` requirements, as they bring in pretty large dependencies. As I am understanding though..

- `tensorflow`: we're probably good to just scratch it off the `requirements.txt`. We are no longer making use of it, is this correct?
- `torch`: we are still using, or at least did still use, it for the translations that were generated recently. However, as the idea anyways is to transition to Wiktionary-based translations, we are perhaps good with removing it from the `requirements.txt` as well soon?
- bonus: as we're already having to look at the `requirements.txt`, are there any other dependencies listed that we no longer make use of that we could scratch off too?

CC: @andrewtavis
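One quick way to see which installed packages dominate that layer, from inside the container (the path assumes the `python:3.9-slim` base image):

```shell
# Largest packages in site-packages, biggest first:
du -sh /usr/local/lib/python3.9/site-packages/* | sort -rh | head -20
```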
Thank you for looking into it @wkyoshida.
I did also decide to actually go with removing the multi-stage aspect of the `Dockerfile`, @mhmohona :grin:

The second stage seemed to already be pretty much copying over most of what was being done in the `builder` stage anyways, so it made sense to me to simplify the `Dockerfile` in this way.
Points on the above:

> `tensorflow`: we're probably good to just scratch it off the `requirements.txt`. We are no longer making use of it, is this correct?

> `torch`: we are still using, or at least did still use, it for the translations that were generated recently. However, as the idea anyways is to transition to Wiktionary-based translations, we are perhaps good with removing it from the `requirements.txt` as well soon?

> bonus: as we're already having to look at the `requirements.txt`, are there any other dependencies listed that we no longer make use of that we could scratch off too?

`pyarrow`, `sentencepiece`, `tabulate` and `transformers` are currently not used or are being used in translation. We need to make a decision on how the "currently available data table" is being made, as I believe that that's why `tabulate` was in there. Specifically, we're planning on hosting the PNG on Scribe-Server, but are we generating it with Scribe-Data? How are we making that table PNG, anyway 🤔

Ya, we can remove it :)
:rocket: - removed with cb29d7b
.. and then from there I can remove the old translation process ..
Ya that sounds good!
For reference, I tested out building the image after removing the soon-to-be-scrapped dependencies. The `pip install` layer is currently at ~8GB. After removing..

- `tensorflow`: went from ~8GB to ~6GB
- `torch`: went from ~6GB to ~1GB
- `pyarrow`, `sentencepiece`, `tabulate` and `transformers`: went from ~1GB to ~800MB

So yea - I think once we're able to scrap `torch` (in particular), we'll be in better shape.
> How are we making that table PNG, anyway 🤔

I was thinking of leaving it entirely to Scribe-Server actually, so eventually we could even scrap `tabulate` too. I'd say there's no rush for it though. We can continue to use `tabulate` for this and keep it around until then. Its footprint is fairly small, so no concern with Docker if it's still around.
Contributor checklist
Description
Related issue
Fixes #150