styler00dollar / VSGAN-tensorrt-docker

Using VapourSynth with super resolution and interpolation models and speeding them up with TensorRT.
BSD 3-Clause "New" or "Revised" License

Docker build errors #69

Open abcnorio opened 5 months ago

abcnorio commented 5 months ago

Hi,

there were some errors in the Docker build:

(1)

363c365
<   CFLAGS=-fPIC meson setup -Dlink_static=true build && CFLAGS=-fPIC ninja -C build && ninja -C build install
---
>   CFLAGS=-fPIC meson setup build && CFLAGS=-fPIC ninja -C build && ninja -C build install

-> the '-Dlink_static=true' option does not exist anymore. Should it be '-Ddefault_library=static' instead? With that, however, it does not seem to build properly, because the *.so file is missing. For the time being I just removed the static switch, but that's not a real solution, right?

(2)

804c806
<   /workspace/Python-3.11.3/libpython3.so /usr/lib
---
>   /workspace/Python-3.11.3/libpython3.so /usr/lib/

-> error: to copy files target should be directory

(3)

822c824
< COPY --from=bestsource-lsmash-ffms2-vs /workspace/L-SMASH-Works/VapourSynth/build/libvslsmashsource.so /workspace/bestsource/build/libbestsource.so /usr/local/lib/vapoursynth
---
> COPY --from=bestsource-lsmash-ffms2-vs /workspace/L-SMASH-Works/VapourSynth/build/libvslsmashsource.so /workspace/bestsource/build/libbestsource.so /usr/local/lib/vapoursynth/

-> error: to copy files target should be directory

(4)

842c844,845
< COPY --from=TensorRT-ubuntu /usr/local/tensorrt/lib/libnvinfer_plugin.so* /usr/local/tensorrt/lib/libnvinfer_vc_plugin.so* /usr/local/tensorrt/lib/libnvonnxparser.so* /usr/lib/x86_64-linux-gnu/

-> error: COPY failed: no source files were specified -> it is unclear to me what is missing. Is this related to the 'libnvinfer_plugin.so.*' files that were deleted earlier? Around line 743:

  rm -rf /root/.cache/ /usr/local/lib/python3.11/site-packages/tensorrt_libs/libnvinfer.so.* /usr/local/lib/python3.11/site-packages/tensorrt_libs/libnvinfer_builder_resource.so.* \
    /usr/local/lib/python3.11/site-packages/tensorrt_libs/libnvinfer_plugin.so.* /usr/local/lib/python3.11/site-packages/tensorrt_libs/libnvonnxparser.so.*

But leaving this out does not resolve the problem.

(5)

901c904
< COPY --from=base /workspace/hotfix/* /workspace
---
> COPY --from=base /workspace/hotfix/* /workspace/

-> error: to copy files target should be directory

(6)

A warning could be added that parallel BuildKit builds do not work. I was able to try this on a dual-CPU Xeon machine and it failed at several points, but I had no time to find out how to prevent the failures.

Please correct, esp. (1) and (4).

Thanks!

styler00dollar commented 5 months ago

the '-Dlink_static=true' option does not exist anymore

Seems like bestsource removed it 3 days ago here. Need to adjust.
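
One way to guard the build step against this kind of upstream change is to check whether the option is still declared before passing it. The sketch below assumes the project declares its options in `meson_options.txt` or `meson.options` (the two file names Meson reads). The helper name `has_meson_option` is made up for illustration, and the grep is a rough textual match rather than a real parser.

```shell
# has_meson_option SRC_DIR NAME
# Returns 0 if SRC_DIR declares a Meson option called NAME, 1 otherwise.
# Hypothetical helper; it matches lines such as:
#   option('link_static', type: 'boolean', value: false)
has_meson_option() {
  src_dir=$1
  name=$2
  for f in "$src_dir/meson_options.txt" "$src_dir/meson.options"; do
    [ -f "$f" ] || continue
    if grep -Eq "option\([[:space:]]*'$name'" "$f"; then
      return 0
    fi
  done
  return 1
}

# Possible use in the bestsource build step (assumed to run in the source dir):
#   if has_meson_option . link_static; then
#     CFLAGS=-fPIC meson setup -Dlink_static=true build
#   else
#     CFLAGS=-fPIC meson setup build
#   fi
```

This only avoids the hard failure; whether the plugin should be linked statically some other way is a separate question.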

to copy files target should be directory

That slash doesn't matter because file commands detect if it is a folder. It might be easier to read for humans though. If it does throw an error for you and crashes, what exactly does your build env look like?
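
For reference, the Dockerfile documentation states that when COPY has more than one source, directly or through a wildcard, the destination must be a directory and end with a slash. The classic (non-BuildKit) builder enforces this, while BuildKit appears to be more lenient, which may explain why the same Dockerfile behaves differently on two machines. The sketch below is a rough check for such lines; the function name is made up for illustration, and it ignores line continuations and the JSON array form.

```shell
# find_copy_without_slash DOCKERFILE
# Prints "LINE: DEST" for every single-line COPY/ADD that has several sources
# (or a wildcard source) and a destination without a trailing slash.
# Hypothetical helper, not part of Docker.
find_copy_without_slash() {
  awk '
    toupper($1) == "COPY" || toupper($1) == "ADD" {
      n = 0; wildcard = 0
      for (i = 2; i <= NF; i++) {
        if ($i ~ /^--/) continue        # skip flags such as --from=base
        arg[++n] = $i
      }
      if (n < 2) next                   # need at least one source and a dest
      for (i = 1; i < n; i++)
        if (arg[i] ~ /[*?]/) wildcard = 1
      if ((n > 2 || wildcard) && arg[n] !~ /\/$/)
        print NR ": " arg[n]
    }
  ' "$1"
}

# Usage: find_copy_without_slash Dockerfile
```

Adding the trailing slash is harmless under either builder, so it is the portable choice.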

that's unclear to me what is missing - is this related to 'libnvinfer_plugin.so.*'

Hm, that looks odd. I delete some libs here https://github.com/styler00dollar/VSGAN-tensorrt-docker/blob/ac35e8dd92cfdcbc9db68e572527e86db0cf7cf3/Dockerfile#L740 because I just link the libs afterwards here https://github.com/styler00dollar/VSGAN-tensorrt-docker/blob/ac35e8dd92cfdcbc9db68e572527e86db0cf7cf3/Dockerfile#L886 to save space, since the files are the same. The files in /usr/local/tensorrt/lib/ should exist. They get moved here: https://github.com/styler00dollar/VSGAN-tensorrt-docker/blob/ac35e8dd92cfdcbc9db68e572527e86db0cf7cf3/Dockerfile#L503 It worked for me when I built it around a week ago.

A warning could be added that parallel BuildKit builds do not work. I was able to try this on a dual-CPU Xeon machine and it failed at several points, but I had no time to find out how to prevent the failures.

The only reason for it not to work should be running out of RAM. With 'DOCKER_BUILDKIT=1 docker build -t styler00dollar/vsgan_tensorrt:latest .' it builds multiple stages at once. The Dockerfile was made to work with 64 GB of RAM and thus can easily crash if not much RAM is available, but I never tested with multiple CPUs.

abcnorio commented 5 months ago

Thanks,

(1) build env:

64 GB RAM, AMD Ryzen 5 3600 6-Core Processor, Debian bullseye

  ii  docker             1.5-2                    all    transitional package
  ii  docker-clean       2.0.4-3                  all    simple Shell script to clean up the Docker Daemon
  ii  docker-compose     1.25.0-1                 all    Punctual, lightweight development environments using Docker
  ii  docker-doc         20.10.5+dfsg1-1+deb11u2  all    Linux container runtime -- documentation
  ii  docker-registry    2.7.1+ds2-7+deb11u1      amd64  Docker toolset to pack, ship, store, and deliver content
  ii  docker.io          20.10.5+dfsg1-1+deb11u2  amd64  Linux container runtime
  ii  docker2aci         0.17.2+dfsg-2.1+b5       amd64  CLI tool to convert Docker images to ACIs
  ii  python3-docker     4.1.0-1.2                all    Python 3 wrapper to access docker.io's control socket
  ii  python3-dockerpty  0.4.1-2                  all    Pseudo-tty handler for docker Python client (Python 3.x)
  ii  wmdocker           1.5-2                    amd64  System tray for KDE3/GNOME2 docklet applications

(2) Adding the slash did indeed make a difference, and the errors disappeared. I repeated that several times.

(3) Yes, I saw that you delete those files before adding symlinks later. How can I enter the build stage to inspect it manually via bash/shell at that stage? Sorry, I am not very familiar with Docker. I will redo this part of the build tomorrow (my CPU is not that fast, so it takes quite some time) and send the exact output at the point of failure.

(4) Regarding the 2x CPU Xeon -> I just wanted to see whether the same errors occurred, but other errors popped up, like

=> CACHED [ffmpeg-arch 19/42] RUN git clone https://github.com/webmproject/libvpx/ && cd libvpx && ./configure --enable-static --enable-vp9-highbitdepth --disable-shared --disable 0.0s
=> CACHED [ffmpeg-arch 20/42] RUN git clone https://code.videolan.org/videolan/x264.git && cd x264 && ./configure --enable-pic --enable-static && make -j$(nproc) install 0.0s
=> ERROR [tensorrt-ubuntu 9/20] RUN pip3 install /usr/local/tensorrt/python/tensorrt--cp311-.whl 1.0s
=> CANCELED [ffmpeg-arch 21/42] RUN git clone https://bitbucket.org/multicoreware/x265_git/ && cd x265_git/build/linux && cmake -G "Unix Makefiles" -DENABLE_SHARED=OFF -D HIGH_BIT 1.1s
=> CACHED [base 2/49] COPY nvidia_icd.json /etc/vulkan/icd.d/nvidia_icd.json 0.0s
=> CACHED [base 3/49] RUN apt-get update && apt-get install -y --no-install-recommends gnupg2 curl ca-certificates && curl -fsSL https://developer.download.nvidia.com/compute/c 0.0s
=> CACHED [base 4/49] RUN apt-get update && apt-get install -y --no-install-recommends cuda-12-1 cuda-cudart-12-1 cuda-compat-12-1 && rm -rf /var/lib/apt/lists/* 0.0s
=> CACHED [base 5/49] RUN echo "/usr/local/nvidia/lib" >>/etc/ld.so.conf.d/nvidia.conf && echo "/usr/local/nvidia/lib64" >>/etc/ld.so.conf.d/nvidia.conf 0.0s
=> CACHED [base 6/49] RUN apt-get update && DEBIAN_FRONTEND=noninteractive apt-get install -y --no-install-recommends libx11-xcb-dev libxkbcommon-dev libwayland-dev libxran 0.0s
=> CACHED [base 7/49] WORKDIR workspace 0.0s
=> CACHED [base 8/49] RUN apt update -y && apt install liblzma-dev libbz2-dev ca-certificates openssl libssl-dev libncurses5-dev libsqlite3-dev libreadline-dev libtk8.6 libgdm-dev 0.0s
=> CACHED [base 9/49] RUN update-alternatives --install /usr/bin/python python /usr/local/bin/python3.11 1 && update-alternatives --install /usr/bin/pip pip /usr/local/bin/pip3.1 0.0s
=> CACHED [base 10/49] RUN wget "https://bootstrap.pypa.io/get-pip.py" && python get-pip.py --force-reinstall 0.0s
=> CANCELED [base 11/49] RUN rm -rf Python-3.11.3 && tar -xf Python-3.11.3.tar.xz && cd Python-3.11.3 && CFLAGS=-fPIC ./configure --enable-shared --with-ssl --with-openssl-rpath=a 1.2s

[tensorrt-ubuntu 9/20] RUN pip3 install /usr/local/tensorrt/python/tensorrt--cp311-.whl:

56 0.853 /bin/sh: 1: pip3: not found


executor failed running [/bin/sh -c pip3 install /usr/local/tensorrt/python/tensorrt--cp311-.whl]: exit code: 127

This could be prevented by simply running the pip reinstall twice. It looked to me as if the build tried to use pip before it was installed (hence my idea that the parallel build is the cause).

Then the next error:

[...] executor failed running [/bin/sh -c apt install fftw3-dev python-is-python3 pkg-config python3-pip git p7zip-full autoconf libtool yasm ffmsindex libffms2-5 libffms2-dev -y && git clone https://github.com/sekrit-twc/zimg --depth 1 --recurse-submodules --shallow-submodules && cd zimg && ./autogen.sh && CFLAGS=-fPIC CXXFLAGS=-fPIC ./configure --enable-static --disable-shared && make -j$(nproc) && checkinstall -y -pkgversion=0.0 && apt install /workspace/zimg/zimg_0.0-1_amd64.deb -y]: exit code: 100

Btw, the environment here is probably bullseye as well. I am not an admin on that computer, I can only use it. I still have to find out how to use BuildKit without parallel builds. It looks like

COMPOSE_PARALLEL_LIMIT=1 [...]

did not work; it still looks like a parallel build. Parallel building does not work because too many things depend on each other. As I cannot write to /etc/docker/... on that computer, I have to find out how to disable parallel builds with BuildKit, which seems to parallelize automatically wherever possible.

best + thanks.

abcnorio commented 5 months ago

Update:

(1) copy error

Step 229/246 : COPY --from=TensorRT-ubuntu /usr/local/tensorrt/lib/libnvinfer_plugin.so* /usr/local/tensorrt/lib/libnvinfer_vc_plugin.so* /usr/local/tensorrt/lib/libnvonnxparser.so* /usr/lib/x86_64-linux-gnu/
COPY failed: no source files were specified

replace

/usr/local/tensorrt/lib/

by

/usr/local/tensorrt/targets/x86_64-linux-gnu/lib/

and then it works. It seems that 'docker build' somehow does not follow the symlink properly, as the first location is just a symlink to the second.

I inspected the intermediate stage and everything was built properly, so the symlink seems to be the cause.
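
The symlink explanation can be checked without a full rebuild. The sketch below recreates the layout described above in a scratch directory (assumption: `lib` is a symlink to `targets/x86_64-linux-gnu/lib`) and resolves it with `readlink -f`; the library file name is a placeholder. The commented commands show how the same check could be run inside the real stage, using the `--target` build flag to stop at that stage; the tag `trt-debug` is an arbitrary name.

```shell
# Recreate the assumed layout in a scratch directory.
root=$(mktemp -d)
mkdir -p "$root/tensorrt/targets/x86_64-linux-gnu/lib"
touch "$root/tensorrt/targets/x86_64-linux-gnu/lib/libexample.so"   # placeholder
ln -s targets/x86_64-linux-gnu/lib "$root/tensorrt/lib"

# readlink -f follows all symlinks and prints the real path, i.e. the path
# that can be used as the COPY source instead of the symlinked one.
real_lib=$(readlink -f "$root/tensorrt/lib")
echo "$real_lib"

# Against the real image: build only up to the stage, then inspect it.
#   docker build --target TensorRT-ubuntu -t trt-debug .
#   docker run --rm -it trt-debug bash
#   readlink -f /usr/local/tensorrt/lib      # run inside the container
```

Building with `--target` also gives a shell in the state just before the failing COPY, which is handy for checking that the expected files exist.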

(2) multiple cpus

Docker makes it hard to turn off parallelism, which is enabled by default, if you do not have buildx at hand, so I left that out. But the fact is that the Dockerfile does not work with parallel building, because things depend on each other in sequence, and it would probably require some rewriting to find out what can be built in parallel and what cannot.

With buildx and a buildkitd.toml config file, parallelism can be tuned (and therefore turned off). I could not try it for lack of admin rights on the server (it would require a complete Docker upgrade), but it should work in theory.
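
A minimal sketch of that configuration, untested here for the same reason. BuildKit's daemon config file (usually named buildkitd.toml) has a `max-parallelism` setting under `[worker.oci]` that limits how many build steps run at the same time. The config location and the builder name `serial-builder` are arbitrary choices, and the flag that passes the file has been called `--config` in older buildx releases and `--buildkitd-config` in newer ones, so `docker buildx create --help` should be checked first.

```shell
# Write a BuildKit daemon config that allows only one build step at a time.
# CONFIG_DIR is an assumed variable; a scratch directory is used if unset.
config_dir=${CONFIG_DIR:-$(mktemp -d)}
cat > "$config_dir/buildkitd.toml" <<'EOF'
[worker.oci]
  max-parallelism = 1
EOF
cat "$config_dir/buildkitd.toml"

# Create a builder that uses this config, then build with it (needs buildx):
#   docker buildx create --use --name serial-builder \
#     --driver docker-container --buildkitd-config "$config_dir/buildkitd.toml"
#   docker buildx build --load -t styler00dollar/vsgan_tensorrt:latest .
```

With a limit of 1 the stages run one after another, which should also lower peak memory use at the cost of a longer build.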

All in all, the build went fine with the changes mentioned above.

(3) further vps plugins

I will add some more plugins. If the build works, I will send you a diff file with those add-ons.

best + thanks.

PS: for general information the disk usage:

REPOSITORY            TAG                          IMAGE ID       CREATED          SIZE
vsgantensorrt         latest                       076a4aa18920   5 minutes ago    13.2GB
                                                   1409296bdfba   38 minutes ago   29.5GB
                                                   5f114124f0cb   42 hours ago     9.28GB
                                                   2bddc8250e39   42 hours ago     9.28GB
                                                   09959e24c20f   42 hours ago     9.28GB
                                                   04fe7b6e452c   43 hours ago     38.9GB
                                                   689ebefb0b16   47 hours ago     7.63GB
                                                   8a617091bb5a   47 hours ago     9.31GB
                                                   bbd5c8ca9582   2 days ago       16.3GB
                                                   1e06a11db5ae   2 days ago       19.6GB
                                                   84379595e659   3 days ago       17.8GB
archlinux/archlinux   latest                       43974225a80a   3 days ago       449MB
ubuntu                22.04                        ca2b0f26964c   5 weeks ago      77.9MB
nvidia/cuda           12.1.1-devel-ubuntu22.04     5ed6afba2273   5 months ago     7.03GB
nvidia/cuda           12.1.1-runtime-ubuntu22.04   0495908f9381   5 months ago     2.24GB