lllyasviel / Fooocus

Focus on prompting and generating
GNU General Public License v3.0

[Feature Request]: Docker built only with latest CUDA, offer variation for CUDA 12.3? #3009

Open ErroneousBosch opened 3 months ago

ErroneousBosch commented 3 months ago

What happened?

When attempting to run the new Docker image from the package repository on TrueNAS Dragonfish (which has its NVIDIA drivers locked to 545.23.08/CUDA 12.3), the image will not start:

Error: failed to create containerd task: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: error during container init: error running hook #0: error running hook: exit status 1, stdout: , stderr: Auto-detected mode as 'legacy' nvidia-container-cli: requirement error: unsatisfied condition: cuda>=12.4, please update your driver to a newer version, or use an earlier cuda container: unknown

Steps to reproduce the problem

  1. Run drivers that do not support CUDA 12.4
  2. Try to run Docker image
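The failure comes down to the driver branch capping the CUDA runtime the container toolkit will accept. The mapping below is a rough sketch of NVIDIA's compatibility tables; the exact minimum driver versions are assumptions, so check NVIDIA's release notes for your driver.

```python
# Hypothetical helper: map an NVIDIA driver branch to the highest CUDA
# runtime it supports. The branch -> CUDA pairs below are assumptions
# based on NVIDIA's published compatibility tables; verify against the
# release notes for your driver.
DRIVER_TO_MAX_CUDA = {
    535: "12.2",
    545: "12.3",  # TrueNAS Dragonfish ships 545.23.08
    550: "12.4",
}

def max_cuda_for_driver(driver_version: str) -> str:
    """Return the highest CUDA version the driver branch supports."""
    branch = int(driver_version.split(".")[0])
    # Fall back to the highest known branch at or below the driver's.
    supported = [b for b in sorted(DRIVER_TO_MAX_CUDA) if b <= branch]
    if not supported:
        raise ValueError(f"driver {driver_version} predates CUDA 12.x support")
    return DRIVER_TO_MAX_CUDA[supported[-1]]

print(max_cuda_for_driver("545.23.08"))  # "12.3" -> a cuda>=12.4 image fails
```

With driver 545 the toolkit's `cuda>=12.4` requirement check fails, which produces exactly the `unsatisfied condition` error in the log above.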

What should have happened?

Ideally, should have started.

What browsers do you use to access Fooocus?

Google Chrome

Where are you running Fooocus?

Locally with virtualization (e.g. Docker)

What operating system are you using?

TrueNAS Dragonfish

Console logs

No log because App does not start

Additional information

Offering this kind of variation may be helpful for anyone running in an orchestrated environment.

mashb1t commented 3 months ago

hey @ErroneousBosch, problem with older versions is that some packages are not compiled or compatible, e.g. torch / xformers. This would lead to increased maintenance effort and a diverging codebase for each release with separate CUDA versions.

You can always change the base image locally and build the container yourself, but I'm also open for your feedback.
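For anyone wanting to try this, the local rebuild is essentially a one-line change to the `FROM` line. A sketch (the upstream 12.4 tag shown in the comment is an assumption; check the actual Dockerfile in the repo before editing):

```dockerfile
# Sketch: pin the CUDA base image to 12.3 instead of 12.4.
# The original tag below is an assumption about the upstream Dockerfile.
# Before: FROM nvidia/cuda:12.4.1-base-ubuntu22.04
FROM nvidia/cuda:12.3.2-base-ubuntu22.04

# ... rest of the upstream Dockerfile unchanged ...
```

Then build and tag it locally, e.g. `docker build -t fooocus:cuda12.3 .`.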

EDIT 12.3 should be compatible with current package versions, but please double check.

ErroneousBosch commented 3 months ago

I did build a 12.3.2 version which I am using now. TrueNAS is a bit annoying in that it doesn't let you build locally, so mine is on DockerHub. I can confirm it builds and runs fine. Ironically, I couldn't build it before due to the locked package version of torch conflicting with xformers.

This probably is more a feature request/enhancement so that those of us stuck in a managed env don't have to watch and rebuild each new release.

ErroneousBosch commented 3 months ago

Actually, I do get a startup error on the 12.3.2 image:

WARNING[XFORMERS]: xFormers can't load C++/CUDA extensions. xFormers was built for:
    PyTorch 2.0.1+cu118 with CUDA 1108 (you have 2.1.0+cu121)
    Python  3.10.13 (you have 3.10.12)
  Please reinstall xformers (see https://github.com/facebookresearch/xformers#installing-xformers)
  Memory-efficient attention, SwiGLU, sparse and more won't be available.
  Set XFORMERS_MORE_DETAILS=1 for more details

Which is what I was seeing before. The app runs, but is slightly unstable. It seems relatively fast though, even on my 3060.

Edit: Looking in the 12.4.1 CUDA image, it has the same version of Python, so it may be throwing the same error. It looks like this is an issue with Ubuntu 22.04 not having updated its version of Python in 10 months, so I am not sure whether xformers is even working inside Docker.
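The warning above boils down to a version-tag comparison xformers makes at import time: the torch release and CUDA tag the wheel was built against must match what is running. A pure-Python sketch of that check (a hypothetical helper for illustration, not xformers' actual code):

```python
# Hypothetical sketch of the compatibility check behind the xformers
# warning: the wheel's build-time torch/CUDA tags must match the runtime.

def split_build_tag(version: str) -> tuple[str, str]:
    """Split a torch version string like '2.1.0+cu121' into
    (release, cuda_tag)."""
    release, _, local = version.partition("+")
    return release, local

def is_compatible(built_for: str, running: str) -> bool:
    """True only when both the torch release and its CUDA tag match."""
    return split_build_tag(built_for) == split_build_tag(running)

# The versions from the warning above:
print(is_compatible("2.0.1+cu118", "2.1.0+cu121"))  # False -> warning fires
```

Since the warning reports a Python mismatch too (3.10.13 vs 3.10.12), the wheel's C++ extensions are skipped entirely, not just degraded.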

ErroneousBosch commented 3 months ago

Dug around some more and there are a couple of barriers:

  1. the nvidia/cuda images only go up to Ubuntu 22.04
  2. Ubuntu 22.04 only officially goes up to Python 3.10.12 (so close!)
  3. There are no CUDA packages for Ubuntu 24.04

So instead I built a new Dockerfile that installs a newer version of Python, 3.11, from the deadsnakes PPA, and I can confirm it corrects the xformers warning. Feel free to steal/borrow what you want from it, or I can do a PR. I also rearranged the Dockerfile to be a little more efficient.
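The core of that change can be sketched as follows. Package names and the base tag are assumptions here; adapt them to the actual Fooocus Dockerfile.

```dockerfile
# Sketch: pull Python 3.11 from the deadsnakes PPA on an Ubuntu 22.04
# CUDA base, since the distro tops out at Python 3.10.12.
FROM nvidia/cuda:12.3.2-base-ubuntu22.04

RUN apt-get update && \
    apt-get install -y --no-install-recommends software-properties-common && \
    add-apt-repository -y ppa:deadsnakes/ppa && \
    apt-get install -y --no-install-recommends \
        python3.11 python3.11-venv python3.11-dev && \
    rm -rf /var/lib/apt/lists/*

# ... create a virtualenv with python3.11 and install requirements as before ...
```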

mrbrdo commented 3 months ago

I have the same problem. Also running on TrueNAS, and the latest supported version with the driver is CUDA 12.3. I'm not sure if Python 3.11 was necessary but I was able to get it running now with the above changes. It would be nice to have an image available with CUDA 12.3, since it seems 12.4 is not needed for it to work. I first tried to build the tag 2.3.1, but there seemed to be some issues with outdated library versions.

@ErroneousBosch did you have any problems with permissions of /content/app? I kept getting permission denied / read-only filesystem errors, so I temporarily mounted a dataset into /content/app and manually cloned the git repo into it. I also had to set TRANSFORMERS_CACHE and MPLCONFIGDIR to a dataset location for the same reason. I tried running as 0 (root) and setting PUID to 1000, but it did not help. What is your configuration (running user etc.) that allows using /content/app from inside the container? I am using the TrueCharts Custom App to deploy the Docker image, what about you?

ErroneousBosch commented 3 months ago

@mrbrdo Python 3.11 is needed to get xformers to work, since it wants one sub-point version past what Ubuntu 22.04 has, and the PPA doesn't have Python 3.10.13+. So far I have not had any issues with it, and the app seems faster than it was. If you want it all precompiled into an image, I have it up on DockerHub (same username).

I did run into the same issue. I also run with the TrueCharts custom-app instead of the built-in option, as it has the option to set the root filesystem as not read-only (SecurityContext -> advanced settings, make sure "ReadOnly Root Filesystem" is unchecked). At that point I only need /content/data as a mount. I also have runAsUser and runAsGroup set to 1000, since that is the ID of the internal container user that owns the /content/app directory.
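Since TrueCharts renders these settings into a Kubernetes securityContext, the relevant fields look roughly like this. The placement within the chart's values is an assumption; the field names themselves are standard Kubernetes.

```yaml
# Sketch of the TrueCharts custom-app security settings as Kubernetes
# securityContext fields (field placement in the chart is an assumption).
securityContext:
  runAsUser: 1000                  # matches the container user owning /content/app
  runAsGroup: 1000
  readOnlyRootFilesystem: false    # "ReadOnly Root Filesystem" unchecked
```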

ErroneousBosch commented 3 months ago

Testing on my machine, stability seems pretty good, and if anything it seems to run faster than it did before.

Edit: It will still sometimes crash on a model change, which it was doing before. It is pretty rare though, and given the addition of the "Reconnect" feature (thank you!), it seems like that is just a thing that happens.

ErroneousBosch commented 3 months ago

Stability is really rough when changing models if you are going a bit rapid-fire. It seems like there is some kind of settling period needed between changes, specifically if you are changing checkpoints. I'm not sure what that period is, but it's there, and I'm not sure how this compares to 12.4 or native.

mrbrdo commented 3 months ago

I've noticed the same, when changing models, I will often get an error. Unfortunately, reconnect does not work for me and I need to restart the container to continue working.

ErroneousBosch commented 3 months ago

@mrbrdo I end up just tapping it before the error message disappears, which seems to reset its counter each time. Eventually the container restarts and it reconnects.

I had this same issue back under 2.3.1, so not sure what it is.

ErroneousBosch commented 3 months ago

I tried it in debug mode, but all I get is that the process is killed:

/content/entrypoint.sh: line 33: 22 Killed python launch.py $*

This is identical to what I get if I try to run it without swap on, so I am wondering if it doesn't like the container being limited to 16GB of RAM? Swap is enabled, as the app won't start without it.

ErroneousBosch commented 3 months ago

What might be happening is that Python sees the full amount of system RAM, tries to use more than is allowed, and then crashes when it hits the memory limit. When I run free inside the container, it shows all of my RAM, not the limit, which supports this theory.

To test, I upped it to 32GB, which is half of my 64GB and more than what is usually available. I ran into no stability issues through 12+ rapid fire checkpoint swaps.

@mrbrdo I think that might be the hack for stability: Fooocus is a bit RAM hungry, and TrueNAS's resource limits are not visible to processes inside the container, so set the limit to more than what is usually available.

mrbrdo commented 3 months ago

I've had this Killed log message as well, especially when I was first deploying the container. I only have 32GB total and I also assigned a 16GB limit. I didn't enable swap though; how do you do that? Are you saying having swap enabled does not help?

ErroneousBosch commented 3 months ago

@mrbrdo Unless you have disabled it, swap should be on by default. I'd say set max memory to 32GB on your system. Fooocus wants to use whatever is available, and will use swap if it needs to.