nixified-ai / flake

A Nix flake for many AI projects
GNU Affero General Public License v3.0

InvokeAI AMD fails to build #82

Closed attilaolah closed 5 months ago

attilaolah commented 5 months ago

I'm running into some trouble trying to build the AMD version with Nix 2.18.1 on my Arch Linux host:

$ nix run github:nixified-ai/flake#invokeai-amd 
warning: ignoring untrusted substituter 'https://ai.cachix.org', you are not a trusted user.
Run `man nix.conf` for more information on the `substituters` configuration option.
warning: Ignoring setting 'auto-allocate-uids' because experimental feature 'auto-allocate-uids' is not enabled
warning: Ignoring setting 'impure-env' because experimental feature 'configurable-impure-env' is not enabled
error: builder for '/nix/store/ahzr7y32hk78b2022m83llmmwrz76939-python3.11-moto-4.2.6.drv' failed with exit code 1;
       last 10 log lines:
       >
       > tests/test_s3/test_s3_multipart.py::test_proxy_mode
       >   /nix/store/ghh2dgix6cfhwglxab5aar02y4qa73xb-python3.11-pytest-7.4.3/lib/python3.11/site-packages/_pytest/python.py:198: PytestReturnNotNoneWarning: Expected None, but tests/test_s3/test_s3_multipart.py::test_proxy_mode returned False, which will be an error in a future version of pytest.  Did you mean to use `assert` instead of `return`?
       >     warnings.warn(
       >
       > -- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html
       > =========================== short test summary info ============================
       > FAILED tests/test_resourcegroupstaggingapi/test_resourcegroupstagging_glue.py::test_glue_jobs - AssertionError: assert ['sg-name', '...id', 'bc2080'] == ['bc2080']
       > ===== 1 failed, 7946 passed, 8 skipped, 1174 warnings in 449.34s (0:07:29) =====
       > /nix/store/wr08yanv2bjrphhi5aai12hf2qz5kvic-stdenv-linux/setup: line 1559: pop_var_context: head of shell_variables not a function context
       For full logs, run 'nix log /nix/store/ahzr7y32hk78b2022m83llmmwrz76939-python3.11-moto-4.2.6.drv'.
error: 1 dependencies of derivation '/nix/store/zvkgh88afhnakiw4x453lmf7zxr4gkzq-python3.11-tensorboardx-2.6.2.drv' failed to build
error: 1 dependencies of derivation '/nix/store/fsn9zzxk3dwdg2i77rdv608fnfri941h-python3.11-pytorch-lightning-1.9.3.drv' failed to build
error: 1 dependencies of derivation '/nix/store/qq86mhm76aqvxcblv993snishwabva8d-python3.11-InvokeAI-3.3.0post3.drv' failed to build
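
Because the failing test differs between runs, it can help to pull just the failed test IDs out of each log and compare them. A minimal sketch, assuming the log is piped in on stdin; `failed_tests` is a hypothetical helper name, not part of Nix or this flake:

```shell
# Print the test IDs that pytest reported as FAILED in a build log on stdin.
failed_tests() {
  # Nix prefixes quoted log lines with "> "; strip that prefix (and any
  # leading indentation) if present, then keep only pytest's FAILED lines.
  sed -e 's/^[[:space:]]*>[[:space:]]*//' \
    | grep '^FAILED ' \
    | awk '{ print $2 }'
}
```

For example, `nix log /nix/store/ahzr7y32hk78b2022m83llmmwrz76939-python3.11-moto-4.2.6.drv | failed_tests` should list only the test IDs from the short test summary.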

Interestingly, trying it a few times gives me different errors, although I suppose that's just due to parallel builds racing each other. In fact, after two failures, the third attempt seems to have been compiling for hours. Maybe I somehow ended up with a cache miss there, and it will just take more time to get to the part where it fails?

My /etc/nix/nix.conf looks like this (comments stripped):

build-users-group = nixbld
max-jobs = auto
experimental-features = nix-command flakes
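
The "ignoring untrusted substituter" warning in the log above suggests that the https://ai.cachix.org binary cache is being skipped, which would explain long local builds. One possible fix is to list the invoking user under `trusted-users` in this file. A minimal sketch, assuming the substituter setting comes from the flake's own configuration and that the Nix daemon is restarted afterwards; `trusted_users_line` is a hypothetical helper that only prints the line to add:

```shell
# Emit the nix.conf line that marks the given users as trusted.
# Hypothetical helper: the name and interface are illustrative only.
trusted_users_line() {
  # "$*" joins all arguments with single spaces, e.g. "root alice".
  printf 'trusted-users = %s\n' "$*"
}
```

Something like `trusted_users_line root "$USER" | sudo tee -a /etc/nix/nix.conf` would append it. Note that a trusted user can effectively do anything root can do through the daemon, so this is a trade-off rather than a free fix.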

Even though I have sandboxing enabled, I'm not sure whether I should really trust it, so my plan is to go ahead and retry the whole thing inside a Docker container:

$ docker run --name nixified-ai -it nixos/nix:latest
$ nix-channel --update
$ echo experimental-features = nix-command flakes >> /etc/nix/nix.conf
$ nix run github:nixified-ai/flake#invokeai-amd

So far I haven't gotten there. It would also be nice to have a rough idea of how long the build should take. I'm running it on a 20-core host with 128G RAM and not doing much else, yet I'm not even getting a decent progress report. I'm hesitant to restart it, since I don't know whether there is a local cache it can pick up from and continue.

attilaolah commented 5 months ago

OK I think this might actually be a duplicate of #13 or #18.

MatthewCroughan commented 5 months ago

No, your computer is just too slow to run the tests, you've hit a race condition in the pytorch test suite.

attilaolah commented 5 months ago

That's a bit sad, I guess I just have to retry it a few times then. For now it seems to be building, I've got a single clang++ process that's been up for 20+ minutes (the rest seem to have completed).
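
Retrying can be automated. Derivations that already built successfully stay in the Nix store, so each new attempt should only redo the ones that failed. A minimal sketch of a retry loop in POSIX shell; `retry` is a hypothetical helper, and the attempt count is arbitrary:

```shell
# Run a command up to $1 times, stopping at the first success.
# Returns 0 on success, 1 if every attempt failed.
retry() {
  attempts=$1
  shift
  i=1
  while [ "$i" -le "$attempts" ]; do
    if "$@"; then
      return 0
    fi
    echo "attempt $i of $attempts failed" >&2
    i=$((i + 1))
  done
  return 1
}
```

Usage might look like `retry 3 nix build github:nixified-ai/flake#invokeai-amd --keep-going`, where `--keep-going` lets unrelated derivations continue building after one of them fails.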

attilaolah commented 5 months ago

OK, after a few tries, it built and started InvokeAI. Then it failed to download the models the first time, but after starting fresh, it seems to have worked the second time.

I don't mean to be disrespectful here, but for something that is supposed to be reproducible, this sure took a bunch of trial and error. But in the end it worked, so, yay.

MatthewCroughan commented 5 months ago

> I don't mean to be disrespectful here, but for something that is supposed to be reproducible

You may need to understand and learn the difference between upstream issues and issues with this project.

The race condition is reproducible on a slow machine (an issue with the pytorch test suite). The issues with downloading models and first-time startup come from InvokeAI's own Python code.

All this repo does is give you somebody else's code and allow you to run it the same way twice. It doesn't mean their code is good, or bug free. You do have a misunderstanding.

attilaolah commented 5 months ago

Yeah, I understand that the race conditions in the test suite have nothing to do with how this project creates a reproducible environment for third party code. I've dealt with third party packages myself enough times, and honestly I usually just go the lazy route and disable the tests altogether.

Again, I'm super happy about all the Nix code here, but I do see some irony in the fact that I came here for a reproducible environment (after having some issues with ROCm packages elsewhere), and then on the first invocation I ran into flaky tests. I didn't mean to say anything bad about your code here.

As for downloading the models, that's also InvokeAI's code, but again, somehow it bailed out the first time and then succeeded the second time? I suspect it stored some hidden config somewhere in my home dir, so the second time around it just did something differently. Or maybe that also races. Oh well.

attilaolah commented 5 months ago

In the end, the web UI starts, but trying to generate any output crashes the server. I'm hitting this issue: https://github.com/AUTOMATIC1111/stable-diffusion-webui/discussions/11939

I'm not sure if this is due to my GPU (gfx1010) not being supported or some other problem. I'll try with a smaller model, maybe.
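
A workaround often reported for consumer AMD cards that ROCm does not officially support is the `HSA_OVERRIDE_GFX_VERSION` environment variable, which asks the ROCm runtime to treat the GPU as a different target. Whether this helps on gfx1010, and whether `10.3.0` is the right value for it, is an assumption and not something confirmed in this thread. A minimal sketch, with `with_gfx_override` as a hypothetical wrapper:

```shell
# Run a command with HSA_OVERRIDE_GFX_VERSION set for that command only.
# The variable is not exported into the calling shell.
with_gfx_override() {
  version=$1
  shift
  env HSA_OVERRIDE_GFX_VERSION="$version" "$@"
}
```

An untested invocation would be `with_gfx_override 10.3.0 nix run github:nixified-ai/flake#invokeai-amd`.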

MatthewCroughan commented 5 months ago

You are not going to like the amount of issues you will encounter with AMD GPUs.

MatthewCroughan commented 5 months ago

You're using Arch, so I can't help you. Whereas if you use NixOS I can give you a few lines of code to correctly define the AMD GPU drivers.

attilaolah commented 5 months ago

OK thanks, I'll eventually set up NixOS on this machine. I'll probably just give up on the current system then, until I migrate to NixOS (for that I need to figure out how to port my current cryptsetup+LVM configs).

The annoying thing is that I do have two NVIDIA GPUs as well, but on this machine I opted for an AMD one so that window managers like Sway & Hyprland would load properly, since they don't seem to support the proprietary drivers. Since then I've seen that Nix has patches for Hyprland, so maybe back to NVIDIA I should go, but then what do I do with this otherwise perfectly good GPU?

attilaolah commented 4 months ago

I have finally converted my workstation fully to NixOS (with flakes + home manager). I'm going to go ahead and try this once more one of these days and report back.

githubedu commented 3 months ago

Hi @attilaolah, did you get it working?

I came here because I got the same error.

I know the CPU is not powerful (it's a rig previously used for mining), but the GPUs should be fine: there is an AMD Vega 64 and an Nvidia 3090.

I installed NixOS and tried to use the AMD GPU, then got the error below. My next attempt will be with Nvidia, after I figure out how to get the drivers installed in NixOS.

$ nix run github:nixified-ai/flake#invokeai-amd --extra-experimental-features nix-command --extra-experimental-features flakes
warning: ignoring untrusted substituter 'https://ai.cachix.org', you are not a trusted user.
Run `man nix.conf` for more information on the `substituters` configuration option.
[4/36/102 built, 57 copied (17279.0/17279.2 MiB), 2829.1 MiB DL] building python3.11-tokenizers-0.14.1 (buildPhase): 📦 Built wheel for CPython 3.11 to /build/source/bindings/python/target/wh
[4/39/102 built, 57 copied (17279.0/17279.2 MiB), 2829.1 MiB DL] building onnxruntime-1.15.1 (buildPhase): [ 12%] Building CXX object _deps/abseil_cpp-build/absl/debugging/CMakeFiles/debugg
error: builder for '/nix/store/ahzr7y32hk78b2022m83llmmwrz76939-python3.11-moto-4.2.6.drv' failed with exit code 1;
       last 10 log lines:
       >
       > tests/test_s3/test_s3_multipart.py::test_proxy_mode
       >   /nix/store/ghh2dgix6cfhwglxab5aar02y4qa73xb-python3.11-pytest-7.4.3/lib/python3.11/site-packages/_pytest/python.py:198: PytestReturnNotNoneWarning: Expected None, but tests/test_s3/test_s3_multipart.py::test_proxy_mode returned False, which will be an error in a future version of pytest.  Did you mean to use `assert` instead of `return`?
       >     warnings.warn(
       >
       > -- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html
       > =========================== short test summary info ============================
       > FAILED tests/test_core/test_moto_api.py::TestModelDataResetForClassDecorator::test_should_find_bucket - assert [<moto.ec2.mo...7fffeb09ded0>] == []
       > ==== 1 failed, 7946 passed, 8 skipped, 1142 warnings in 5992.35s (1:39:52) =====
       > /nix/store/wr08yanv2bjrphhi5aai12hf2qz5kvic-stdenv-linux/setup: line 1559: pop_var_context: head of shell_variables not a function context
       For full logs, run 'nix-store -l /nix/store/ahzr7y32hk78b2022m83llmmwrz76939-python3.11-moto-4.2.6.drv'.
error: 1 dependencies of derivation '/nix/store/zvkgh88afhnakiw4x453lmf7zxr4gkzq-python3.11-tensorboardx-2.6.2.drv' failed to build
error: 1 dependencies of derivation '/nix/store/fsn9zzxk3dwdg2i77rdv608fnfri941h-python3.11-pytorch-lightning-1.9.3.drv' failed to build
error (ignored): error: cannot unlink '/tmp/nix-build-python3.11-torch-2.0.1.drv-0/source': Directory not empty
error: 1 dependencies of derivation '/nix/store/qq86mhm76aqvxcblv993snishwabva8d-python3.11-InvokeAI-3.3.0post3.drv' failed to build

attilaolah commented 3 months ago

No I didn't. But I believe I got an error that is different than yours the last time I tried (on NixOS). I still have the build in cache, so now if I re-run I get the error immediately:

$ nix run github:nixified-ai/flake#invokeai-amd
warning: ignoring untrusted substituter 'https://ai.cachix.org', you are not a trusted user.
Run `man nix.conf` for more information on the `substituters` configuration option.
2024-03-16 10:38:36.198394561 [W:onnxruntime:Default, onnxruntime_pybind_state.cc:1827 CreateInferencePybindStateModule] Init provider bridge failed.
[2024-03-16 10:38:40,216]::[InvokeAI]::INFO --> Patchmatch initialized
/nix/store/knqd0zgkmj3pajqcmh785qc6m8hjf0hc-python3.11-torchvision-0.15.2/lib/python3.11/site-packages/torchvision/transforms/functional_tensor.py:5: UserWarning: The torchvision.transforms.functional_tensor module is deprecated in 0.15 and will be **removed in 0.17**. Please don't rely on it. You probably just need to use APIs in torchvision.transforms.functional or in torchvision.transforms.v2.functional.
  warnings.warn(

An exception has occurred: /home/ao/invokeai/models/core/convert/CLIP-ViT-bigG-14-laion2B-39B-b160k is missing
== STARTUP ABORTED ==
** One or more necessary files is missing from your InvokeAI root directory **
** Please rerun the configuration script to fix this problem. **
** From the launcher, selection option [7]. **
** From the command line, activate the virtual environment and run "invokeai-configure --yes --skip-sd-weights" **
** (To skip this check completely, add "--ignore_missing_core_models" to your CLI args. Not installing these core models will prevent the loading of some or all .safetensors and .ckpt files. However, you can always come back and install these core models in the future.)
Press any key to continue...

Looks like it is complaining about a missing model, although I believe the flake should download all the models? But at least the build itself completes for me. You may want to try to run the build several times, since there were race conditions in the PyTorch tests or somewhere, the last time I tried.


EDIT: For now, I'll just try to manually clone the models from HuggingFace into the expected directory to see if this is going to work.

attilaolah commented 3 months ago

Even after fetching the required models from HuggingFace, I still get the error (except for the line about the missing models). The message suggests re-running invokeai-configure, which I'm not sure how to do. Maybe it works by running nix develop on the flake, but I haven't tried that yet.
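
Before launching again, it may be worth checking that the directory named in the error actually exists and is not empty. A minimal sketch, assuming the InvokeAI root is passed in as an argument (the log above shows it at /home/ao/invokeai); `core_model_present` is a hypothetical helper:

```shell
# Succeed only if the core model directory named in the startup error exists
# under the given InvokeAI root and contains at least one entry.
core_model_present() {
  root=$1
  dir="$root/models/core/convert/CLIP-ViT-bigG-14-laion2B-39B-b160k"
  [ -d "$dir" ] && [ -n "$(ls -A "$dir")" ]
}
```

For example, `core_model_present "$HOME/invokeai" || echo "core model missing"`. As for the configure step, one untested possibility is `nix shell github:nixified-ai/flake#invokeai-amd -c invokeai-configure --yes --skip-sd-weights`. This assumes the package puts `invokeai-configure` on the PATH, which nothing in this thread confirms.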