lllyasviel / Fooocus

Focus on prompting and generating

[Bug]: ROCm Fooocus doesn't garbage collect allocated VRAM #3257

Open infinity0 opened 2 months ago

infinity0 commented 2 months ago

What happened?

On ROCm / amdgpu, Fooocus doesn't garbage collect used VRAM even after several hours. This means that other applications, such as other AI image generators, cannot use the VRAM and give "out of memory" errors.

Steps to reproduce the problem

  1. With ROCm and an AMD GPU, use Fooocus as normal: generate a few random pictures with default settings.
  2. Wait a few hours, then run `radeontop`. See that VRAM allocation is still many GB.
  3. Use some other AI tool such as InvokeAI: generate a few random pictures with default settings.
  4. See that this other AI tool gives out-of-VRAM errors.
  5. Close Fooocus and repeat (2, without waiting) and (3).
  6. VRAM allocation in `radeontop` is back down to normal levels, and the other AI tool succeeds.

What should have happened?

  1. Fooocus should release VRAM after it is finished generating images.

What browsers do you use to access Fooocus?

No response

Where are you running Fooocus?

Locally

What operating system are you using?

Debian GNU/Linux

Console logs

No relevant logs in Fooocus.

As described above, the behaviour is observed empirically via other means, i.e.

1. `radeontop` VRAM usage before and after shutting down Fooocus
2. console logs of {other AI tool} before and after shutting down Fooocus.
   - before: OutOfMemoryError, specifically for VRAM
   - after: works fine

Additional information

No response

infinity0 commented 2 months ago

Note this problem is unique to Fooocus/ROCm. With InvokeAI/ROCm, I can observe the VRAM being used as the image is generated, but it is correctly released after the generation is finished.

Fooocus, however, hangs onto the memory indefinitely (I waited literally days), preventing other AI tools from working. There is no way to force it to release the memory from the UI; the only way is to restart Fooocus. I'm using git master @ 5a71495822a11bbabf7c889eed6d9b38b261bb96, dated 2024-07-01.

infinity0 commented 2 months ago

Also, both InvokeAI and Fooocus are using PyTorch/ROCm, so what I am asking for is clearly possible. Someone more familiar with the code could probably have a look at how InvokeAI handles VRAM allocations, and port that into Fooocus.
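
For reference, the generic way PyTorch programs hand VRAM back to the driver is to drop their references to models/tensors and then flush the caching allocator. The sketch below is that generic pattern, not the actual InvokeAI code (someone would need to check its source for the specifics), but it shows the kind of call that would have to run after each generation:

```python
import gc

import torch

def release_vram() -> None:
    """Flush PyTorch's caching allocator once the caller has dropped its own
    references to models and intermediate tensors.

    ROCm builds of PyTorch expose the same torch.cuda API (backed by HIP),
    so the same calls apply on AMD GPUs.
    """
    gc.collect()                    # collect any lingering Python reference cycles
    if torch.cuda.is_available():   # also True on ROCm builds
        torch.cuda.empty_cache()    # hand cached, unused blocks back to the driver
        torch.cuda.ipc_collect()    # release memory kept alive for IPC handles
```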

mashb1t commented 2 months ago

I assume you're not using low vram mode, which would force unloading after generation (afaik). Fooocus keeps the model loaded depending on configuration and startup arguments. Please provide the startup command for further debugging, thanks.

infinity0 commented 2 months ago

I'm running `python3 entry_with_update.py`. The problem occurs with any of the flags.

Example usage: `12940M / 16165M VRAM 80.05%`, which goes back down to `1594M / 16165M VRAM 9.86%` after I close Fooocus.

Low VRAM mode (`--always-low-vram`) doesn't seem to help; I waited several minutes after generating and VRAM usage is still >50%.
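
For reference, the same numbers can be cross-checked from inside Python with a ROCm build of PyTorch, independently of `radeontop` (a quick sketch; `torch.cuda.mem_get_info` reports device-wide free/total memory in bytes):

```python
import torch

def vram_usage(device: int = 0) -> str:
    """Device-wide used/total VRAM, similar to what radeontop reports.

    Includes memory held by every process on the GPU, not just this one.
    """
    free_b, total_b = torch.cuda.mem_get_info(device)   # bytes (free, total)
    used_b = total_b - free_b
    return f"{used_b // 2**20}M / {total_b // 2**20}M VRAM ({100 * used_b / total_b:.2f}%)"

print(vram_usage())
```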

mashb1t commented 2 months ago

This is somewhat normal: some things are kept in cache / RAM / VRAM so that Fooocus can generate images faster next time, as they would otherwise have to be loaded again. There is also currently no offload button, but `--always-offload-from-vram` should work.

If you do not want this behaviour, you can change the code and try to manually trigger the offload after generation yourself.

https://github.com/lllyasviel/Fooocus/blob/main/ldm_patched/modules/model_management.py#L357
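
An untested sketch of such a manual trigger is below. I can't verify it on AMD, and it assumes the helpers in that file keep the names they inherit from the upstream ComfyUI code (`unload_all_models`, `soft_empty_cache`), so please double-check against the linked file:

```python
# Untested sketch: call this at the end of a generation task. It assumes the
# Fooocus fork keeps the helper names it inherits from the upstream ComfyUI
# model_management module; check the linked file for the exact signatures.
from ldm_patched.modules import model_management

def force_offload() -> None:
    model_management.unload_all_models()            # move loaded models off the GPU
    model_management.soft_empty_cache(force=True)   # flush the allocator cache
```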

I sadly don't have an AMD card and can't confirm the issue, so please connect with other community members who have one by opening a new discussion and referencing this issue. Thanks!

infinity0 commented 2 months ago

`--always-offload-from-vram` doesn't work.

mashb1t commented 2 months ago

I got that, but can't confirm for AMD as I don't have an AMD GPU. Please get in touch with other users by opening a new discussion.

infinity0 commented 2 months ago

Are you saying you don't believe bug reports until at least one other person has corroborated them? I don't see every issue being duplicated in "Discussions" this way, but alright, if you insist.

In the meantime, I've written a script to automatically restart Fooocus if there is no console output for 120 seconds. For Fooocus this needs to be run as `./timeout.py python -u entry_with_update.py`, as described in the docstring.

```python
#!/usr/bin/python
"""Run a command, except kill-and-re-run it if it doesn't produce stdout/stderr
within a given timeout.

If the command is a python script, you MOST LIKELY need to run it as `python -u`
for this wrapper to work properly, since python block-buffers its output by
default when it is not attached to a terminal.
"""
import select
import signal
import subprocess
import sys
import threading

import psutil

output_t_s = 120   # restart if there is no output for this many seconds
sigint_t_s = 10    # grace period after SIGINT
trmkil_t_s = 5     # grace period after SIGTERM, then again after SIGKILL
log_prefix = "================"

autorestart = True

def stop(subproc):
    global autorestart
    autorestart = True  # only autorestart if the process was stopped by us

    print(log_prefix, 'send SIGINT', subproc)
    subproc.send_signal(signal.SIGINT)
    # Send SIGINT to all child processes too; this matches the behaviour when
    # you ctrl-C in a shell, and is required for many complex programs to
    # interpret SIGINT in the expected way.
    for c in subproc.children(True):
        print(log_prefix, 'send SIGINT', c)
        c.send_signal(signal.SIGINT)

    # psutil.Popen.wait() raises psutil.TimeoutExpired rather than
    # subprocess.TimeoutExpired, so catch both.
    try:
        subproc.wait(timeout=sigint_t_s)
    except (psutil.TimeoutExpired, subprocess.TimeoutExpired):
        print(log_prefix, 'send SIGTERM')
        subproc.terminate()
        try:
            subproc.wait(timeout=trmkil_t_s)
        except (psutil.TimeoutExpired, subprocess.TimeoutExpired):
            print(log_prefix, 'send SIGKILL')
            subproc.kill()
            try:
                subproc.wait(timeout=trmkil_t_s)
            except (psutil.TimeoutExpired, subprocess.TimeoutExpired):
                pass

def run(args):  # run the command which is passed as a parameter to this script
    global autorestart
    autorestart = False  # don't autorestart unless we called stop()
    subproc = psutil.Popen(args, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
    stopper = None

    print(log_prefix, 'running', args, subproc)

    while subproc.returncode is None:
        rs, _, _ = select.select([subproc.stdout, subproc.stderr], [], [], output_t_s)
        for rf in rs:
            data = rf.read1(65536)
            buf = sys.stdout.buffer if rf is subproc.stdout else sys.stderr.buffer
            buf.write(data)
            buf.flush()
        if not rs and stopper is None:
            # no output within the timeout: stop the process from another
            # thread so we can keep draining its output here
            stopper = threading.Thread(target=lambda: stop(subproc))
            stopper.start()
        subproc.poll()  # refresh returncode in case the child exited on its own

    if stopper:
        stopper.join()

while autorestart:
    run(sys.argv[1:])
```

The code uses `select.select` so it works for both stdout and stderr, which is required for Fooocus.

infinity0 commented 2 months ago

I have asked the community here: https://github.com/lllyasviel/Fooocus/discussions/3258

mashb1t commented 2 months ago

> Are you saying you don't believe bug reports until at least one other person has corroborated them? I don't see every issue being duplicated in "Discussions" this way, but alright, if you insist.

> I got that, but can't confirm for AMD as I don't have an AMD GPU.

I also can't debug and/or fix this as I don't have the necessary hardware, so somebody else has to fix it. Asking the community is the next best thing to do, don't you agree?

infinity0 commented 2 months ago

The current code intentionally does not free memory on ROCm, with a comment "seems to make things worse on ROCm".

ldm_patched/modules/model_management.py#L769 - blame, original commit by @lllyasviel

I don't see that it makes anything "worse", so here is a PR that fixes that and makes ROCm behave the same as CUDA: https://github.com/lllyasviel/Fooocus/pull/3262

If @lllyasviel can remember what "worse" actually means, then here is an alternative, more conservative PR that forces the free only when the `--always-offload-from-vram` flag is given: https://github.com/lllyasviel/Fooocus/pull/3263
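
For context, here is a simplified reconstruction of the gate in question and of how the two PRs change it. This is the shape of the condition only, not the literal Fooocus diff:

```python
import torch

def soft_empty_cache_sketch(force: bool = False,
                            on_nvidia: bool = False,
                            always_offload: bool = False) -> None:
    """Simplified reconstruction of the vendor gate, not the literal code.

    Current code: flush only when `force or on_nvidia`; the ROCm path is
                  skipped with the comment "seems to make things worse on ROCm".
    #3262:        flush unconditionally, i.e. drop the gate so ROCm behaves
                  the same as CUDA.
    #3263:        keep the gate but also flush when --always-offload-from-vram
                  was given (modelled here as `always_offload`).
    """
    if not torch.cuda.is_available():
        return
    if force or on_nvidia or always_offload:   # the #3263 variant of the condition
        torch.cuda.empty_cache()
        torch.cuda.ipc_collect()
```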

infinity0 commented 2 months ago

With #3262, the current code will free memory between every image generation on ROCm - which is what's already happening on CUDA.

A more ideal behaviour would be to free the memory only after a timeout, so that we don't unnecessarily free it when we are about to immediately generate another image. However, the current code doesn't do this for CUDA or anything else, so I consider it out of scope for this issue.
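
For illustration only, such a timeout could be a debounced timer wrapped around whatever flush call ends up being used. The names below are placeholders, not existing Fooocus code:

```python
import gc
import threading
from typing import Optional

import torch

IDLE_FREE_SECONDS = 60.0                      # example value, not tuned
_idle_timer: Optional[threading.Timer] = None

def _flush_vram() -> None:
    gc.collect()
    if torch.cuda.is_available():
        torch.cuda.empty_cache()

def schedule_idle_flush() -> None:
    """Call at the end of every generation: restarts a timer so the cache is
    only flushed once no new generation has started for IDLE_FREE_SECONDS."""
    global _idle_timer
    if _idle_timer is not None:
        _idle_timer.cancel()                  # another generation just finished, push the deadline back
    _idle_timer = threading.Timer(IDLE_FREE_SECONDS, _flush_vram)
    _idle_timer.daemon = True                 # don't keep the process alive just for the flush
    _idle_timer.start()
```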

fkleon commented 1 month ago

Thanks @infinity0. I have also noticed in the past that VRAM is not freed while Fooocus is running, so I needed to shut it down when other applications wanted to make use of the GPU.

I've tried the fix from #3262 on my system with an RDNA2 card (ROCm 6.1, kernel 6.7) and it works perfectly fine so far.