macOS: `ModuleNotFoundError: No module named 'ldm'`

psychedelicious commented 2 years ago

Describe your environment

GPU: mps
VRAM: 32 GB unified memory
CPU arch: arm64
OS: macOS 12.5.1
Python: mamba
Branch: development 810112577fb0ec352a9f51b6e796cbdcef9fcac4

Describe the bug

On a fresh environment, running python scripts/dream.py fails:

Traceback (most recent call last):
  File "/Users/spencer/Documents/Code/stable-diffusion/scripts/dream.py", line 10, in <module>
    import ldm.dream.readline
ModuleNotFoundError: No module named 'ldm'

Reported by i3oc9i on discord, who found that the issue appears at this commit and did much of the troubleshooting: e33ed45cfc030a8a454fcc49d0b6ebf991a7e079 -Delete redundant backquotes #665

I had an environment from before that commit. After updating to development branch commit 810112577fb0ec352a9f51b6e796cbdcef9fcac4, I ran mamba env update -f environment-mac.yaml and now my environment is broken.

If I make a new env after checking out a commit prior to the problematic one, the issue no longer occurs.

That problematic commit adds - protobuf==3.20.1 to environment-mac.yaml. Removing that line and making a fresh environment, leaving everything else the same, does not fix the issue.
If you move dream.py to the project root, it works.
If you delete the line import ldm.dream.readline, you get the same error referring to the next line.
If you open the python REPL, you can import ldm.dream.readline just fine

So it seems that the current working directory is somehow not passed to dream.py.

I haven't the slightest clue what could cause this.
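
For context on the REPL observation above: when Python runs a script file, the first entry of sys.path is the directory containing that script, not the current working directory, while an interactive session gets '' (the current directory) instead. The sketch below reproduces the symptom in a throwaway directory. The package name is a hypothetical stand-in for ldm, and nothing here explains why the import used to work before the problematic commit.

```python
import os
import subprocess
import sys
import tempfile
from pathlib import Path

PACKAGE = "demo_pkg_for_path_check"  # hypothetical stand-in for `ldm`


def run_from_scripts_dir() -> subprocess.CompletedProcess:
    """Create <root>/<PACKAGE>/ and <root>/scripts/run.py, then run the
    script from <root>, the same way `python scripts/dream.py` is run."""
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        (root / PACKAGE).mkdir()
        (root / PACKAGE / "__init__.py").write_text("")
        (root / "scripts").mkdir()
        (root / "scripts" / "run.py").write_text(
            "import sys\n"
            "print(sys.path[0])\n"
            f"import {PACKAGE}\n"
        )
        # Drop variables that would change how sys.path is built.
        env = {k: v for k, v in os.environ.items()
               if k not in ("PYTHONPATH", "PYTHONSAFEPATH")}
        return subprocess.run(
            [sys.executable, "scripts/run.py"],
            cwd=root, env=env, capture_output=True, text=True,
        )


result = run_from_scripts_dir()
print(result.stdout.strip())  # sys.path[0] as seen by the script
print(result.stderr.strip())  # traceback from the failed import
```

The script sees its own scripts directory as sys.path[0], so the package sitting next to scripts/ is only importable if something else (an installed package, a .pth file, or an explicit sys.path change) puts the project root on the path.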

psychedelicious commented 2 years ago

i3oc9i also found that the github workflow is stuck with the same error: https://github.com/invoke-ai/InvokeAI/actions/runs/3094254306/jobs/5007435455

psychedelicious commented 2 years ago

Ok, so the project root (the current working directory) is not being added to sys.path when running python scripts/dream.py. It also affects running other scripts, e.g. python backend/server.py.

Manually adding the project root (e.g. sys.path.append("/Users/spencer/Documents/Code/stable-diffusion/")) or "" (sys.path.append("")) fixes it.
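
As a sketch of the same workaround without a hard-coded path or a dependency on the current working directory, the directory can be derived from the script's own location. Both helper names below are hypothetical, and the default of one level up assumes the script lives in a subdirectory directly under the project root, as scripts/dream.py does.

```python
import sys
from pathlib import Path


def project_root_for(script_file: str, levels_up: int = 1) -> Path:
    """Return the directory `levels_up` levels above the script's own directory."""
    return Path(script_file).resolve().parents[levels_up]


def ensure_on_sys_path(directory: Path) -> bool:
    """Put `directory` at the front of sys.path; return False if already present."""
    entry = str(directory)
    if entry in sys.path:
        return False
    sys.path.insert(0, entry)
    return True


# In a script such as scripts/dream.py, this would have to run before any
# `import ldm...` line:
# ensure_on_sys_path(project_root_for(__file__))
```

Unlike sys.path.append("."), this keeps working when the script is launched from a different directory, at the cost of assuming a fixed repository layout.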

I have tried to figure out what changed to cause this issue but am at a loss.

Resident pythonista @tildebyte ?

lstein commented 2 years ago

It's not happening on my linux system, but this issue has come up before in a Windows-specific fashion.

Two things to try:

pip install -e . (don't forget the dot at the end)
In dream.py, move the line sys.path.append('.') to the line right after import sys at the very top.

I have no freakin' idea why that commit got so messed up. The original PR was just a typographical fix to a single file, but instead ended up pulling in changes from @psychedelicious 's cleanup of the web stuff. I am learning to not click on the PR button that says "update branch" when I see the message "This branch is out-of-date with the base branch." I think the right thing to do is to rebase - would like @tildebyte 's advice too.

psychedelicious commented 2 years ago

~pip install -e . did fix my environment, thanks!~

Edit: It worked for one run of the script, but then it stopped working. I had to move the sys.path.append('.') up before any local imports.

Unfortunately the issue is reproducible and happens on a fresh conda install after that problematic commit. What could have changed to cause this issue?

Moving the sys.path.append('.') would also work, but given that the line hadn't changed when the problematic commit landed, I feel like moving it is bypassing an issue somewhere else.

lstein commented 2 years ago

I think the key clue here is that pip install -e . worked once, and then it failed on subsequent tries. Something that happens during that first run of the dream script is stably modifying the local environment. I can't imagine what it could be.

One way to help reduce the problem search space is to first confirm that pip install -e . fixes the problem once. Then try launching the backend/server.py script and see if the problem occurs there as well. If it does, we can assume that whatever the problem is, it is occurring in the common modules shared by the dream script and the web backend.
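
A quick way to confirm whether the install took effect, without launching either script, is to ask the import system where a top-level module would come from. This is a minimal sketch using the standard library's importlib.util.find_spec; the helper name is hypothetical. Note that it has to be run from a file inside scripts/ (not from the REPL or python -c) to see the same sys.path that dream.py sees.

```python
import importlib.util


def describe_module(top_level_name: str) -> str:
    """Say where a top-level module would be loaded from on the current sys.path."""
    spec = importlib.util.find_spec(top_level_name)
    if spec is None:
        return "not importable"
    # Namespace packages have a spec but no single origin file.
    return spec.origin or "namespace package (no single file)"


print("ldm ->", describe_module("ldm"))
```

Running this once right after pip install -e . and again after the first failing run would show whether the module really disappears from the search path between runs.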

It also might be helpful to insert print(sys.path) at various strategic locations in the code. In my environment, before running dream.py or anything like that, it looks like this:

['', '/u/lstein/projects/SD/stable-diffusion', '/usr/share/pyshared', '/u/lstein/.conda/envs/ldm/lib/python39.zip',
 '/u/lstein/.conda/envs/ldm/lib/python3.9', '/u/lstein/.conda/envs/ldm/lib/python3.9/lib-dynload', 
'/u/lstein/.conda/envs/ldm/lib/python3.9/site-packages', '/u/lstein/projects/SD/stable-diffusion/src/gfpgan', 
'/u/lstein/projects/SD/stable-diffusion/src/clip', '/u/lstein/projects/SD/stable-diffusion/src/taming-transformers', 
'/u/lstein/projects/SD/stable-diffusion/src/k-diffusion']

I bet at some point during the first run of dream.py, the entry that points at the stable-diffusion directory will disappear.
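
To make those print(sys.path) checkpoints easier to compare, something like the following could be dropped in instead. It is only a sketch with hypothetical helper names: it reports which sys.path entries actually contain a given package directory, so a disappearing entry stands out.

```python
import sys
from pathlib import Path
from typing import List, Optional


def entries_providing(package: str, search_path: Optional[List[str]] = None) -> List[str]:
    """Return the path entries that contain `package` as a directory or a .py file."""
    hits = []
    for entry in (sys.path if search_path is None else search_path):
        base = Path(entry or ".")  # '' means the current working directory
        if (base / package).is_dir() or (base / f"{package}.py").is_file():
            hits.append(entry)
    return hits


def report(label: str, package: str = "ldm") -> None:
    """Print one line per checkpoint, e.g. report('before imports')."""
    print(f"[{label}] {package} found via: {entries_providing(package) or 'nothing'}")
```

Calling report("start of dream.py") at the top of the script and again after each block of imports would narrow down where, if anywhere, the entry is lost.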

psychedelicious commented 2 years ago

Hmm. The scripts worked once then I got the same error. After that pip install -e . didn't work. I wiped out all my envs and started fresh from development and pip install -e . didn't fix the error. Maybe wiping out my local copy of the repo and starting totally fresh will affect it.

Anyways, I'll revisit the issue tomorrow. For now I've just moved the sys.path.append(".") above all of the ldm imports.

holstvoogd commented 2 years ago

I've done a bit of checking with git bisect, it seems 7b0cbb34d618098b4072f14870937ee9eb4369a1 causes this issue

EDIT: Tried narrowing it down more, but my python knowledge is roughly 0, so I'll leave that to someone who speaks python ;)
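
For anyone repeating the bisect, the good/bad decision can be automated with git bisect run, which treats exit code 0 as good, 125 as "skip this commit", and other codes from 1 to 127 as bad. The sketch below assumes a hypothetical, untracked probe file scripts/import_probe.py containing only import ldm. It does not rebuild the conda environment at each step, which this particular bug appears to require, so that part would still have to be done by hand or scripted separately.

```python
import subprocess
import sys

# Exit codes that `git bisect run` interprets: 0 = good, 1 = bad, 125 = skip.
GOOD, BAD, SKIP = 0, 1, 125


def classify(returncode: int, stderr: str, module: str = "ldm") -> int:
    """Translate one probe run into a bisect verdict."""
    if returncode == 0:
        return GOOD
    if f"No module named '{module}'" in stderr:
        return BAD
    return SKIP  # failed for an unrelated reason, so don't blame this commit


def probe(script: str = "scripts/import_probe.py") -> int:
    """Run the (hypothetical) probe script the same way dream.py is run."""
    proc = subprocess.run([sys.executable, script], capture_output=True, text=True)
    return classify(proc.returncode, proc.stderr)
```

Saved to a file that ends with sys.exit(probe()), this could be passed to git bisect run python followed by the file name.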

lstein commented 2 years ago

I'm a bit hamstrung here because everything's hunky-dory on my Linux system. If worst comes to worst, I'll back out all the changes to https://github.com/invoke-ai/InvokeAI/commit/7b0cbb34d618098b4072f14870937ee9eb4369a1 and reconstruct. It's a pity, because there were a lot of new features there, including improvements to the WebUI and outpainting.

Have any of the Mac users experienced this regression?

holstvoogd commented 2 years ago

yep, I'm on a mac (m1 mba); I'm having another look at it later today

lstein commented 2 years ago

This is what comes of reading bug reports late at night. I totally lost track of the fact that this was reported on a Mac system. I got fixated on Windows in some way. Apologies.

One big difference between my environment and the Mac environment is that I'm using Python 3.9 and the Mac environment is 3.10. I will try 3.10 and see if I can reproduce.

Can any Windows users confirm that this bug appears on their systems?

holstvoogd commented 2 years ago

Wait, ldm.gfpgan.gfpgan_tools was removed, should that just be ldm.restoration.gfpgan.gfpgan or so?

EDIT: ok, nvm, bit out of my league here wrt python.

Both from ldm.gfpgan.gfpgan_tools import real_esrgan_upscale and from ldm.gfpgan.gfpgan_tools import run_gfpgan no longer work because gfpgan_tools.py is gone, but they are still used in server.py. Perhaps some changes got lost?

lstein commented 2 years ago

I believe that gfpgan_tools was refactored and is no longer needed, but I'm checking to make sure that this is the case.

lstein commented 2 years ago

That's an uncaught bug in server.py, and I'll fix.

UPDATE: Which server.py? Is it backend/server.py or ldm/server.py?

holstvoogd commented 2 years ago

sorry! it's in backend/server.py. Usages:

https://github.com/invoke-ai/InvokeAI/blob/19174949b6eafe57d576633d4e2c6979e8cc03a9/backend/server.py#L653 https://github.com/invoke-ai/InvokeAI/blob/19174949b6eafe57d576633d4e2c6979e8cc03a9/backend/server.py#L207 https://github.com/invoke-ai/InvokeAI/blob/19174949b6eafe57d576633d4e2c6979e8cc03a9/backend/server.py#L635

And the imports: https://github.com/invoke-ai/InvokeAI/blob/19174949b6eafe57d576633d4e2c6979e8cc03a9/backend/server.py#L21

psychedelicious commented 2 years ago

@holstvoogd Yes that's expected, backend/server.py is being updated now to use the new restoration module

holstvoogd commented 2 years ago

Ah, yeah, I see now. nevermind all my comments then. I hadn't noticed this was actually not about backend/server.py 🤦‍♂️

lstein commented 2 years ago

Oh, I was just testing my fixes to backend/server.py. I will wait for @psychedelicious to commit his PR and work on dream/server.py instead. I do have it working if you want it. The only problem is that I had to hardcode constants for the locations of the GFPGAN directory, etc, because I wasn't sure where they come from in @psychedelicious 's code.

Here's the diff in case it is useful. The first bit is just changes needed to connect on my firewalled system.

diff --git a/backend/server.py b/backend/server.py
index 11d6c61..9302859 100644
--- a/backend/server.py
+++ b/backend/server.py
@@ -18,9 +18,8 @@ from threading import Event
 from uuid import uuid4
 from send2trash import send2trash

-from ldm.gfpgan.gfpgan_tools import real_esrgan_upscale
-from ldm.gfpgan.gfpgan_tools import run_gfpgan
 from ldm.generate import Generate
+from ldm.dream.restoration import Restoration
 from ldm.dream.pngwriter import PngWriter, retrieve_metadata
 from ldm.dream.args import APP_ID, APP_VERSION, calculate_init_img_hash
 from ldm.dream.conditioning import split_weighted_subprompts
@@ -34,11 +33,12 @@ USER CONFIG

 output_dir = "outputs/"  # Base output directory for images
 # host = 'localhost'  # Web & socket.io host
-host = "localhost"  # Web & socket.io host
+host = "0.0.0.0"  # Web & socket.io host
 port = 9090  # Web & socket.io port
 verbose = False  # enables copious socket.io logging
 additional_allowed_origins = [
-    "http://localhost:5173"
+    "http://localhost:5173",
+    "http://localhost:9090",
 ]  # additional CORS allowed origins
 model = "stable-diffusion-1.4"

@@ -46,12 +46,15 @@ model = "stable-diffusion-1.4"
 END USER CONFIG
 """

+# Face Restoration constants that need to be replaced by user configuration
+GFPGAN_DIR        = './src/gfpgan'
+GFPGAN_MODEL_PATH = 'experiments/pretrained_models/GFPGANv1.3.pth'
+ESRGAN_BG_TILE    =  400

 """
 SERVER SETUP
 """

-
 # fix missing mimetypes on windows due to registry wonkiness
 mimetypes.add_type("application/javascript", ".js")
 mimetypes.add_type("text/css", ".css")
@@ -204,13 +207,15 @@ def handle_run_esrgan_event(original_image, esrgan_parameters):
     socketio.emit("progressUpdate", progress)
     eventlet.sleep(0)

-    image = real_esrgan_upscale(
+    # this could be done at initialization time
+    restoration = Restoration(GFPGAN_DIR,GFPGAN_MODEL_PATH,ESRGAN_BG_TILE)
+    esrgan      = restoration.load_ersgan()
+    image       = esrgan.process(
         image=image,
         upsampler_scale=esrgan_parameters["upscale"][0],
         strength=esrgan_parameters["upscale"][1],
         seed=seed,
     )
-
     progress["currentStatus"] = "Saving image"
     socketio.emit("progressUpdate", progress)
     eventlet.sleep(0)
@@ -275,7 +280,10 @@ def handle_run_gfpgan_event(original_image, gfpgan_parameters):
     socketio.emit("progressUpdate", progress)
     eventlet.sleep(0)

-    image = run_gfpgan(
+    # this could be done at initialization time
+    restoration = Restoration(GFPGAN_DIR,GFPGAN_MODEL_PATH,ESRGAN_BG_TILE)
+    gfpgan      = restoration.load_gfpgan()
+    image = gfpgan.process(
         image=image,
         strength=gfpgan_parameters["gfpgan_strength"],
         seed=seed,

lstein commented 2 years ago

The legacy server application doesn't have this problem because it relies on generate() to run the upscaling tools.

holstvoogd commented 2 years ago

I've found a solution for the issue with scripts/dream.py:

@@ -47,7 +47,7 @@ def main():
     # Loading Face Restoration and ESRGAN Modules
     try:
         gfpgan, codeformer, esrgan = None, None, None
-        from ldm.dream.restoration import Restoration
+        from ldm.dream.restoration.base import Restoration
         restoration = Restoration(opt.gfpgan_dir, opt.gfpgan_model_path, opt.esrgan_bg_tile)
         if opt.restore:
             gfpgan, codeformer = restoration.load_face_restore_models()

Restoration was moved from ldm/restoration/restoration.py to ldm/dream/restoration/base.py, that seems to cause this.

holstvoogd commented 2 years ago

Ok, sorry for the confusion earlier! I was a bit too eager to help & doing other work at the same time.

Now, I've taken time to look closely at what I am actually doing & I can confirm that:

python scripts/dream.py is still broken for new environments on the latest development commit
git bisect suggests this is the breaking commit for me: c0e1fb5f7144995adbea3268f4c2e564aeca229b
that makes no sense to me, but removing pyproject.toml & running conda env update fixes the issue. also on the latest versions
running conda update multiple times does not change the behavior
an empty pyproject.toml also causes issues

I'm not sure, but this feels like a conda bug tbh. Anyway, removing pyproject.toml fixes the issues with ModuleNotFoundError: No module named 'ldm'

lstein commented 2 years ago

Great detective work! Thanks for tracking down the path issue to pyproject.toml. I share your puzzlement. Are you 100% sure that removing this file and running conda env update fixes the problem completely? Perhaps @tildebyte can shed some light on this. Perhaps there is an interaction between conda and this file that I'm not aware of.

I'm happy to work around the problem for now just by adding the sys.path.append('.') line to the top of dream.py. backend/server.py already does this. Long run I want to understand why the module loading path is getting screwed up. Did you ever try printing the contents of sys.path using print(sys.path)? The first or second entry should be the absolute pathname of the InvokeAI (or stable-diffusion) directory. If it's not, then some interaction with pyproject.toml must be occurring that alters it.

PR #732, which was just committed to development, should fix python scripts/dream.py. Please report if it doesn't. There is a bug tracking issue #619 specifically set up for reporting WebGUI bugs.

holstvoogd commented 2 years ago

Yes, I've tried several times to be sure :) It seems pyproject.toml is set to replace setup.py. So conda ignores(?) setup.py when it sees the pyproject.toml & since the pyproject.toml has no build config etc, it breaks. I've taken a quick look at migrating setup.py and it is supposed to be super easy, but I couldn't figure it out tbh.

As for print(sys.path) just before the erroring line:

['/Users/arthur/Projects/SD/stable-diffusion/scripts', '/opt/homebrew/Caskroom/miniconda/base/envs/ldm/lib/python310.zip', '/opt/homebrew/Caskroom/miniconda/base/envs/ldm/lib/python3.10', '/opt/homebrew/Caskroom/miniconda/base/envs/ldm/lib/python3.10/lib-dynload', '/opt/homebrew/Caskroom/miniconda/base/envs/ldm/lib/python3.10/site-packages', '/Users/arthur/Projects/SD/stable-diffusion/src/taming-transformers', '/Users/arthur/Projects/SD/stable-diffusion/src/clip', '/Users/arthur/Projects/SD/stable-diffusion/src/gfpgan']

And with the toml removed:

['/Users/arthur/Projects/SD/stable-diffusion/scripts', '/opt/homebrew/Caskroom/miniconda/base/envs/ldm/lib/python310.zip', '/opt/homebrew/Caskroom/miniconda/base/envs/ldm/lib/python3.10', '/opt/homebrew/Caskroom/miniconda/base/envs/ldm/lib/python3.10/lib-dynload', '/opt/homebrew/Caskroom/miniconda/base/envs/ldm/lib/python3.10/site-packages', '/Users/arthur/Projects/SD/stable-diffusion/src/taming-transformers', '/Users/arthur/Projects/SD/stable-diffusion', '/Users/arthur/Projects/SD/stable-diffusion/src/clip', '/Users/arthur/Projects/SD/stable-diffusion/src/gfpgan']

Since that is not very readable: in the former case the stable-diffusion directory is not included; in the latter it is. Otherwise the paths are the same.
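
The comparison can also be done mechanically. This snippet uses the two lists exactly as posted above (with the shared prefixes factored out for readability) and prints the entries unique to each.

```python
ENV_LIB = "/opt/homebrew/Caskroom/miniconda/base/envs/ldm/lib"
REPO = "/Users/arthur/Projects/SD/stable-diffusion"

# sys.path with pyproject.toml present
with_toml = [
    f"{REPO}/scripts",
    f"{ENV_LIB}/python310.zip",
    f"{ENV_LIB}/python3.10",
    f"{ENV_LIB}/python3.10/lib-dynload",
    f"{ENV_LIB}/python3.10/site-packages",
    f"{REPO}/src/taming-transformers",
    f"{REPO}/src/clip",
    f"{REPO}/src/gfpgan",
]

# sys.path with pyproject.toml removed
without_toml = [
    f"{REPO}/scripts",
    f"{ENV_LIB}/python310.zip",
    f"{ENV_LIB}/python3.10",
    f"{ENV_LIB}/python3.10/lib-dynload",
    f"{ENV_LIB}/python3.10/site-packages",
    f"{REPO}/src/taming-transformers",
    REPO,
    f"{REPO}/src/clip",
    f"{REPO}/src/gfpgan",
]

only_without_toml = [p for p in without_toml if p not in with_toml]
only_with_toml = [p for p in with_toml if p not in without_toml]

print(only_without_toml)  # ['/Users/arthur/Projects/SD/stable-diffusion']
print(only_with_toml)     # []
```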

backend/server.py works again with that PR merged! 👍 thanks for all the hard work on this!

tildebyte commented 2 years ago

Got distracted, sorry: TL;DR - @lstein please revert the pyproject.toml commit

conda is doing something "convenient" (HUGE "airquotes") again, i.e. processing the pyproject.toml and taking action based on its mere existence, even though there are no directives in there for it.

FWIW I'm still working on dropping conda, but I'm still seeing unreproducible-on-my-end reports of pip/pew not working properly even on Windows...

FTR, I've built and rebuilt all of my stable and dev venvs locally on Windows 11 with Python3.10 using pip/pew since I can't even remember how long ago... without major incident. Occasionally I hit a dumb typo or something, but it. just. works. here. This isn't to point fingers at users and say "your fault", but rather to express my frustration at not being able to repro install issues locally...

lstein commented 2 years ago

OK. I'm going to rename pyproject.toml to pyproject.toml.hide. Hidden behaviors are very frustrating. I wonder if this has something to do with the renaming of the repository?

tildebyte commented 2 years ago

@lstein;

> I wonder if this has something to do with the renaming of the repository?

Almost definitely not. It's some weird behavior of conda trying to use the 'pyproject.toml' (which can be used similarly to 'requirements.txt' or 'environment.yaml') during the install...
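
One way to see what the install step actually registered: editable installs commonly add the project directory to sys.path through a .pth file in site-packages, which the standard site module reads at startup. The sketch below lists the path lines in those files, following the documented .pth rules (blank lines and lines starting with # are skipped, lines starting with import are executed rather than treated as paths). The helper name is hypothetical, and whether the project root is missing from such a file in the broken environments is an assumption that would need checking.

```python
import site
from pathlib import Path
from typing import Dict, List


def pth_path_entries(site_dir: Path) -> Dict[str, List[str]]:
    """Map each .pth file in `site_dir` to the path lines it contains."""
    found = {}
    for pth in sorted(site_dir.glob("*.pth")):
        lines = pth.read_text(encoding="utf-8", errors="replace").splitlines()
        paths = [
            ln.strip() for ln in lines
            if ln.strip() and not ln.startswith(("#", "import ", "import\t"))
        ]
        if paths:
            found[pth.name] = paths
    return found


# Show what the active environment registers through .pth files.
for directory in site.getsitepackages():
    for name, paths in pth_path_entries(Path(directory)).items():
        print(name, "->", paths)
```

Comparing this output between an environment built with pyproject.toml present and one built without it would show directly whether the stable-diffusion directory was registered in each case.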

psychedelicious commented 2 years ago

Goodness. That was a saga. Poor little Toml, didn’t know the chaos he sowed. Thanks for the in depth troubleshooting @holstvoogd !

taishi55 commented 1 year ago

I'm having the same error message. How can I fix it?

crawlchange commented 1 year ago

Still the same issue for fresh installs. The basic documentation installation step-by-step is broken. This is serious.

invoke-ai / InvokeAI

macOS: `ModuleNotFoundError: No module named 'ldm'` #723