replicate / flux-fine-tuner

Cog wrapper for ostris/ai-toolkit + post-finetuning cog inference for flux models
https://replicate.com/ostris/flux-dev-lora-trainer/train
Apache License 2.0
270 stars 29 forks

Cog build is broken #51

Open Shakahs opened 6 days ago

Shakahs commented 6 days ago

On a Linux VM with an H100 (not at Replicate). The LLaVA dependency is broken: `patch_submodules` crashes on import.

https://github.com/replicate/flux-fine-tuner
cog build
...
 => => naming to docker.io/library/cog-flux-fine-tuner                                                                                                                         0.0s
Validating model schema...

Traceback (most recent call last):
  File "/root/.pyenv/versions/3.10.15/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/root/.pyenv/versions/3.10.15/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/root/.pyenv/versions/3.10.15/lib/python3.10/site-packages/cog/command/openapi_schema.py", line 46, in <module>
    raise CogError(app.state.setup_result.logs)
cog.errors.CogError: ['Error while loading trainer:\n\nTraceback (most recent call last):\n  File "/root/.pyenv/versions/3.10.15/lib/python3.10/site-packages/cog/server/http.py", l
ine 166, in create_app\n    trainer = load_slim_predictor_from_ref(trainer_ref, "train")\n  File "/root/.pyenv/versions/3.10.15/lib/python3.10/site-packages/cog/predictor.py", line
 228, in load_slim_predictor_from_ref\n    module = load_full_predictor_from_file(module_path, module_name)\n  File "/root/.pyenv/versions/3.10.15/lib/python3.10/site-packages/cog/
predictor.py", line 190, in load_full_predictor_from_file\n    spec.loader.exec_module(module)\n  File "<frozen importlib._bootstrap_external>", line 883, in exec_module\n  File "<
frozen importlib._bootstrap>", line 241, in _call_with_frames_removed\n  File "/src/train.py", line 11, in <module>\n    from submodule_patches import patch_submodules\n  File "/sr
c/submodule_patches.py", line 3, in <module>\n    from llava.model.language_model.llava_llama import LlavaLlamaForCausalLM\nModuleNotFoundError: No module named \'llava\'\n']

ⅹ Failed to get type signature: exit status 1

Linux 7c42e316-fc5f-43c3-9b01-0ec3936fca57 5.15.0-124-generic #134-Ubuntu SMP Fri Sep 27 20:20:17 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux
PRETTY_NAME="Ubuntu 22.04.5 LTS"
NAME="Ubuntu"
VERSION_ID="22.04"
VERSION="22.04.5 LTS (Jammy Jellyfish)"
(venv) user@7c42e316-fc5f-43c3-9b01-0ec3936fca57:~/flux-fine-tuner$ nvidia-smi
Tue Oct 15 19:57:28 2024
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.90.07              Driver Version: 550.90.07      CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA H100 80GB HBM3          On  |   00000000:05:00.0 Off |                  Off |
| N/A   29C    P0             74W /  700W |       4MiB /  81559MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+
Shakahs commented 6 days ago

On further investigation, this is a documentation issue. The README should make it clear that this repo needs to be cloned recursively, with its submodules — or the Cog tool should handle that automatically.
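For anyone hitting the same `ModuleNotFoundError: No module named 'llava'`, a sketch of the workaround described above — cloning so that the submodules (which is where the `llava` package lives) come along, or fetching them into an existing clone:

```shell
# Fresh clone: pull the repository together with its submodules.
git clone --recurse-submodules https://github.com/replicate/flux-fine-tuner
cd flux-fine-tuner

# Already cloned without submodules? Fetch them in place instead:
git submodule update --init --recursive

# The build should then get past the llava import error:
cog build
```

`git submodule update --init --recursive` is idempotent, so running it in a fresh recursive clone is harmless.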