On a Linux VM with an H100 (not at Replicate). The LLaVA dependency is broken: importing `patch_submodules` crashes with `ModuleNotFoundError: No module named 'llava'`.
https://github.com/replicate/flux-fine-tuner
cog build
...
=> => naming to docker.io/library/cog-flux-fine-tuner 0.0s
Validating model schema...
Traceback (most recent call last):
File "/root/.pyenv/versions/3.10.15/lib/python3.10/runpy.py", line 196, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/root/.pyenv/versions/3.10.15/lib/python3.10/runpy.py", line 86, in _run_code
exec(code, run_globals)
File "/root/.pyenv/versions/3.10.15/lib/python3.10/site-packages/cog/command/openapi_schema.py", line 46, in <module>
raise CogError(app.state.setup_result.logs)
cog.errors.CogError: ['Error while loading trainer:\n\nTraceback (most recent call last):\n File "/root/.pyenv/versions/3.10.15/lib/python3.10/site-packages/cog/server/http.py", l
ine 166, in create_app\n trainer = load_slim_predictor_from_ref(trainer_ref, "train")\n File "/root/.pyenv/versions/3.10.15/lib/python3.10/site-packages/cog/predictor.py", line
228, in load_slim_predictor_from_ref\n module = load_full_predictor_from_file(module_path, module_name)\n File "/root/.pyenv/versions/3.10.15/lib/python3.10/site-packages/cog/
predictor.py", line 190, in load_full_predictor_from_file\n spec.loader.exec_module(module)\n File "<frozen importlib._bootstrap_external>", line 883, in exec_module\n File "<
frozen importlib._bootstrap>", line 241, in _call_with_frames_removed\n File "/src/train.py", line 11, in <module>\n from submodule_patches import patch_submodules\n File "/sr
c/submodule_patches.py", line 3, in <module>\n from llava.model.language_model.llava_llama import LlavaLlamaForCausalLM\nModuleNotFoundError: No module named \'llava\'\n']
ⅹ Failed to get type signature: exit status 1
Linux 7c42e316-fc5f-43c3-9b01-0ec3936fca57 5.15.0-124-generic #134-Ubuntu SMP Fri Sep 27 20:20:17 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux
PRETTY_NAME="Ubuntu 22.04.5 LTS"
NAME="Ubuntu"
VERSION_ID="22.04"
VERSION="22.04.5 LTS (Jammy Jellyfish)"
(venv) user@7c42e316-fc5f-43c3-9b01-0ec3936fca57:~/flux-fine-tuner$ nvidia-smi
Tue Oct 15 19:57:28 2024
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.90.07 Driver Version: 550.90.07 CUDA Version: 12.4 |
|-----------------------------------------+------------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 NVIDIA H100 80GB HBM3 On | 00000000:05:00.0 Off | Off |
| N/A 29C P0 74W / 700W | 4MiB / 81559MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+------------------------+----------------------+
+-----------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=========================================================================================|
| No running processes found |
+-----------------------------------------------------------------------------------------+
On further investigation, this is a documentation issue: the `llava` package lives in a git submodule, which is missing after a plain `git clone`.
The README should make it clear that this repo needs to be cloned recursively, with submodules.
Alternatively, the Cog tool could initialize submodules automatically.