h2oai / h2o-llmstudio

H2O LLM Studio - a framework and no-code GUI for fine-tuning LLMs. Documentation: https://docs.h2o.ai/h2o-llmstudio/
https://h2o.ai
Apache License 2.0

[BUG] Fix support for Runpod: Permission denied #280

Closed Glavin001 closed 1 year ago

Glavin001 commented 1 year ago

πŸ› Bug

There are multiple issues getting the latest LLM Studio working in Runpod.

Root cause

#163, combined with the fact that Runpod does not appear to have a way to add the following to the docker call:

-u `id -u`:`id -g`

so we need workarounds.
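For context, Docker's -u flag sets the UID and GID the container process runs as, and the backticks above expand to the caller's own IDs. A minimal sketch of how such a command line could be assembled, assuming a hypothetical helper named docker_run_args and a placeholder image name (neither comes from LLM Studio or Runpod):

```python
import os


def docker_run_args(image, port=10101, uid=None, gid=None):
    """Build a `docker run` argument list that runs the container as uid:gid.

    `docker_run_args` is a hypothetical helper for illustration only.
    """
    # Default to the IDs of the calling user, like `id -u` and `id -g`.
    uid = os.getuid() if uid is None else uid
    gid = os.getgid() if gid is None else gid
    return [
        "docker", "run", "--rm",
        "-u", f"{uid}:{gid}",     # run the container process as this user and group
        "-p", f"{port}:{port}",   # publish the UI port
        image,
    ]


# Placeholder image name, assumed for the example.
args = docker_run_args("example/llm-studio:latest", uid=1000, gid=1000)
```

Runpod templates do not seem to expose an equivalent of this flag, which is why the rest of this report looks at directory ownership instead.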

Workaround

The following template uses an older Docker image of LLM Studio and therefore works.

gcr.io/vorvan/h2oai/h2o-llmstudio@sha256:8c3c86d0e2721c35803924ee6005ccf0ebd8bc3b2568bdca5ca5d202b2074a1c (before #163)

To Reproduce

Expose TCP Ports: 10101

Env variables:

Key                      Value
H2O_LLM_STUDIO_WORKDIR   /home/llmstudio/huggingface-cache/hub
HF_DATASETS_CACHE        /home/llmstudio/huggingface-cache/datasets
HUGGINGFACE_HUB_CACHE    /home/llmstudio/huggingface-cache/hub
TRANSFORMERS_CACHE       /home/llmstudio/huggingface-cache/hub

To Reproduce / Runpod template:

Use the following settings throughout, overriding them as instructed:

Volume Mount Path: /workspace

Env vars:

Key                      Value
H2O_LLM_STUDIO_WORKDIR   /workspace

Symptom:

All of the files and directories in /workspace are owned by root while the user is llmstudio:

llmstudio@/workspace$ pwd
/workspace
llmstudio@/workspace$ ls -la
total 348
drwxr-xr-x 1 root root   4096 Jul 12 04:10 .
drwxr-xr-x 1 root root     50 Jul 13 01:29 ..
-rw-r--r-- 1 root root     25 Jul 12 04:02 .dockerignore
-rw-r--r-- 1 root root    308 Jul 12 04:02 .flake8
-rw-r--r-- 1 root root   2023 Jul 12 04:02 .gitignore
-rw-r--r-- 1 root root   5483 Jul 12 04:02 CODE_OF_CONDUCT.md
-rw-r--r-- 1 root root   1432 Jul 12 04:02 Dockerfile
-rw-r--r-- 1 root root  10764 Jul 12 04:02 LICENSE
-rw-r--r-- 1 root root   2953 Jul 12 04:02 Makefile
-rw-r--r-- 1 root root   1559 Jul 12 04:02 Pipfile
-rw-r--r-- 1 root root 197204 Jul 12 04:02 Pipfile.lock
-rw-r--r-- 1 root root  20702 Jul 12 04:02 README.md
-rw-r--r-- 1 root root   1194 Jul 12 04:02 app.py
drwxr-xr-x 4 root root   4096 Jul 12 04:02 app_utils
-rw-r--r-- 1 root root     80 Jul 12 04:02 distributed_train.sh
drwxr-xr-x 3 root root    168 Jul 12 04:02 documentation
drwxr-xr-x 2 root root     35 Jul 12 04:02 examples
-rw-r----- 1 root root   2279 Jul 12 04:02 gha-creds-ff60b99e1c1c0282.json
drwxr-xr-x 3 root root     20 Jul 12 04:02 jenkins
drwxr-xr-x 4 root root     58 Jul 12 04:02 llm_studio
-rw-r--r-- 1 root root   7195 Jul 12 04:02 model_card_template.md
-rw-r--r-- 1 root root   4543 Jul 12 04:02 prompt.py
drwxr-xr-x 2 root root     31 Jul 12 04:02 prompts
-rw-r--r-- 1 root root    452 Jul 12 04:02 pyproject.toml
-rw-r--r-- 1 root root   5183 Jul 12 04:02 requirements.txt
drwxr-xr-x 6 root root     71 Jul 12 04:02 tests
-rw-r--r-- 1 root root  27576 Jul 12 04:02 train.py
-rw-r--r-- 1 root root   5576 Jul 12 04:02 train_wave.py

Fortunately, we can use the /home/llmstudio directory instead, which is owned by the llmstudio user:

llmstudio@~$ pwd
/home/llmstudio
llmstudio@~$ ls -la
total 20
drwxrwxrwx 8 llmstudio llmstudio  189 Jul 13 01:49 .
drwxr-xr-x 1 root      root        23 Jul 12 04:05 ..
-rw-rw-rw- 1 llmstudio llmstudio  220 Jul 12 04:05 .bash_logout
-rw-rw-rw- 1 llmstudio llmstudio 3771 Jul 12 04:05 .bashrc
drwxrwxrwx 4 llmstudio llmstudio   31 Jul 13 01:29 .cache
-rw-r--r-- 1 llmstudio llmstudio   42 Jul 13 01:32 .gotty.auth
drwxrwxrwx 5 llmstudio llmstudio   41 Jul 13 01:29 .local
drwx------ 3 llmstudio llmstudio   26 Jul 13 01:29 .nv
-rw-rw-rw- 1 llmstudio llmstudio  807 Jul 12 04:05 .profile
-rw------- 1 llmstudio llmstudio  487 Jul 13 01:49 .python_history
drwxr-xr-x 4 llmstudio llmstudio   29 Jul 13 01:30 data
drwxr-xr-x 4 llmstudio llmstudio   33 Jul 13 01:30 huggingface-cache
drwxr-xr-x 4 llmstudio llmstudio   34 Jul 13 01:30 output

Workaround: use /home/llmstudio for the Volume Mount Path, since it does have the necessary writable permissions.
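The manual check above (compare ownership, then pick a directory the current user can write to) can also be scripted. A minimal sketch, assuming a hypothetical helper named first_writable_dir; it is not part of LLM Studio:

```python
import os


def first_writable_dir(candidates):
    """Return the first existing directory the current user can write to, else None.

    `first_writable_dir` is a hypothetical helper for illustration only.
    """
    for path in candidates:
        # W_OK: may create entries; X_OK: may traverse into the directory.
        if os.path.isdir(path) and os.access(path, os.W_OK | os.X_OK):
            return path
    return None
```

In the situation described here, calling first_writable_dir(["/workspace", "/home/llmstudio"]) as the llmstudio user would be expected to skip /workspace and return /home/llmstudio.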

Now the page loads!

image

Unfortunately, when trying to create an experiment we hit another blocker:

image

copy_config does not respect H2O_LLM_STUDIO_WORKDIR; it creates the "output" directory relative to the current working directory: https://github.com/h2oai/h2o-llmstudio/blob/main/app_utils/utils.py#L1874-L1875
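One possible direction, shown only as a sketch: build the output path from the H2O_LLM_STUDIO_WORKDIR environment variable instead of relying on the current working directory. The helper name resolve_output_dir and the fallback to "." are assumptions for illustration; this is not the code from the repository or from any fix.

```python
import os


def resolve_output_dir(default_workdir="."):
    """Create and return an "output" directory under the configured workdir.

    `resolve_output_dir` is a hypothetical helper; the fallback to the
    current directory when the variable is unset is an assumption.
    """
    workdir = os.environ.get("H2O_LLM_STUDIO_WORKDIR", default_workdir)
    output_dir = os.path.join(workdir, "output")
    # Created under the workdir rather than under whatever the process cwd is.
    os.makedirs(output_dir, exist_ok=True)
    return output_dir
```

With H2O_LLM_STUDIO_WORKDIR set to /home/llmstudio, this would create /home/llmstudio/output, which the llmstudio user can write to, rather than an "output" directory under the root-owned /workspace.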

To Reproduce:

Volume Mount Path: /home/llmstudio

Env vars:

Key                      Value
H2O_LLM_STUDIO_WORKDIR   /home/llmstudio

Symptom:

Logs after attempting to create an experiment show:

2023-07-13T01:32:25.269827597Z 2023-07-13 01:32:25,269 - INFO: Initializing client True
2023-07-13T01:32:25.328028318Z 2023-07-13 01:32:25,327 - INFO: {'dataset/list', 'dataset/import', 'home/disk_usage', 'experiment/list', 'dataset/display/footer', 'home/compute_stats', 'init_app', 'home/experiments_stats', 'dataset/import/footer', 'home/gpu_stats'}
2023-07-13T01:32:25.331135216Z 2023-07-13 01:32:25,330 - INFO: PREV None text_causal_language_modeling_config None 1 None None 
2023-07-13T01:32:25.331145754Z 2023-07-13 01:32:25,330 - INFO: Starting from CFG
2023-07-13T01:32:25.341542900Z 2023-07-13 01:32:25,341 - INFO: From dataset True
2023-07-13T01:32:25.341559254Z 2023-07-13 01:32:25,341 - INFO: From cfg True
2023-07-13T01:32:25.341560726Z 2023-07-13 01:32:25,341 - INFO: From default True
2023-07-13T01:32:25.341561547Z 2023-07-13 01:32:25,341 - INFO: Config file: text_causal_language_modeling_config
2023-07-13T01:32:37.906757439Z INFO:     127.0.0.1:48968 - "POST / HTTP/1.1" 200 OK
2023-07-13T01:32:37.906781472Z 2023-07-13 01:32:37,906 - INFO: Initializing client True
2023-07-13T01:32:37.962132956Z 2023-07-13 01:32:37,961 - INFO: Starting experiment
2023-07-13T01:32:37.962149304Z 2023-07-13 01:32:37,961 - INFO: experiment/start/cfg_file
2023-07-13T01:32:37.962150018Z 2023-07-13 01:32:37,962 - INFO: CFG: text_causal_language_modeling_config
2023-07-13T01:32:37.964742580Z 2023-07-13 01:32:37,964 - ERROR: Unknown exception
2023-07-13T01:32:37.964749140Z Traceback (most recent call last):
2023-07-13T01:32:37.964750032Z   File "/workspace/./app_utils/handlers.py", line 167, in handle
2023-07-13T01:32:37.964750545Z     await experiment_run(q, pre="experiment/start")
2023-07-13T01:32:37.964751988Z   File "/workspace/./app_utils/sections/experiment.py", line 510, in experiment_run
2023-07-13T01:32:37.964752738Z     start_experiment(cfg=cfg, q=q, pre=pre)
2023-07-13T01:32:37.964753244Z   File "/workspace/./app_utils/utils.py", line 1587, in start_experiment
2023-07-13T01:32:37.964753808Z     cfg = copy_config(cfg)
2023-07-13T01:32:37.964754761Z   File "/workspace/./app_utils/utils.py", line 1874, in copy_config
2023-07-13T01:32:37.964755289Z     os.makedirs("output", exist_ok=True)
2023-07-13T01:32:37.964762872Z   File "/usr/lib/python3.10/os.py", line 225, in makedirs
2023-07-13T01:32:37.964763400Z     mkdir(name, mode)
2023-07-13T01:32:37.964764390Z PermissionError: [Errno 13] Permission denied: 'output'
2023-07-13T01:32:37.964765076Z 2023-07-13 01:32:37,964 - INFO: {'experiment/start/footer', 'dataset/list', 'dataset/import', 'home/disk_usage', 'experiment/start', 'experiment/list', 'dataset/display/footer', 'home/compute_stats', 'init_app', 'home/experiments_stats', 'dataset/import/footer', 'home/gpu_stats'}

And I can reproduce the same error manually from /workspace:

$ cd /workspace
$ python3
Python 3.8.10 (default, May 26 2023, 14:05:08)
[GCC 9.4.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import os
>>> os.makedirs("output")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python3.8/os.py", line 223, in makedirs
    mkdir(name, mode)
PermissionError: [Errno 13] Permission denied: 'output'

More details coming as I learn more.

Glavin001 commented 1 year ago

Fixed with https://github.com/h2oai/h2o-llmstudio/pull/281