Closed benihime91 closed 5 months ago
Issue:
Hi, I trained a task LoRA on custom dataset using the script under scripts/v1_5/finetune_task_lora.sh. The model has been trained but now i am unable to do any inference.
scripts/v1_5/finetune_task_lora.sh
Command: The following is the final script i used to train the task lora
#!/bin/bash export TIMESTAMP=$(date +"%Y-%m-%d_%H-%M-%S") export WANDB_PROJECT=... deepspeed --include localhost:1,2 /mnt/data1/ayushman/projects/dash-diffusion/LLaVA/llava/train/train_mem.py \ --lora_enable True --lora_r 128 --lora_alpha 256 --mm_projector_lr 2e-5 \ --deepspeed ./scripts/zero3.json \ --model_name_or_path liuhaotian/llava-v1.5-13b \ --version v1 \ --data_path ... \ --image_folder ... \ --vision_tower openai/clip-vit-large-patch14-336 \ --mm_projector_type mlp2x_gelu \ --mm_vision_select_layer -2 \ --mm_use_im_start_end False \ --mm_use_im_patch_token False \ --image_aspect_ratio pad \ --group_by_modality_length True \ --bf16 True \ --output_dir ./checkpoints/$TIMESTAMP \ --num_train_epochs 3 \ --per_device_train_batch_size 16 \ --per_device_eval_batch_size 4 \ --gradient_accumulation_steps 1 \ --evaluation_strategy "no" \ --save_strategy "steps" \ --save_steps 50000 \ --save_total_limit 1 \ --learning_rate 2e-4 \ --weight_decay 0. \ --warmup_ratio 0.03 \ --lr_scheduler_type "cosine" \ --logging_steps 1 \ --tf32 True \ --model_max_length 2048 \ --gradient_checkpointing True \ --dataloader_num_workers 4 \ --lazy_preprocess True \ --report_to wandb
The following is the command i use to launch the model worker
CUDA_VISIBLE_DEVICES=1 CUDA_LAUNCH_BLOCKING=1 python -m llava.serve.model_worker --host 0.0.0.0 --controller http://localhost:10000 --port 40000 --worker http://localhost:40000 --model-path ${MODEL_PATH} --model-base lmsys/vicuna-13b-v1.5
I am adding the model_worker logs below
Log:
2024-04-16 02:52:43 | INFO | model_worker | args: Namespace(host='0.0.0.0', port=40000, worker_address='http://localhost:40000', controller_address='http://localhost:10000', model_path='/checkpoints/llava-v1.5-13b-task-lora/2024-04-16_01-38-57/', model_base='lmsys/vicuna-13b-v1.5', model_name=None, device='cuda', multi_modal=False, limit_model_concurrency=5, stream_interval=1, no_register=False, load_8bit=False, load_4bit=False, use_flash_attn=False) 2024-04-16 02:52:43 | INFO | model_worker | Loading the model 2024-04-16_01-38-57 on worker dcf422 ... 2024-04-16 02:52:44 | ERROR | stderr | Loading checkpoint shards: 0%| | 0/3 [00:00<?, ?it/s] 2024-04-16 02:52:44 | ERROR | stderr | /home/ayushman/miniforge3/envs/llava/lib/python3.10/site-packages/torch/_utils.py:831: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly. To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage() 2024-04-16 02:52:44 | ERROR | stderr | return self.fget.__get__(instance, owner)() 2024-04-16 02:52:46 | ERROR | stderr | Loading checkpoint shards: 33%|██████████████████████████▋ | 1/3 [00:01<00:03, 1.92s/it] 2024-04-16 02:52:47 | ERROR | stderr | Loading checkpoint shards: 67%|█████████████████████████████████████████████████████▎ | 2/3 [00:03<00:01, 1.50s/it] 2024-04-16 02:52:48 | ERROR | stderr | Loading checkpoint shards: 100%|████████████████████████████████████████████████████████████████████████████████| 3/3 [00:03<00:00, 1.14s/it] 2024-04-16 02:52:48 | ERROR | stderr | Loading checkpoint shards: 100%|████████████████████████████████████████████████████████████████████████████████| 3/3 [00:03<00:00, 1.28s/it] 2024-04-16 02:52:48 | ERROR | stderr | 2024-04-16 02:52:48 | ERROR | stderr | /home/ayushman/miniforge3/envs/llava/lib/python3.10/site-packages/transformers/generation/configuration_utils.py:392: UserWarning: `do_sample` is set to `False`. However, `temperature` is set to `0.9` -- this flag is only used in sample-based generation modes. You should set `do_sample=True` or unset `temperature`. This was detected when initializing the generation config instance, which means the corresponding file may hold incorrect parameterization and should be fixed. 2024-04-16 02:52:48 | ERROR | stderr | warnings.warn( 2024-04-16 02:52:48 | ERROR | stderr | /home/ayushman/miniforge3/envs/llava/lib/python3.10/site-packages/transformers/generation/configuration_utils.py:397: UserWarning: `do_sample` is set to `False`. However, `top_p` is set to `0.6` -- this flag is only used in sample-based generation modes. You should set `do_sample=True` or unset `top_p`. This was detected when initializing the generation config instance, which means the corresponding file may hold incorrect parameterization and should be fixed. 2024-04-16 02:52:48 | ERROR | stderr | warnings.warn( 2024-04-16 02:52:48 | INFO | stdout | Loading LoRA weights from /mnt/data1/ayushman/projects/dash-diffusion/checkpoints/llava-v1.5-13b-task-lora/2024-04-16_01-38-57 2024-04-16 02:52:52 | INFO | stdout | Merging weights 2024-04-16 02:52:53 | INFO | stdout | Convert to FP16... 2024-04-16 02:52:53 | INFO | model_worker | Register to controller 2024-04-16 02:52:53 | ERROR | stderr | [32mINFO[0m: Started server process [[36m3105957[0m] 2024-04-16 02:52:53 | ERROR | stderr | [32mINFO[0m: Waiting for application startup. 2024-04-16 02:52:53 | ERROR | stderr | [32mINFO[0m: Application startup complete. 2024-04-16 02:52:53 | ERROR | stderr | [32mINFO[0m: Uvicorn running on [1mhttp://0.0.0.0:40000[0m (Press CTRL+C to quit) 2024-04-16 02:52:53 | INFO | stdout | [32mINFO[0m: 127.0.0.1:56118 - "[1mPOST /worker_get_status HTTP/1.1[0m" [32m200 OK[0m 2024-04-16 02:53:08 | INFO | model_worker | Send heart beat. Models: ['2024-04-16_01-38-57']. Semaphore: None. global_counter: 0 2024-04-16 02:53:18 | INFO | model_worker | Send heart beat. Models: ['2024-04-16_01-38-57']. Semaphore: Semaphore(value=4, locked=False). global_counter: 1 2024-04-16 02:53:18 | INFO | stdout | [32mINFO[0m: 127.0.0.1:39104 - "[1mPOST /worker_generate_stream HTTP/1.1[0m" [32m200 OK[0m 2024-04-16 02:53:18 | ERROR | stderr | /home/ayushman/miniforge3/envs/llava/lib/python3.10/site-packages/transformers/generation/utils.py:1295: UserWarning: You have modified the pretrained model configuration to control generation. This is a deprecated strategy to control generation and will be removed soon, in a future version. Please use and modify the model generation configuration (see https://huggingface.co/docs/transformers/generation_strategies#default-text-generation-configuration ) 2024-04-16 02:53:18 | ERROR | stderr | warnings.warn( 2024-04-16 02:53:19 | ERROR | stderr | Exception in thread Thread-4 (generate): 2024-04-16 02:53:19 | ERROR | stderr | Traceback (most recent call last): 2024-04-16 02:53:19 | ERROR | stderr | File "/home/ayushman/miniforge3/envs/llava/lib/python3.10/threading.py", line 1016, in _bootstrap_inner 2024-04-16 02:53:19 | ERROR | stderr | self.run() 2024-04-16 02:53:19 | ERROR | stderr | File "/home/ayushman/miniforge3/envs/llava/lib/python3.10/threading.py", line 953, in run 2024-04-16 02:53:19 | ERROR | stderr | self._target(*self._args, **self._kwargs) 2024-04-16 02:53:19 | ERROR | stderr | File "/home/ayushman/miniforge3/envs/llava/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context 2024-04-16 02:53:19 | ERROR | stderr | return func(*args, **kwargs) 2024-04-16 02:53:19 | ERROR | stderr | File "/home/ayushman/miniforge3/envs/llava/lib/python3.10/site-packages/transformers/generation/utils.py", line 1525, in generate 2024-04-16 02:53:19 | ERROR | stderr | return self.sample( 2024-04-16 02:53:19 | ERROR | stderr | File "/home/ayushman/miniforge3/envs/llava/lib/python3.10/site-packages/transformers/generation/utils.py", line 2622, in sample 2024-04-16 02:53:19 | ERROR | stderr | outputs = self( 2024-04-16 02:53:19 | ERROR | stderr | File "/home/ayushman/miniforge3/envs/llava/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl 2024-04-16 02:53:19 | ERROR | stderr | return self._call_impl(*args, **kwargs) 2024-04-16 02:53:19 | ERROR | stderr | File "/home/ayushman/miniforge3/envs/llava/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl 2024-04-16 02:53:19 | ERROR | stderr | return forward_call(*args, **kwargs) 2024-04-16 02:53:19 | ERROR | stderr | File "/home/ayushman/miniforge3/envs/llava/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py", line 1183, in forward 2024-04-16 02:53:19 | ERROR | stderr | outputs = self.model( 2024-04-16 02:53:19 | ERROR | stderr | File "/home/ayushman/miniforge3/envs/llava/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl 2024-04-16 02:53:19 | ERROR | stderr | return self._call_impl(*args, **kwargs) 2024-04-16 02:53:19 | ERROR | stderr | File "/home/ayushman/miniforge3/envs/llava/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl 2024-04-16 02:53:19 | ERROR | stderr | return forward_call(*args, **kwargs) 2024-04-16 02:53:19 | ERROR | stderr | File "/home/ayushman/miniforge3/envs/llava/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py", line 1027, in forward 2024-04-16 02:53:19 | ERROR | stderr | inputs_embeds = self.embed_tokens(input_ids) 2024-04-16 02:53:19 | ERROR | stderr | File "/home/ayushman/miniforge3/envs/llava/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl 2024-04-16 02:53:19 | ERROR | stderr | return self._call_impl(*args, **kwargs) 2024-04-16 02:53:19 | ERROR | stderr | File "/home/ayushman/miniforge3/envs/llava/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl 2024-04-16 02:53:19 | ERROR | stderr | return forward_call(*args, **kwargs) 2024-04-16 02:53:19 | ERROR | stderr | File "/home/ayushman/miniforge3/envs/llava/lib/python3.10/site-packages/torch/nn/modules/sparse.py", line 162, in forward 2024-04-16 02:53:19 | ERROR | stderr | return F.embedding( 2024-04-16 02:53:19 | ERROR | stderr | File "/home/ayushman/miniforge3/envs/llava/lib/python3.10/site-packages/torch/nn/functional.py", line 2233, in embedding 2024-04-16 02:53:19 | ERROR | stderr | return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse) 2024-04-16 02:53:19 | ERROR | stderr | RuntimeError: CUDA error: device-side assert triggered 2024-04-16 02:53:19 | ERROR | stderr | Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions. 2024-04-16 02:53:19 | ERROR | stderr | 2024-04-16 02:53:23 | INFO | model_worker | Send heart beat. Models: ['2024-04-16_01-38-57']. Semaphore: Semaphore(value=4, locked=False). global_counter: 1 2024-04-16 02:53:33 | INFO | stdout | Caught Unknown Error 2024-04-16 02:53:33 | INFO | model_worker | Send heart beat. Models: ['2024-04-16_01-38-57']. Semaphore: Semaphore(value=5, locked=False). global_counter: 1 2024-04-16 02:53:38 | INFO | model_worker | Send heart beat. Models: ['2024-04-16_01-38-57']. Semaphore: Semaphore(value=5, locked=False). global_counter: 1
Also adding my env for better visibility
Package Version Editable project location ------------------------- ----------- ------------------------------------------------- accelerate 0.21.0 aiofiles 23.2.1 albumentations 1.4.3 altair 5.3.0 annotated-types 0.6.0 anyio 4.3.0 appdirs 1.4.4 asttokens 2.4.1 attrs 23.2.0 bitsandbytes 0.43.1 blis 0.7.11 braceexpand 0.1.7 cachetools 5.3.3 catalogue 2.0.10 certifi 2024.2.2 charset-normalizer 3.3.2 click 8.1.7 cloudpathlib 0.16.0 colorama 0.4.6 comm 0.2.2 confection 0.1.4 contourpy 1.2.1 cycler 0.12.1 cymem 2.0.8 dataclasses 0.6 debugpy 1.8.1 decorator 5.1.1 deepspeed 0.12.6 docker-pycreds 0.4.0 einops 0.6.1 einops-exts 0.0.4 exceptiongroup 1.2.0 executing 2.0.1 ExifRead-nocycle 3.0.1 fastai 2.7.14 fastapi 0.110.1 fastcore 1.5.29 fastdownload 0.0.7 fastprogress 1.0.3 ffmpy 0.3.2 filelock 3.13.4 fire 0.5.0 flash-attn 2.5.7 fonttools 4.51.0 fsspec 2024.3.1 gitdb 4.0.11 GitPython 3.1.43 gradio 4.16.0 gradio_client 0.8.1 h11 0.14.0 hjson 3.1.0 httpcore 0.17.3 httpx 0.24.0 huggingface-hub 0.22.2 idna 3.7 imageio 2.34.0 img2dataset 1.45.0 importlib_metadata 7.1.0 importlib_resources 6.4.0 ipykernel 6.29.4 ipython 8.23.0 jedi 0.19.1 Jinja2 3.1.3 joblib 1.4.0 jsonschema 4.21.1 jsonschema-specifications 2023.12.1 jupyter_client 8.6.1 jupyter_core 5.7.2 kiwisolver 1.4.5 langcodes 3.3.0 lazy_loader 0.4 llava 1.2.2.post1 /mnt/data1/ayushman/projects/dash-diffusion/LLaVA markdown-it-py 3.0.0 markdown2 2.4.13 MarkupSafe 2.1.5 matplotlib 3.8.4 matplotlib-inline 0.1.6 mdurl 0.1.2 mpmath 1.3.0 murmurhash 1.0.10 nest_asyncio 1.6.0 networkx 3.3 ninja 1.11.1.1 numpy 1.26.4 nvidia-cublas-cu12 12.1.3.1 nvidia-cuda-cupti-cu12 12.1.105 nvidia-cuda-nvrtc-cu12 12.1.105 nvidia-cuda-runtime-cu12 12.1.105 nvidia-cudnn-cu12 8.9.2.26 nvidia-cufft-cu12 11.0.2.54 nvidia-curand-cu12 10.3.2.106 nvidia-cusolver-cu12 11.4.5.107 nvidia-cusparse-cu12 12.1.0.106 nvidia-ml-py 12.535.133 nvidia-nccl-cu12 2.18.1 nvidia-nvjitlink-cu12 12.4.127 nvidia-nvtx-cu12 12.1.105 nvitop 1.3.2 opencv-python-headless 4.9.0.80 orjson 3.10.1 packaging 24.0 pandas 2.2.2 parso 0.8.4 peft 0.10.0 pexpect 4.9.0 pickleshare 0.7.5 pillow 10.3.0 pip 24.0 platformdirs 4.2.0 preshed 3.0.9 prompt-toolkit 3.0.43 protobuf 4.25.3 psutil 5.9.8 ptyprocess 0.7.0 pure-eval 0.2.2 py-cpuinfo 9.0.0 pyarrow 15.0.2 pydantic 2.7.0 pydantic_core 2.18.1 pydub 0.25.1 Pygments 2.17.2 pynvml 11.5.0 pyparsing 3.1.2 python-dateutil 2.9.0.post0 python-multipart 0.0.9 pytz 2024.1 PyYAML 6.0.1 pyzmq 26.0.0 referencing 0.34.0 regex 2023.12.25 requests 2.31.0 rich 13.7.1 rpds-py 0.18.0 ruff 0.3.7 safetensors 0.4.3 scikit-image 0.23.1 scikit-learn 1.2.2 scipy 1.13.0 semantic-version 2.10.0 sentencepiece 0.1.99 sentry-sdk 1.45.0 setproctitle 1.3.3 setuptools 69.5.1 shellingham 1.5.4 shortuuid 1.0.13 six 1.16.0 smart-open 6.4.0 smmap 5.0.1 sniffio 1.3.1 spacy 3.7.4 spacy-legacy 3.0.12 spacy-loggers 1.0.5 srsly 2.4.8 stack-data 0.6.3 starlette 0.37.2 svgwrite 1.4.3 sympy 1.12 termcolor 2.4.0 thinc 8.2.3 threadpoolctl 3.4.0 tifffile 2024.2.12 timm 0.6.13 tokenizers 0.15.1 tomlkit 0.12.0 toolz 0.12.1 torch 2.1.2 torchvision 0.16.2 tornado 6.4 tqdm 4.66.2 traitlets 5.14.2 transformers 4.37.2 triton 2.1.0 typer 0.9.4 typing_extensions 4.11.0 tzdata 2024.1 urllib3 2.2.1 uvicorn 0.29.0 wandb 0.16.6 wasabi 1.1.2 wavedrom 2.0.3.post3 wcwidth 0.2.13 weasel 0.3.4 webdataset 0.2.86 websockets 11.0.3 wheel 0.43.0 zipp 3.17.0
Screenshots: You may attach screenshots if it better explains the issue.
Hi, did you find the solution or any other way to load the trained lora for inference?
Describe the issue
Issue:
Hi, I trained a task LoRA on custom dataset using the script under
scripts/v1_5/finetune_task_lora.sh
. The model has been trained but now i am unable to do any inference.Command: The following is the final script i used to train the task lora
The following is the command i use to launch the model worker
I am adding the model_worker logs below
Log:
Also adding my env for better visibility
Screenshots: You may attach screenshots if it better explains the issue.