oobabooga / text-generation-webui

A Gradio web UI for Large Language Models.
GNU Affero General Public License v3.0

LoRA trainer ignores the name of the Dataset and tries to load the non-existent "None.txt" #636

Closed xjdeng closed 1 year ago

xjdeng commented 1 year ago

Describe the bug

I'm unable to train any LoRAs: regardless of what I select in the Dataset field, the trainer tries to load "None.txt".

Is there an existing issue for this?

Reproduction

Download distilgpt2 to the models folder. (I'm testing the LoRA training capabilities, so I started with a small model; this step may not be necessary, and you can try the model of your choice.)

Download https://github.com/tloen/alpaca-lora/blob/main/alpaca_data_cleaned.json to the training/datasets/ folder

Launch with !python server.py --share --load-in-8bit --model distilgpt2 (or whatever model you have downloaded)

Go to the Training tab

Pick the alpaca_data_cleaned Dataset

Hit "Start LoRA Training"

Screenshot

(screenshot attached)

Logs

===================================BUG REPORT===================================
Welcome to bitsandbytes. For bug reports, please submit your error trace to: https://github.com/TimDettmers/bitsandbytes/issues
================================================================================
/usr/local/lib/python3.9/dist-packages/bitsandbytes/cuda_setup/main.py:136: UserWarning: /usr/lib64-nvidia did not contain libcudart.so as expected! Searching further paths...
  warn(msg)
/usr/local/lib/python3.9/dist-packages/bitsandbytes/cuda_setup/main.py:136: UserWarning: WARNING: The following directories listed in your path were found to be non-existent: {PosixPath('/sys/fs/cgroup/memory.events /var/colab/cgroup/jupyter-children/memory.events')}
  warn(msg)
/usr/local/lib/python3.9/dist-packages/bitsandbytes/cuda_setup/main.py:136: UserWarning: WARNING: The following directories listed in your path were found to be non-existent: {PosixPath('//colab.research.google.com/tun/m/cc48301118ce562b961b3c22d803539adc1e0c19/gpu-t4-s-24sd3pkl04hsi --tunnel_background_save_delay=10s --tunnel_periodic_background_save_frequency=30m0s --enable_output_coalescing=true --output_coalescing_required=true'), PosixPath('--listen_host=172.28.0.12 --target_host=172.28.0.12 --tunnel_background_save_url=https')}
  warn(msg)
/usr/local/lib/python3.9/dist-packages/bitsandbytes/cuda_setup/main.py:136: UserWarning: WARNING: The following directories listed in your path were found to be non-existent: {PosixPath('/env/python')}
  warn(msg)
/usr/local/lib/python3.9/dist-packages/bitsandbytes/cuda_setup/main.py:136: UserWarning: WARNING: The following directories listed in your path were found to be non-existent: {PosixPath('module'), PosixPath('//ipykernel.pylab.backend_inline')}
  warn(msg)
CUDA_SETUP: WARNING! libcudart.so not found in any environmental path. Searching /usr/local/cuda/lib64...
CUDA SETUP: CUDA runtime path found: /usr/local/cuda/lib64/libcudart.so
CUDA SETUP: Highest compute capability among GPUs detected: 7.5
CUDA SETUP: Detected CUDA version 118
CUDA SETUP: Loading binary /usr/local/lib/python3.9/dist-packages/bitsandbytes/libbitsandbytes_cuda118.so...
2023-03-29 13:56:02.704680: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer.so.7'; dlerror: libnvinfer.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/lib64-nvidia
2023-03-29 13:56:02.704798: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer_plugin.so.7'; dlerror: libnvinfer_plugin.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/lib64-nvidia
2023-03-29 13:56:02.704832: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Cannot dlopen some TensorRT libraries. If you would like to use Nvidia GPU with TensorRT, please make sure the missing libraries mentioned above are installed properly.
Loading distilgpt2...
Loaded the model in 3.12 seconds.
/usr/local/lib/python3.9/dist-packages/gradio/deprecation.py:40: UserWarning: The 'type' parameter has been deprecated. Use the Number component instead.
  warnings.warn(value)
Running on local URL:  http://127.0.0.1:7860
Running on public URL: https://b14f2ce6b14766cd43.gradio.live

This share link expires in 72 hours. For free permanent hosting and GPU upgrades (NEW!), check out Spaces: https://huggingface.co/spaces
Loading raw text file dataset...
Traceback (most recent call last):
  File "/usr/local/lib/python3.9/dist-packages/gradio/routes.py", line 394, in run_predict
    output = await app.get_blocks().process_api(
  File "/usr/local/lib/python3.9/dist-packages/gradio/blocks.py", line 1075, in process_api
    result = await self.call_function(
  File "/usr/local/lib/python3.9/dist-packages/gradio/blocks.py", line 898, in call_function
    prediction = await anyio.to_thread.run_sync(
  File "/usr/local/lib/python3.9/dist-packages/anyio/to_thread.py", line 31, in run_sync
    return await get_asynclib().run_sync_in_worker_thread(
  File "/usr/local/lib/python3.9/dist-packages/anyio/_backends/_asyncio.py", line 937, in run_sync_in_worker_thread
    return await future
  File "/usr/local/lib/python3.9/dist-packages/anyio/_backends/_asyncio.py", line 867, in run
    result = context.run(func, *args)
  File "/usr/local/lib/python3.9/dist-packages/gradio/utils.py", line 549, in async_iteration
    return next(iterator)
  File "/content/drive/MyDrive/llm/text-generation-webui/modules/training.py", line 124, in do_train
    with open(clean_path('training/datasets', f'{raw_text_file}.txt'), 'r') as file:
FileNotFoundError: [Errno 2] No such file or directory: '/content/drive/MyDrive/llm/text-generation-webui/training/datasets/None.txt'
Loading raw text file dataset...
Traceback (most recent call last):
  File "/usr/local/lib/python3.9/dist-packages/gradio/routes.py", line 394, in run_predict
    output = await app.get_blocks().process_api(
  File "/usr/local/lib/python3.9/dist-packages/gradio/blocks.py", line 1075, in process_api
    result = await self.call_function(
  File "/usr/local/lib/python3.9/dist-packages/gradio/blocks.py", line 898, in call_function
    prediction = await anyio.to_thread.run_sync(
  File "/usr/local/lib/python3.9/dist-packages/anyio/to_thread.py", line 31, in run_sync
    return await get_asynclib().run_sync_in_worker_thread(
  File "/usr/local/lib/python3.9/dist-packages/anyio/_backends/_asyncio.py", line 937, in run_sync_in_worker_thread
    return await future
  File "/usr/local/lib/python3.9/dist-packages/anyio/_backends/_asyncio.py", line 867, in run
    result = context.run(func, *args)
  File "/usr/local/lib/python3.9/dist-packages/gradio/utils.py", line 549, in async_iteration
    return next(iterator)
  File "/content/drive/MyDrive/llm/text-generation-webui/modules/training.py", line 124, in do_train
    with open(clean_path('training/datasets', f'{raw_text_file}.txt'), 'r') as file:
FileNotFoundError: [Errno 2] No such file or directory: '/content/drive/MyDrive/llm/text-generation-webui/training/datasets/None.txt'
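The traceback shows `do_train` formatting the raw-text-file dropdown value into a path even when a JSON dataset is selected, so the value `None` becomes the filename "None.txt". A minimal sketch of that failure mode and one possible guard (the helper names here are simplified stand-ins, not the project's exact code):

```python
import os

def clean_path(base, path):
    # simplified stand-in for the webui's path-sanitizing helper
    return os.path.join(base, os.path.basename(str(path)))

def dataset_path(raw_text_file, dataset):
    # buggy behavior: the raw-text field is formatted into the path
    # even when it is None because a JSON dataset was selected instead
    return clean_path('training/datasets', f'{raw_text_file}.txt')

def dataset_path_guarded(raw_text_file, dataset):
    # possible guard: fall back to the JSON dataset when no raw text
    # file is chosen (the dropdown's empty value may be None or 'None')
    if raw_text_file in (None, 'None', ''):
        return clean_path('training/datasets', f'{dataset}.json')
    return clean_path('training/datasets', f'{raw_text_file}.txt')

print(dataset_path(None, 'alpaca_data_cleaned'))
# -> training/datasets/None.txt (the path from the traceback)
print(dataset_path_guarded(None, 'alpaca_data_cleaned'))
# -> training/datasets/alpaca_data_cleaned.json
```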

System Info

Google Colab notebook running the free GPU (T4) and Python 3.9; requirements installed from requirements.txt.

oobabooga commented 1 year ago

Can you check if it works now?

xjdeng commented 1 year ago

Can you check if it works now?

Yep, it's able to get past that but now I'm getting a new error. Should I create a new issue?

Traceback (most recent call last):
  File "/content/drive/MyDrive/llm/text-generation-webui/modules/training.py", line 190, in do_train
    lora_model = get_peft_model(shared.model, config)
  File "/usr/local/lib/python3.9/dist-packages/peft/mapping.py", line 145, in get_peft_model
    return MODEL_TYPE_TO_PEFT_MODEL_MAPPING[peft_config.task_type](model, peft_config)
  File "/usr/local/lib/python3.9/dist-packages/peft/peft_model.py", line 514, in __init__
    super().__init__(model, peft_config)
  File "/usr/local/lib/python3.9/dist-packages/peft/peft_model.py", line 79, in __init__
    self.base_model = LoraModel(peft_config, model)
  File "/usr/local/lib/python3.9/dist-packages/peft/tuners/lora.py", line 118, in __init__
    self._find_and_replace()
  File "/usr/local/lib/python3.9/dist-packages/peft/tuners/lora.py", line 181, in _find_and_replace
    raise ValueError(
ValueError: Target modules ['q_proj', 'v_proj'] not found in the base model. Please check the target modules and try again.

It's possible that it's only related to distilgpt2. I'll try it with another model in the meantime.
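For context, the defaults ['q_proj', 'v_proj'] are LLaMA-style layer names; GPT-2-family models such as distilgpt2 fuse attention into a layer conventionally named c_attn, so PEFT's lookup fails. One rough way to discover valid targets is to scan the model's module names (a sketch; `find_candidate_targets` is a hypothetical helper, and `named_modules()` is the usual torch.nn.Module API it would consume):

```python
def find_candidate_targets(named_modules, keywords=('attn', 'proj', 'fc')):
    """Collect leaf-module name suffixes that look like attention or
    projection layers. `named_modules` is an iterable of (name, module)
    pairs, e.g. from torch.nn.Module.named_modules()."""
    suffixes = set()
    for name, _module in named_modules:
        suffix = name.rsplit('.', 1)[-1]
        if any(key in suffix for key in keywords):
            suffixes.add(suffix)
    return sorted(suffixes)

# On a GPT-2-style model the scan would surface names like these
# (module list abbreviated by hand for illustration):
gpt2_style_modules = [
    ('transformer.h.0.attn.c_attn', None),
    ('transformer.h.0.attn.c_proj', None),
    ('transformer.h.0.mlp.c_fc', None),
]
print(find_candidate_targets(gpt2_style_modules))  # ['c_attn', 'c_fc', 'c_proj']
```

Passing a matching name (e.g. target_modules=["c_attn"] for GPT-2-style models) to the LoRA config should avoid this particular ValueError.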

Edit: Tried it with Llama-7B and I ran out of RAM (physical RAM, not GPU RAM) on my free Colab instance with 12.7 GB. Is this normal? On an unrelated note, I've been unable to load a lot of models that supposedly have lower physical RAM requirements (like GPT-4chan, which should only need 4-5 GB), yet I'm running out of RAM trying to load them on Colab.
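Running out of system RAM on a 12.7 GB Colab instance is plausible for a 7B model: a weights-only back-of-the-envelope estimate (ignoring activations, optimizer state, and the temporary copies made while loading a checkpoint, so real usage is higher):

```python
def weight_memory_gb(n_params, bytes_per_param):
    # weights-only estimate: parameter count times bytes per parameter;
    # ignores activations, optimizer state, and checkpoint-loading copies
    return n_params * bytes_per_param / 2**30

llama_7b = 7e9  # approximate parameter count
print(round(weight_memory_gb(llama_7b, 2), 1))  # fp16: ~13.0 GB
print(round(weight_memory_gb(llama_7b, 1), 1))  # int8: ~6.5 GB
```

By this estimate, the fp16 weights alone (~13 GB) already exceed 12.7 GB of system RAM, and loading typically peaks above the steady-state figure, which would explain the OOM even before 8-bit conversion.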

kagevazquez commented 1 year ago

Can you check if it works now?

It's working; I'm training at the moment.

kagevazquez commented 1 year ago

Edit: Tried it with Llama-7B and I ran out of ram (physical ram, not GPU ram) on my free colab instance with 12.7 GB. Is this normal? On an unrelated note, I've been unable to load a lot of models that have supposedly lower physical ram requirements (like GPT-4chan which should only need 4-5 GB and yet, I'm running out of ram trying to load them on colab).

My machine used all 64 GB of RAM before training, plus about 10 GB more VRAM while training 7B; it's at 19 GB of VRAM at the moment. (screenshot attached)

xjdeng commented 1 year ago

Training works now with gpt-neo-125M in Colab!