Open raghavbj24 opened 9 months ago
Hi, would it be possible for someone to kindly assist me with this error?
I'm trying the same thing but I get an error a lot earlier:
PyTorch version 2.2.0 available.
ludwig v0.9.3 - Train
Traceback (most recent call last):
  File "/home/azureuser/ludwig/venv/bin/ludwig", line 8, in <module>
    sys.exit(main())
  File "/home/azureuser/ludwig/venv/lib/python3.10/site-packages/ludwig/cli.py", line 197, in main
    CLI()
  File "/home/azureuser/ludwig/venv/lib/python3.10/site-packages/ludwig/cli.py", line 72, in __init__
    getattr(self, args.command)()
  File "/home/azureuser/ludwig/venv/lib/python3.10/site-packages/ludwig/cli.py", line 77, in train
    train.cli(sys.argv[2:])
  File "/home/azureuser/ludwig/venv/lib/python3.10/site-packages/ludwig/train.py", line 395, in cli
    train_cli(**vars(args))
  File "/home/azureuser/ludwig/venv/lib/python3.10/site-packages/ludwig/train.py", line 176, in train_cli
    model = LudwigModel(
  File "/home/azureuser/ludwig/venv/lib/python3.10/site-packages/ludwig/api.py", line 317, in __init__
    self.config_obj = ModelConfig.from_dict(self._user_config)
  File "/home/azureuser/ludwig/venv/lib/python3.10/site-packages/ludwig/schema/model_types/base.py", line 141, in from_dict
    config_obj: ModelConfig = schema.load(config)
  File "/home/azureuser/ludwig/venv/lib/python3.10/site-packages/marshmallow_dataclass/__init__.py", line 730, in load
    return clazz(**all_loaded)
  File "<string>", line 18, in __init__
  File "/home/azureuser/ludwig/venv/lib/python3.10/site-packages/ludwig/schema/model_types/base.py", line 73, in __post_init__
    set_llm_parameters(self)
  File "/home/azureuser/ludwig/venv/lib/python3.10/site-packages/ludwig/schema/model_types/utils.py", line 314, in set_llm_parameters
    _set_generation_max_new_tokens(config)
  File "/home/azureuser/ludwig/venv/lib/python3.10/site-packages/ludwig/schema/model_types/utils.py", line 401, in _set_generation_max_new_tokens
    max_possible_sequence_length = _get_maximum_possible_sequence_length(config, _DEFAULT_MAX_SEQUENCE_LENGTH)
  File "/home/azureuser/ludwig/venv/lib/python3.10/site-packages/ludwig/schema/model_types/utils.py", line 377, in _get_maximum_possible_sequence_length
    model_config = AutoConfig.from_pretrained(config.base_model)
  File "/home/azureuser/ludwig/venv/lib/python3.10/site-packages/transformers/models/auto/configuration_auto.py", line 1100, in from_pretrained
    config_dict, unused_kwargs = PretrainedConfig.get_config_dict(pretrained_model_name_or_path, **kwargs)
  File "/home/azureuser/ludwig/venv/lib/python3.10/site-packages/transformers/configuration_utils.py", line 634, in get_config_dict
    config_dict, kwargs = cls._get_config_dict(pretrained_model_name_or_path, **kwargs)
  File "/home/azureuser/ludwig/venv/lib/python3.10/site-packages/transformers/configuration_utils.py", line 689, in _get_config_dict
    resolved_config_file = cached_file(
  File "/home/azureuser/ludwig/venv/lib/python3.10/site-packages/transformers/utils/hub.py", line 356, in cached_file
    raise EnvironmentError(
OSError: /home/azureuser/ludwig/results/experiment_run_50/model/model_weights does not appear to have a file named config.json. Checkout 'https://huggingface.co//home/azureuser/ludwig/results/experiment_run_50/model/model_weights/None' for available files.
How did you get the model to load in the first place?
Looks like there is an issue with the generated model itself on my end.
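In case it helps anyone hitting the same OSError: a quick check of what the `model_weights` directory actually contains can narrow things down. This is only a sketch under assumptions, not a confirmed diagnosis. It assumes that a full Hugging Face model directory has a `config.json`, while a directory holding only a PEFT/LoRA adapter has an `adapter_config.json` instead. The function name `describe_checkpoint_dir` and its return labels are made up for illustration.

```python
from pathlib import Path


def describe_checkpoint_dir(path):
    """Classify a directory by which Hugging Face-style config files it holds.

    Returns one of: "missing", "full-model", "adapter-only", "unknown".
    The labels are hypothetical and exist only for this sketch.
    """
    directory = Path(path)
    if not directory.is_dir():
        return "missing"
    # AutoConfig.from_pretrained looks for config.json in a local directory.
    if (directory / "config.json").is_file():
        return "full-model"
    # PEFT stores adapter settings in adapter_config.json.
    if (directory / "adapter_config.json").is_file():
        return "adapter-only"
    return "unknown"


if __name__ == "__main__":
    # Path copied from the traceback above; adjust to your own run.
    weights_dir = "/home/azureuser/ludwig/results/experiment_run_50/model/model_weights"
    print(describe_checkpoint_dir(weights_dir))
```

If this reports "adapter-only", the directory probably cannot be used directly as `base_model`, since the traceback shows `AutoConfig.from_pretrained(config.base_model)` being called on it.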
Hi, we are trying to perform incremental training (fine-tuning a model that was already fine-tuned with Ludwig), but we are hitting the following error.
Complete log file:
Read the training data into a dataframe..
Reading the training data into a dataframe has been completed..
Setting up the HuggingFace API Token..
Huggingface token is added in the environment..
Load the Ludwig configuration YAML file..
Loading the Ludwig configuration YAML file has been completed..
Loading the Base Model..
Setting generation max_new_tokens to 512 to correspond with the max sequence length assigned to the output feature or the global max sequence length. This will ensure that the correct number of tokens are generated at inference time. To override this behavior, set `generation.max_new_tokens` to a different value in your Ludwig config.
Loading the trained Base Model has been completed..
Starting the Fine Tuning..

╒════════════════════════╕
│ EXPERIMENT DESCRIPTION │
╘════════════════════════╛

│ Experiment name  │ api_experiment
│ Model name       │ run
│ Output directory │ /home/ubuntu/results/api_experiment_run_19
│ ludwig_version   │ '0.9.3'
│ command          │ '/home/ubuntu/train_llama-2_7b_Log_Analytics_8bit_merged_v8/codebase/train_llama_using_ludwig.py'
│ random_seed      │ 42
│ data_format      │ "<class 'pandas.core.frame.DataFrame'>"
│ torch_version    │ '2.1.0+cu121'
│ compute          │ { 'arch_list': [ 'sm_50',
│                  │                  'sm_60',
│                  │                  'sm_70',
│                  │                  'sm_75',
│                  │                  'sm_80',
│                  │                  'sm_86',
│                  │                  'sm_90'],
│                  │   'devices': { 0: { 'device_capability': (8, 6),
│                  │                     'device_properties': "_CudaDeviceProperties(name='NVIDIA "
│                  │                                          "A10G', major=8, minor=6, "
│                  │                                          'total_memory=22723MB, '
│                  │                                          'multi_processor_count=80)',
│                  │                     'gpu_type': 'NVIDIA A10G'}},
│                  │   'gencode_flags': '-gencode compute=compute_50,code=sm_50 -gencode '
│                  │                    'compute=compute_60,code=sm_60 -gencode '
│                  │                    'compute=compute_70,code=sm_70 -gencode '
│                  │                    'compute=compute_75,code=sm_75 -gencode '
│                  │                    'compute=compute_80,code=sm_80 -gencode '
│                  │                    'compute=compute_86,code=sm_86 -gencode '
│                  │                    'compute=compute_90,code=sm_90',
│                  │   'gpus_per_node': 1,
│                  │   'num_nodes': 1}

╒═══════════════╕
│ LUDWIG CONFIG │
╘═══════════════╛

User-specified config (with upgrades):

{   'adapter': {   'alpha': 16,
                   'bias_type': 'none',
                   'dropout': 0.05,
                   'postprocessor': {   'merge_adapter_into_base_model': True,
                                        'progressbar': True},
                   'pretrained_adapter_weights': None,
                   'r': 8,
                   'target_modules': None,
                   'type': 'lora'},
    'backend': {'type': 'local'},
    'base_model': '/home/ubuntu/results/api_experiment_run_15/model/model_weights',
    'input_features': [   {   'name': 'prompt',
                              'preprocessing': {'max_sequence_length': 1024},
                              'type': 'text'}],
    'ludwig_version': '0.9.3',
    'model_type': 'llm',
    'output_features': [   {   'name': 'Response',
                               'preprocessing': {'max_sequence_length': 512},
                               'type': 'text'}],
    'preprocessing': {'sample_ratio': 1.0},
    'prompt': {   'template': '### Instruction:\n'
                              '{Instruction}\n'
                              '\n'
                              '### Context:\n'
                              '{Context}\n'
                              '\n'
                              '### Response:\n'},
    'quantization': {'bits': 8},
    'trainer': {   'batch_size': 1,
                   'enable_gradient_checkpointing': True,
                   'epochs': 3,
                   'gradient_accumulation_steps': 1,
                   'learning_rate': 0.0001,
                   'learning_rate_scheduler': {'warmup_fraction': 0.01},
                   'max_batch_size': 1,
                   'type': 'finetune'}}

Full config saved to:
/home/ubuntu/results/api_experiment_run_19/api_experiment/model/model_hyperparameters.json

╒═══════════════╕
│ PREPROCESSING │
╘═══════════════╛

No cached dataset found at /home/ubuntu/eeff4f02cfeb11ee808f12abaaebd043.training.hdf5. Preprocessing the dataset.
Using full dataframe
Building dataset (it may take a while)
Loaded HuggingFace implementation of /home/ubuntu/results/api_experiment_run_15/model/model_weights tokenizer
Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation.
Max length of feature 'None': 143 (without start and stop symbols)
Max sequence length is 143 for feature 'None'
Loaded HuggingFace implementation of /home/ubuntu/results/api_experiment_run_15/model/model_weights tokenizer
Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation.
Max length of feature 'Response': 144 (without start and stop symbols)
Max sequence length is 144 for feature 'Response'
Loaded HuggingFace implementation of /home/ubuntu/results/api_experiment_run_15/model/model_weights tokenizer
Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation.
Loaded HuggingFace implementation of /home/ubuntu/results/api_experiment_run_15/model/model_weights tokenizer
Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation.
Building dataset: DONE
Writing preprocessed training set cache to /home/ubuntu/eeff4f02cfeb11ee808f12abaaebd043.training.hdf5
Writing preprocessed validation set cache to /home/ubuntu/eeff4f02cfeb11ee808f12abaaebd043.validation.hdf5
Writing preprocessed test set cache to /home/ubuntu/eeff4f02cfeb11ee808f12abaaebd043.test.hdf5
Writing train set metadata to /home/ubuntu/eeff4f02cfeb11ee808f12abaaebd043.meta.json

Dataset Statistics
╒════════════╤═══════════════╤════════════════════╕
│ Dataset    │   Size (Rows) │ Size (In Memory)   │
╞════════════╪═══════════════╪════════════════════╡
│ Training   │            31 │ 7.39 Kb            │
├────────────┼───────────────┼────────────────────┤
│ Validation │             4 │ 1.06 Kb            │
├────────────┼───────────────┼────────────────────┤
│ Test       │             9 │ 2.23 Kb            │
╘════════════╧═══════════════╧════════════════════╛

╒═══════╕
│ MODEL │
╘═══════╛

Warnings and other logs:
Loading large language model...
We will use 90% of the memory on device 0 for storing the model, and 10% for the buffer to avoid OOM. You can set `max_memory` in to a higher value to use more memory (at your own risk).
Loading checkpoint shards: 0%| | 0/2 [00:00<?, ?it/s]
Loading checkpoint shards: 50%|█████ | 1/2 [01:15<01:15, 75.45s/it]
Loading checkpoint shards: 100%|██████████| 2/2 [01:42<00:00, 46.69s/it]
Loading checkpoint shards: 100%|██████████| 2/2 [01:42<00:00, 51.01s/it]
Done.
Loaded HuggingFace implementation of /home/ubuntu/results/api_experiment_run_15/model/model_weights tokenizer
Already found a `peft_config` attribute in the model. This will lead to having multiple adapters in the model. Make sure to know what you are doing!
==================================================
Trainable Parameter Summary For Fine-Tuning
Fine-tuning with adapter: lora
trainable params: 4,194,304 || all params: 6,742,609,920 || trainable%: 0.06220594176090199
==================================================
Gradient checkpointing enabled for training.

╒══════════╕
│ TRAINING │
╘══════════╛

Creating fresh model training run.
Training for 93 step(s), approximately 3 epoch(s).
Early stopping policy: 5 round(s) of evaluation, or 155 step(s), approximately 5 epoch(s).
Starting with step 0, epoch: 0
Training: 0%| | 0/93 [00:00<?, ?it/s]
/opt/conda/envs/ludwig_train_env/lib/python3.10/site-packages/bitsandbytes/autograd/_functions.py:322: UserWarning: MatMul8bitLt: inputs will be cast from torch.float32 to float16 during quantization
  warnings.warn(f"MatMul8bitLt: inputs will be cast from {A.dtype} to float16 during quantization")
Unable to complete the finetuning due to error Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu! (when checking argument for argument mat2 in method wrapper_CUDA_mm)
Training: 0%| | 0/93 [00:00<?, ?it/s]
Can you help us solve this? TIA
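For anyone debugging the "Expected all tensors to be on the same device" error, one diagnostic step is to list which parameters sit on which device before training starts. The sketch below only does the grouping and is written in plain Python so it runs without a GPU. The helper names `group_by_device` and `find_off_device` are hypothetical, and it assumes you can reach the underlying `torch.nn.Module` (called `module` in the comment) to build the `(name, device)` pairs. It does not fix the error, it only shows where the CPU tensors are.

```python
from collections import defaultdict


def group_by_device(named_devices):
    """Group parameter names by the device they are reported on.

    `named_devices` is an iterable of (name, device) pairs; each device is
    converted with str(), so torch.device objects or plain strings both work.
    """
    groups = defaultdict(list)
    for name, device in named_devices:
        groups[str(device)].append(name)
    return dict(groups)


def find_off_device(named_devices, expected="cuda:0"):
    """Return the sorted names of parameters that are not on `expected`."""
    return sorted(
        name for name, device in named_devices if str(device) != expected
    )


# Usage with PyTorch (assumption: `module` is the torch.nn.Module being trained):
#   pairs = [(name, p.device) for name, p in module.named_parameters()]
#   print({dev: len(names) for dev, names in group_by_device(pairs).items()})
#   print(find_off_device(pairs, expected="cuda:0"))
```

Since the log above also warns that a `peft_config` attribute was already found in the model, it may be worth checking whether the parameters reported on `cpu` belong to the newly added adapter layers.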
Hi @all, I am also facing the same issue. Did anyone resolve it?
Thanks