Closed. Katehuuh closed this issue 11 months ago.
Add "subset": "python"
to this object in the dataset_info.json:
https://github.com/hiyouga/LLaMA-Factory/blob/d42c0b1d3482af5912ebe578b3e6b4d08cd7ee99/data/dataset_info.json#L277-L282
How could I include all of them, or pass a list such as "subset": ["ada", "agda", ...]?
We cannot provide a list to the subset field, so it can only be used in the following way, together with --dataset starcoder_py,starcoder_cpp:
"starcoder_py": {
"hf_hub_url": "bigcode/starcoderdata",
"columns": {
"prompt": "content"
},
"subset": "python"
},
"starcoder_cpp": {
"hf_hub_url": "bigcode/starcoderdata",
"columns": {
"prompt": "content"
},
"subset": "cpp"
}
This still gives me the same error with:
```json
  ...
      "prompt": "content"
    }
  },
  "starcoder": {
    "hf_hub_url": "bigcode/starcoderdata",
    "columns": {
      "prompt": "content"
    },
    "subset": "python"
  }
}
```
Same for "subset": "cpp"
. (see above "Full error output")
Could the issue be related to specific "subset" values, e.g. jupyter-scripts-dedup-filtered, jupyter-structured-clean-dedup, github-issues-filtered-structured, git-commits-cleaned?
Fixed. Use the folder field to specify the programming language category.
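For illustration, a minimal sketch of what this presumably amounts to, assuming the folder field is forwarded to the data_dir argument of datasets.load_dataset; the entry shown in the comments and the cache path are taken from this thread, not from the repository:

```python
# Hedged sketch only: what a dataset_info.json entry using the "folder" field
# presumably resolves to. The entry below and the data_dir mapping are
# assumptions, not copied from the LLaMA-Factory source at this commit:
#
#   "starcoder_python": {
#     "hf_hub_url": "bigcode/starcoderdata",
#     "columns": { "prompt": "content" },
#     "folder": "python"
#   }
from datasets import load_dataset

ds = load_dataset(
    "bigcode/starcoderdata",
    data_dir="python",              # language folder, e.g. "python", "cpp", "ada"
    split="train",
    cache_dir="G:/Dataset/cache",   # cache path taken from the logs in this thread
)
print(ds.features)
```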
I have it working "starcoder_python"
and starcoder_ada,starcoder_agda
but trying to load all (92 category folder of programming language) than give me following error:
```
Loading Dataset info from G:/Dataset/cache/datasets/bigcode___starcoderdata/default-864eb43342f90586/0.0.0/0111277fb19b16f696664cde7f0cb90f833dec72db2cc73cfdf87e697f78fe02
Traceback (most recent call last):
File "C:\LLaMA-Factory\src\train_bash.py", line 14, in <module>
main()
File "C:\LLaMA-Factory\src\train_bash.py", line 5, in main
run_exp()
File "C:\LLaMA-Factory\src\llmtuner\train\tuner.py", line 24, in run_exp
run_pt(model_args, data_args, training_args, finetuning_args, callbacks)
File "C:\LLaMA-Factory\src\llmtuner\train\pt\workflow.py", line 24, in run_pt
dataset = get_dataset(model_args, data_args)
File "C:\LLaMA-Factory\src\llmtuner\data\loader.py", line 138, in get_dataset
return concatenate_datasets(all_datasets)
File "C:\LLaMA-Factory\venv\lib\site-packages\datasets\combine.py", line 213, in concatenate_datasets
return _concatenate_map_style_datasets(dsets, info=info, split=split, axis=axis)
File "C:\LLaMA-Factory\venv\lib\site-packages\datasets\arrow_dataset.py", line 6002, in _concatenate_map_style_datasets
_check_if_features_can_be_aligned([dset.features for dset in dsets])
File "C:\LLaMA-Factory\venv\lib\site-packages\datasets\features\features.py", line 2122, in _check_if_features_can_be_aligned
raise ValueError(
ValueError: The features can't be aligned because the key max_stars_count of features {'max_stars_repo_path': Value(dtype='string', id=None), 'max_stars_repo_name': Value(dtype='string', id=None), 'max_stars_count': Value(dtype='float64', id=None), 'id': Value(dtype='string', id=None), 'prompt': Value(dtype='string', id=None)} has unexpected type - Value(dtype='float64', id=None) (expected either Value(dtype='int64', id=None) or Value("null").
```
I have tried removing datasets, with no success. Can it print out or skip the problematic dataset?
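As a possible workaround (a sketch only, not something LLaMA-Factory itself does), the mismatched max_stars_count column could be cast to a single dtype before concatenation. The folder names are the pair reported later in this thread, and the int64 target is an assumption based on the error message above:

```python
# Sketch of a possible workaround: cast the max_stars_count column to one
# dtype before concatenation so the feature-alignment check no longer fails.
# Assumes the float64 values are integral star counts (or null).
from datasets import Value, concatenate_datasets, load_dataset

parts = []
for folder in ("bluespec", "c-sharp"):
    ds = load_dataset("bigcode/starcoderdata", data_dir=folder, split="train")
    if "max_stars_count" in ds.column_names:
        # Some folders store star counts as float64, others as int64.
        ds = ds.cast_column("max_stars_count", Value("int64"))
    parts.append(ds)

merged = concatenate_datasets(parts)
print(merged.features["max_stars_count"])
```

Alternatively, dropping the max_stars_* metadata columns before concatenation would sidestep the alignment check entirely, since only the content/prompt column appears to be used for pre-training.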
@hiyouga with only --dataset starcoder_bluespec,starcoder_c-sharp I get the same error, but it works fine when using only one of the two (e.g. starcoder_c-sharp):
```
12/11/2023 06:30:24 - INFO - llmtuner.data.loader - Loading dataset bigcode/starcoderdata...
Resolving data files: 100%|████████████████████████████████████████████████████████████| 45/45 [00:00<00:00, 67.30it/s]
Using custom data configuration default-d7cce75433355113
Loading Dataset Infos from C:\LLaMA-Factory\venv\lib\site-packages\datasets\packaged_modules\parquet
Overwrite dataset info from restored data version if exists.
Loading Dataset info from G:\Dataset\cache\datasets\bigcode___starcoderdata\default-d7cce75433355113\0.0.0\0111277fb19b16f696664cde7f0cb90f833dec72db2cc73cfdf87e697f78fe02
Found cached dataset starcoderdata (G:/Dataset/cache/datasets/bigcode___starcoderdata/default-d7cce75433355113/0.0.0/0111277fb19b16f696664cde7f0cb90f833dec72db2cc73cfdf87e697f78fe02)
Loading Dataset info from G:/Dataset/cache/datasets/bigcode___starcoderdata/default-d7cce75433355113/0.0.0/0111277fb19b16f696664cde7f0cb90f833dec72db2cc73cfdf87e697f78fe02
[INFO|tokenization_utils_base.py:2041] 2023-12-11 06:35:22,595 >> loading file tokenizer.model
[INFO|tokenization_utils_base.py:2041] 2023-12-11 06:35:22,596 >> loading file tokenizer.json
[INFO|tokenization_utils_base.py:2041] 2023-12-11 06:35:22,596 >> loading file added_tokens.json
[INFO|tokenization_utils_base.py:2041] 2023-12-11 06:35:22,596 >> loading file special_tokens_map.json
[INFO|tokenization_utils_base.py:2041] 2023-12-11 06:35:22,596 >> loading file tokenizer_config.json
[INFO|configuration_utils.py:713] 2023-12-11 06:35:22,757 >> loading configuration file C:\LLaMA-Factory\checkpoints\Llama-2-13b-chat-hf\config.json
[INFO|configuration_utils.py:775] 2023-12-11 06:35:22,758 >> Model config LlamaConfig {
"_name_or_path": "C:\\LLaMA-Factory\\checkpoints\\Llama-2-13b-chat-hf",
"architectures": [
"LlamaForCausalLM"
],
"attention_bias": false,
"bos_token_id": 1,
"eos_token_id": 2,
"hidden_act": "silu",
"hidden_size": 5120,
"initializer_range": 0.02,
"intermediate_size": 13824,
"max_position_embeddings": 4096,
"model_type": "llama",
"num_attention_heads": 40,
"num_hidden_layers": 40,
"num_key_value_heads": 40,
"pretraining_tp": 1,
"rms_norm_eps": 1e-05,
"rope_scaling": null,
"rope_theta": 10000.0,
"tie_word_embeddings": false,
"torch_dtype": "float16",
"transformers_version": "4.34.0",
"use_cache": true,
"vocab_size": 32000
}
12/11/2023 06:35:22 - WARNING - llmtuner.model.loader - Input length is smaller than max length. Consider increase input length.
12/11/2023 06:35:22 - INFO - llmtuner.model.loader - Using linear scaling strategy and setting scaling factor to 1.0
12/11/2023 06:35:22 - INFO - llmtuner.model.loader - Using FlashAttention-2 for faster training and inference.
12/11/2023 06:35:22 - INFO - llmtuner.model.loader - Quantizing model to 4 bit.
[INFO|modeling_utils.py:2990] 2023-12-11 06:35:23,770 >> loading weights file C:\LLaMA-Factory\checkpoints\Llama-2-13b-chat-hf\model.safetensors.index.json
[INFO|modeling_utils.py:1220] 2023-12-11 06:35:23,800 >> Instantiating LlamaForCausalLM model under default dtype torch.bfloat16.
[INFO|configuration_utils.py:770] 2023-12-11 06:35:23,801 >> Generate config GenerationConfig {
"bos_token_id": 1,
"eos_token_id": 2
}
[INFO|modeling_utils.py:3103] 2023-12-11 06:35:24,469 >> Detected 4-bit loading: activating 4-bit loading for this model
Loading checkpoint shards: 0%|
...
KeyboardInterrupt
```
And it is not the case for all of them; the first 10 work fine: --dataset starcoder_ada,starcoder_agda,starcoder_alloy,starcoder_antlr,starcoder_applescript,starcoder_assembly,starcoder_augeas,starcoder_awk,starcoder_batchfile,starcoder_bluespec
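To answer the "can it print the issue dataset" question above, a quick scan of each folder's schema would reveal which ones carry the float64 variant (again only a sketch; the folder list is illustrative, not the full 92):

```python
# Sketch: scan a set of StarCoder language folders (already in the local cache)
# and print the dtype of max_stars_count, so folders with the float64 variant
# can be spotted and skipped or cast.
from datasets import load_dataset

folders = ["ada", "agda", "bluespec", "c-sharp"]
for folder in folders:
    ds = load_dataset("bigcode/starcoderdata", data_dir=folder, split="train")
    feature = ds.features.get("max_stars_count")
    print(f"{folder}: max_stars_count -> {feature.dtype if feature else 'missing'}")
```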
Reminder
Reproduction
After downloading the provided pre-training dataset StarCoder (en) (783GB), run with:
--dataset_dir data --dataset starcoder
Full error output:
```cmd
_n_gpu=1, adafactor=False, adam_beta1=0.9, adam_beta2=0.999, adam_epsilon=1e-08, auto_find_batch_size=False, bf16=True, bf16_full_eval=False, data_seed=None, dataloader_drop_last=False, dataloader_num_workers=0, dataloader_pin_memory=True, ddp_backend=None, ddp_broadcast_buffers=None, ddp_bucket_cap_mb=None, ddp_find_unused_parameters=False, ddp_timeout=1800, debug=[], deepspeed=None, disable_tqdm=False, dispatch_batches=None, do_eval=False, do_predict=False, do_train=True, eval_accumulation_steps=None, eval_delay=0, eval_steps=None, evaluation_strategy=no, fp16=False, fp16_backend=auto, fp16_full_eval=False, fp16_opt_level=O1, fsdp=[], fsdp_config={'min_num_params': 0, 'xla': False, 'xla_fsdp_grad_ckpt': False}, fsdp_min_num_params=0, fsdp_transformer_layer_cls_to_wrap=None, full_determinism=False, generation_config=None, generation_max_length=None, generation_num_beams=None, gradient_accumulation_steps=1, gradient_checkpointing=False, greater_is_better=None, group_by_length=False, half_precision_backend=auto, hub_always_push=False, hub_model_id=None, hub_private_repo=False, hub_strategy=every_save, hub_token=
```
Expected behavior
No response
System Info
Environment on commit d42c0b1:
- `transformers` version: 4.34.0
- Platform: Windows-10-10.0.22621-SP0
- Python version: 3.10.8
- Huggingface_hub version: 0.17.3
- Safetensors version: 0.4.0
- Accelerate version: 0.23.0
- Accelerate config:
  - compute_environment: LOCAL_MACHINE
  - distributed_type: NO
  - mixed_precision: bf16
  - use_cpu: False
  - debug: False
  - num_processes: 1
  - machine_rank: 0
  - num_machines: 1
  - gpu_ids: all
  - rdzv_backend: static
  - same_network: True
  - main_training_function: main
  - downcast_bf16: no
  - tpu_use_cluster: False
  - tpu_use_sudo: False
  - tpu_env: []
- PyTorch version (GPU?): 2.2.0.dev20231013+cu121 (True)
- Tensorflow version (GPU?): not installed (NA)
- Flax version (CPU?/GPU?/TPU?): not installed (NA)
- Jax version: not installed
- JaxLib version: not installed
- Using GPU in script?: Yes
Others
No response