hiyouga / LLaMA-Factory

Unified Efficient Fine-Tuning of 100+ LLMs (ACL 2024)
https://arxiv.org/abs/2403.13372
Apache License 2.0

DatasetGenerationError starcoder #1784

Closed Katehuuh closed 11 months ago

Katehuuh commented 11 months ago

Reminder

Reproduction

After downloading the provided pre-training dataset StarCoder (en) (783GB) and running with --dataset_dir data --dataset starcoder:

Failed to read file 'C:\Users\Kate\.cache\huggingface\datasets\downloads\3680f5de2959a097abc9b9867e4fc2ea930bf9fa2cc78b6b6826cb2cef29e265' with error <class 'ValueError'>: Couldn't cast
id: string
content: string
-- schema metadata --
huggingface: '{"info": {"features": {"id": {"dtype": "string", "_type": "' + 60
to
{'max_stars_repo_path': Value(dtype='string', id=None), 'max_stars_repo_name': Value(dtype='string', id=None), 'max_stars_count': Value(dtype='int64', id=None), 'id': Value(dtype='string', id=None), 'content': Value(dtype='string', id=None)}
because column names don't match
...
  File "C:\LLaMA-Factory\venv\lib\site-packages\datasets\builder.py", line 1958, in _prepare_split_single
    raise DatasetGenerationError("An error occurred while generating the dataset") from e
datasets.builder.DatasetGenerationError: An error occurred while generating the dataset
Full error output
```cmd _n_gpu=1, adafactor=False, adam_beta1=0.9, adam_beta2=0.999, adam_epsilon=1e-08, auto_find_batch_size=False, bf16=True, bf16_full_eval=False, data_seed=None, dataloader_drop_last=False, dataloader_num_workers=0, dataloader_pin_memory=True, ddp_backend=None, ddp_broadcast_buffers=None, ddp_bucket_cap_mb=None, ddp_find_unused_parameters=False, ddp_timeout=1800, debug=[], deepspeed=None, disable_tqdm=False, dispatch_batches=None, do_eval=False, do_predict=False, do_train=True, eval_accumulation_steps=None, eval_delay=0, eval_steps=None, evaluation_strategy=no, fp16=False, fp16_backend=auto, fp16_full_eval=False, fp16_opt_level=O1, fsdp=[], fsdp_config={'min_num_params': 0, 'xla': False, 'xla_fsdp_grad_ckpt': False}, fsdp_min_num_params=0, fsdp_transformer_layer_cls_to_wrap=None, full_determinism=False, generation_config=None, generation_max_length=None, generation_num_beams=None, gradient_accumulation_steps=1, gradient_checkpointing=False, greater_is_better=None, group_by_length=False, half_precision_backend=auto, hub_always_push=False, hub_model_id=None, hub_private_repo=False, hub_strategy=every_save, hub_token=, ignore_data_skip=False, include_inputs_for_metrics=False, include_tokens_per_second=False, jit_mode_eval=False, label_names=None, label_smoothing_factor=0.0, learning_rate=5e-05, length_column_name=length, load_best_model_at_end=False, local_rank=0, log_level=passive, log_level_replica=warning, log_on_each_node=True, logging_dir=saves\LLaMA2-13B-Chat\lora\QLoRA-Llama13B\runs\Dec09_01-35-37_user, logging_first_step=False, logging_nan_inf_filter=True, logging_steps=5, logging_strategy=steps, lr_scheduler_type=cosine, max_grad_norm=1.0, max_steps=-1, metric_for_best_model=None, mp_parameters=, no_cuda=False, num_train_epochs=1.0, optim=adamw_torch, optim_args=None, output_dir=saves\LLaMA2-13B-Chat\lora\QLoRA-Llama13B, overwrite_output_dir=False, past_index=-1, per_device_eval_batch_size=8, per_device_train_batch_size=1, predict_with_generate=False, prediction_loss_only=False, push_to_hub=False, push_to_hub_model_id=None, push_to_hub_organization=None, push_to_hub_token=, ray_scope=last, remove_unused_columns=True, report_to=[], resume_from_checkpoint=saves\LLaMA2-13B-Chat\lora\QLoRA-Llama13B\checkpoint-200000, run_name=saves\LLaMA2-13B-Chat\lora\QLoRA-Llama13B, save_on_each_node=False, save_safetensors=False, save_steps=500, save_strategy=steps, save_total_limit=None, seed=42, sharded_ddp=[], skip_memory_metrics=True, sortish_sampler=False, tf32=None, torch_compile=False, torch_compile_backend=None, torch_compile_mode=None, torchdynamo=None, tpu_metrics_debug=False, tpu_num_cores=None, use_cpu=False, use_ipex=False, use_legacy_prediction_loop=False, use_mps_device=False, warmup_ratio=0.0, warmup_steps=0, weight_decay=0.0, ) 12/08/2023 20:35:37 - INFO - llmtuner.data.loader - Loading dataset bigcode/starcoderdata... 
Resolving data files: 100%|████████████████████████████████████████████████████████| 863/863 [00:00<00:00, 1652.60it/s] Using custom data configuration default-5b378f25f9ab6f16 Loading Dataset Infos from C:\LLaMA-Factory\venv\lib\site-packages\datasets\packaged_modules\parquet Generating dataset starcoderdata (C:/Users/Kate/.cache/huggingface/datasets/bigcode___starcoderdata/default-5b378f25f9ab6f16/0.0.0/0111277fb19b16f696664cde7f0cb90f833dec72db2cc73cfdf87e697f78fe02) Downloading and preparing dataset starcoderdata/default to C:/Users/Kate/.cache/huggingface/datasets/bigcode___starcoderdata/default-5b378f25f9ab6f16/0.0.0/0111277fb19b16f696664cde7f0cb90f833dec72db2cc73cfdf87e697f78fe02... Dataset not on Hf google storage. Downloading and preparing it from source Downloading data files: 100%|████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 5.29it/s] Downloading took 0.0 min Checksum Computation took 0.0 min Extracting data files: 100%|█████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 1.52it/s] Generating train split Generating train split: 31955357 examples [05:12, 48734.96 examples/s]Failed to read file 'C:\Users\Kate\.cache\huggingface\datasets\downloads\3680f5de2959a097abc9b9867e4fc2ea930bf9fa2cc78b6b6826cb2cef29e265' with error : Couldn't cast id: string content: string -- schema metadata -- huggingface: '{"info": {"features": {"id": {"dtype": "string", "_type": "' + 60 to {'max_stars_repo_path': Value(dtype='string', id=None), 'max_stars_repo_name': Value(dtype='string', id=None), 'max_stars_count': Value(dtype='int64', id=None), 'id': Value(dtype='string', id=None), 'content': Value(dtype='string', id=None)} because column names don't match Generating train split: 31955357 examples [05:12, 102113.35 examples/s] Traceback (most recent call last): File "C:\LLaMA-Factory\venv\lib\site-packages\datasets\builder.py", line 1925, in _prepare_split_single for _, table in generator: File "C:\LLaMA-Factory\venv\lib\site-packages\datasets\packaged_modules\parquet\parquet.py", line 86, in _generate_tables yield f"{file_idx}_{batch_idx}", self._cast_table(pa_table) File "C:\LLaMA-Factory\venv\lib\site-packages\datasets\packaged_modules\parquet\parquet.py", line 66, in _cast_table pa_table = table_cast(pa_table, self.info.features.arrow_schema) File "C:\LLaMA-Factory\venv\lib\site-packages\datasets\table.py", line 2328, in table_cast return cast_table_to_schema(table, schema) File "C:\LLaMA-Factory\venv\lib\site-packages\datasets\table.py", line 2286, in cast_table_to_schema raise ValueError(f"Couldn't cast\n{table.schema}\nto\n{features}\nbecause column names don't match") ValueError: Couldn't cast id: string content: string -- schema metadata -- huggingface: '{"info": {"features": {"id": {"dtype": "string", "_type": "' + 60 to {'max_stars_repo_path': Value(dtype='string', id=None), 'max_stars_repo_name': Value(dtype='string', id=None), 'max_stars_count': Value(dtype='int64', id=None), 'id': Value(dtype='string', id=None), 'content': Value(dtype='string', id=None)} because column names don't match The above exception was the direct cause of the following exception: Traceback (most recent call last): File "C:\LLaMA-Factory\src\train_bash.py", line 14, in main() File "C:\LLaMA-Factory\src\train_bash.py", line 5, in main run_exp() File "C:\LLaMA-Factory\src\llmtuner\train\tuner.py", line 24, in run_exp run_pt(model_args, data_args, training_args, finetuning_args, callbacks) File "C:\LLaMA-Factory\src\llmtuner\train\pt\workflow.py", line 
24, in run_pt dataset = get_dataset(model_args, data_args) File "C:\LLaMA-Factory\src\llmtuner\data\loader.py", line 56, in get_dataset dataset = load_dataset( File "C:\LLaMA-Factory\venv\lib\site-packages\datasets\load.py", line 2153, in load_dataset builder_instance.download_and_prepare( File "C:\LLaMA-Factory\venv\lib\site-packages\datasets\builder.py", line 954, in download_and_prepare self._download_and_prepare( File "C:\LLaMA-Factory\venv\lib\site-packages\datasets\builder.py", line 1049, in _download_and_prepare self._prepare_split(split_generator, **prepare_split_kwargs) File "C:\LLaMA-Factory\venv\lib\site-packages\datasets\builder.py", line 1813, in _prepare_split for job_id, done, content in self._prepare_split_single( File "C:\LLaMA-Factory\venv\lib\site-packages\datasets\builder.py", line 1958, in _prepare_split_single raise DatasetGenerationError("An error occurred while generating the dataset") from e datasets.builder.DatasetGenerationError: An error occurred while generating the dataset ```

Expected behavior

No response

System Info

Environment on commit d42c0b1:
- `transformers` version: 4.34.0
- Platform: Windows-10-10.0.22621-SP0
- Python version: 3.10.8
- Huggingface_hub version: 0.17.3
- Safetensors version: 0.4.0
- Accelerate version: 0.23.0
- Accelerate config:
  - compute_environment: LOCAL_MACHINE
  - distributed_type: NO
  - mixed_precision: bf16
  - use_cpu: False
  - debug: False
  - num_processes: 1
  - machine_rank: 0
  - num_machines: 1
  - gpu_ids: all
  - rdzv_backend: static
  - same_network: True
  - main_training_function: main
  - downcast_bf16: no
  - tpu_use_cluster: False
  - tpu_use_sudo: False
  - tpu_env: []
- PyTorch version (GPU?): 2.2.0.dev20231013+cu121 (True)
- Tensorflow version (GPU?): not installed (NA)
- Flax version (CPU?/GPU?/TPU?): not installed (NA)
- Jax version: not installed
- JaxLib version: not installed
- Using GPU in script?: Yes

Others

No response

hiyouga commented 11 months ago

Add "subset": "python" to this object in the dataset_info.json: https://github.com/hiyouga/LLaMA-Factory/blob/d42c0b1d3482af5912ebe578b3e6b4d08cd7ee99/data/dataset_info.json#L277-L282

Katehuuh commented 11 months ago

How could I include all of them, or pass a list like "subset": [ "ada", "agda", ... ]?

hiyouga commented 11 months ago

We cannot pass a list to the subset field, so it can only be used in the following way, selecting the entries at training time with --dataset starcoder_py,starcoder_cpp:

"starcoder_py": { 
  "hf_hub_url": "bigcode/starcoderdata",
  "columns": {
    "prompt": "content"
  },
  "subset": "python"
},
"starcoder_cpp": {
  "hf_hub_url": "bigcode/starcoderdata",
  "columns": {
    "prompt": "content"
  },
  "subset": "cpp"
}
Katehuuh commented 11 months ago

This still gives me the same error:

```json
...
      "prompt": "content"
    }
  },
  "starcoder": {
    "hf_hub_url": "bigcode/starcoderdata",
    "columns": {
      "prompt": "content"
    },
    "subset": "python"
  }
}
```
Katehuuh commented 11 months ago

Same for "subset": "cpp". (see above "Full error output") Could be issue related to specific"subset": jupyter-scripts-dedup-filtered, jupyter-structured-clean-dedup, github-issues-filtered-structured, git-commits-cleaned.

hiyouga commented 11 months ago

Fixed, use the folder field to specify the programming language category:

https://github.com/hiyouga/LLaMA-Factory/blob/28d5de7e785f31b223a4646c9c1c770f43e187ec/data/dataset_info.json#L277-L283
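
For reference, the folder field restricts the load to a single language directory of bigcode/starcoderdata, so every resolved shard shares one schema. Roughly speaking this corresponds to passing data_dir when calling datasets directly; a hedged sketch, not the LLaMA-Factory loader code:

```python
from datasets import load_dataset

# Loading a single language folder keeps every resolved shard on one schema,
# which avoids the "column names don't match" cast error above. The data_dir
# value plays the same role as the `folder` field in dataset_info.json.
ds = load_dataset(
    "bigcode/starcoderdata",
    data_dir="python",
    split="train",
    streaming=True,  # avoid materializing hundreds of GB locally
)
print(next(iter(ds)).keys())
```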

Katehuuh commented 11 months ago

I have "starcoder_python" working, as well as starcoder_ada,starcoder_agda, but trying to load all 92 programming language folders gives me the following error:

Loading Dataset info from G:/Dataset/cache/datasets/bigcode___starcoderdata/default-864eb43342f90586/0.0.0/0111277fb19b16f696664cde7f0cb90f833dec72db2cc73cfdf87e697f78fe02
Traceback (most recent call last):
  File "C:\LLaMA-Factory\src\train_bash.py", line 14, in <module>
    main()
  File "C:\LLaMA-Factory\src\train_bash.py", line 5, in main
    run_exp()
  File "C:\LLaMA-Factory\src\llmtuner\train\tuner.py", line 24, in run_exp
    run_pt(model_args, data_args, training_args, finetuning_args, callbacks)
  File "C:\LLaMA-Factory\src\llmtuner\train\pt\workflow.py", line 24, in run_pt
    dataset = get_dataset(model_args, data_args)
  File "C:\LLaMA-Factory\src\llmtuner\data\loader.py", line 138, in get_dataset
    return concatenate_datasets(all_datasets)
  File "C:\LLaMA-Factory\venv\lib\site-packages\datasets\combine.py", line 213, in concatenate_datasets
    return _concatenate_map_style_datasets(dsets, info=info, split=split, axis=axis)
  File "C:\LLaMA-Factory\venv\lib\site-packages\datasets\arrow_dataset.py", line 6002, in _concatenate_map_style_datasets
    _check_if_features_can_be_aligned([dset.features for dset in dsets])
  File "C:\LLaMA-Factory\venv\lib\site-packages\datasets\features\features.py", line 2122, in _check_if_features_can_be_aligned
    raise ValueError(
ValueError: The features can't be aligned because the key max_stars_count of features {'max_stars_repo_path': Value(dtype='string', id=None), 'max_stars_repo_name': Value(dtype='string', id=None), 'max_stars_count': Value(dtype='float64', id=None), 'id': Value(dtype='string', id=None), 'prompt': Value(dtype='string', id=None)} has unexpected type - Value(dtype='float64', id=None) (expected either Value(dtype='int64', id=None) or Value("null").

I have tried removing these datasets, with no success:

jupyter-scripts-dedup-filtered, jupyter-structured-clean-dedup, github-issues-filtered-structured, git-commits-cleaned.

Can it print or skip the problematic dataset?
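
One way to see (and work around) which folder deviates is to load each one separately, print its features, and normalize the offending dtype before concatenating; the trace above shows max_stars_count arriving as float64 from at least one folder while the others use int64. A hedged sketch outside LLaMA-Factory (folder names are only examples):

```python
from datasets import Value, concatenate_datasets, load_dataset

folders = ["ada", "agda", "bluespec", "c-sharp"]  # example folders only
parts = []
for name in folders:
    ds = load_dataset("bigcode/starcoderdata", data_dir=name, split="train")
    print(name, ds.features)  # shows which folder stores max_stars_count as float64
    if "max_stars_count" in ds.features:
        # align the dtype so concatenate_datasets no longer rejects the mix
        ds = ds.cast_column("max_stars_count", Value("int64"))
    parts.append(ds)

dataset = concatenate_datasets(parts)
```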

Katehuuh commented 11 months ago

@hiyouga Given only --dataset starcoder_bluespec,starcoder_c-sharp, I get the same error,

but it works fine using only one of the two, e.g. starcoder_c-sharp:

)
12/11/2023 06:30:24 - INFO - llmtuner.data.loader - Loading dataset bigcode/starcoderdata...
Resolving data files: 100%|████████████████████████████████████████████████████████████| 45/45 [00:00<00:00, 67.30it/s]
Using custom data configuration default-d7cce75433355113
Loading Dataset Infos from C:\LLaMA-Factory\venv\lib\site-packages\datasets\packaged_modules\parquet
Overwrite dataset info from restored data version if exists.
Loading Dataset info from G:\Dataset\cache\datasets\bigcode___starcoderdata\default-d7cce75433355113\0.0.0\0111277fb19b16f696664cde7f0cb90f833dec72db2cc73cfdf87e697f78fe02
Found cached dataset starcoderdata (G:/Dataset/cache/datasets/bigcode___starcoderdata/default-d7cce75433355113/0.0.0/0111277fb19b16f696664cde7f0cb90f833dec72db2cc73cfdf87e697f78fe02)
Loading Dataset info from G:/Dataset/cache/datasets/bigcode___starcoderdata/default-d7cce75433355113/0.0.0/0111277fb19b16f696664cde7f0cb90f833dec72db2cc73cfdf87e697f78fe02
[INFO|tokenization_utils_base.py:2041] 2023-12-11 06:35:22,595 >> loading file tokenizer.model
[INFO|tokenization_utils_base.py:2041] 2023-12-11 06:35:22,596 >> loading file tokenizer.json
[INFO|tokenization_utils_base.py:2041] 2023-12-11 06:35:22,596 >> loading file added_tokens.json
[INFO|tokenization_utils_base.py:2041] 2023-12-11 06:35:22,596 >> loading file special_tokens_map.json
[INFO|tokenization_utils_base.py:2041] 2023-12-11 06:35:22,596 >> loading file tokenizer_config.json
[INFO|configuration_utils.py:713] 2023-12-11 06:35:22,757 >> loading configuration file C:\LLaMA-Factory\checkpoints\Llama-2-13b-chat-hf\config.json
[INFO|configuration_utils.py:775] 2023-12-11 06:35:22,758 >> Model config LlamaConfig {
  "_name_or_path": "C:\\LLaMA-Factory\\checkpoints\\Llama-2-13b-chat-hf",
  "architectures": [
    "LlamaForCausalLM"
  ],
  "attention_bias": false,
  "bos_token_id": 1,
  "eos_token_id": 2,
  "hidden_act": "silu",
  "hidden_size": 5120,
  "initializer_range": 0.02,
  "intermediate_size": 13824,
  "max_position_embeddings": 4096,
  "model_type": "llama",
  "num_attention_heads": 40,
  "num_hidden_layers": 40,
  "num_key_value_heads": 40,
  "pretraining_tp": 1,
  "rms_norm_eps": 1e-05,
  "rope_scaling": null,
  "rope_theta": 10000.0,
  "tie_word_embeddings": false,
  "torch_dtype": "float16",
  "transformers_version": "4.34.0",
  "use_cache": true,
  "vocab_size": 32000
}

12/11/2023 06:35:22 - WARNING - llmtuner.model.loader - Input length is smaller than max length. Consider increase input length.
12/11/2023 06:35:22 - INFO - llmtuner.model.loader - Using linear scaling strategy and setting scaling factor to 1.0
12/11/2023 06:35:22 - INFO - llmtuner.model.loader - Using FlashAttention-2 for faster training and inference.
12/11/2023 06:35:22 - INFO - llmtuner.model.loader - Quantizing model to 4 bit.
[INFO|modeling_utils.py:2990] 2023-12-11 06:35:23,770 >> loading weights file C:\LLaMA-Factory\checkpoints\Llama-2-13b-chat-hf\model.safetensors.index.json
[INFO|modeling_utils.py:1220] 2023-12-11 06:35:23,800 >> Instantiating LlamaForCausalLM model under default dtype torch.bfloat16.
[INFO|configuration_utils.py:770] 2023-12-11 06:35:23,801 >> Generate config GenerationConfig {
  "bos_token_id": 1,
  "eos_token_id": 2
}

[INFO|modeling_utils.py:3103] 2023-12-11 06:35:24,469 >> Detected 4-bit loading: activating 4-bit loading for this model
Loading checkpoint shards:   0%|
...
KeyboardInterrupt

This is not the case for all of them, though; the first 10 work fine: --dataset starcoder_ada,starcoder_agda,starcoder_alloy,starcoder_antlr,starcoder_applescript,starcoder_assembly,starcoder_augeas,starcoder_awk,starcoder_batchfile,starcoder_bluespec