sambanova / generative_data_prep

Apache License 2.0

Apply Chat Template Error #102

Closed: snova-zoltanc closed this issue 3 months ago

snova-zoltanc commented 3 months ago

Command:

```shell
python -m generative_data_prep pipeline \
    --input_file_path=/Users/karent/Documents/Code/data/superglue_boolq_trainchat_template.jsonl \
    --output_path=Users/karent/Documents/Code/data/data_prep_output_trainchat_template_apply \
    --shuffle on_RAM \
    --pretrained_tokenizer=meta-llama/Llama-2-7b-chat-hf \
    --max_seq_length=4096 \
    --input_packing_config greedy::drop \
    --num_training_splits=8 \
    --num_dev_splits=0 \
    --num_test_splits=0 \
    --keep_split_jsonls \
    --apply_chat_template
```

Error message:

```
Tokenization is complete, the output dataset is located at: Users/karent/Documents/Code/data/data_prep_output_trainchat_template_apply

Balancing hdf5 files to ensure they have the same number of sequences.
Traceback (most recent call last):
  File "/opt/homebrew/Cellar/python@3.9/3.9.19/Frameworks/Python.framework/Versions/3.9/lib/python3.9/runpy.py", line 197, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/opt/homebrew/Cellar/python@3.9/3.9.19/Frameworks/Python.framework/Versions/3.9/lib/python3.9/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/Users/karent/Documents/Code/generative_data_prep/generative_data_prep/main.py", line 384, in <module>
    main(args)
  File "/Users/karent/Documents/Code/generative_data_prep/generative_data_prep/main.py", line 326, in main
    metrics, dataset_metadata = pipeline_main(
  File "/Users/karent/Documents/Code/generative_data_prep/generative_data_prep/data_prep/pipeline.py", line 651, in pipeline_main
    balance_hdf5_files(train_hdf5_files, dataset_metadata_json, "train")
  File "/Users/karent/Documents/Code/generative_data_prep/generative_data_prep/utils/balance_hdf5_files.py", line 59, in balance_hdf5_files
    tot_seqs += curr_hdf5_file["input_ids"].shape[0]
  File "h5py/_objects.pyx", line 54, in h5py._objects.with_phil.wrapper
  File "h5py/_objects.pyx", line 55, in h5py._objects.with_phil.wrapper
  File "/Users/karent/Environments/gen_data_prep_3_9/lib/python3.9/site-packages/h5py/_hl/group.py", line 357, in __getitem__
    oid = h5o.open(self.id, self._e(name), lapl=self._lapl)
  File "h5py/_objects.pyx", line 54, in h5py._objects.with_phil.wrapper
  File "h5py/_objects.pyx", line 55, in h5py._objects.with_phil.wrapper
  File "h5py/h5o.pyx", line 189, in h5py.h5o.open
KeyError: "Unable to open object (object 'input_ids' doesn't exist)"
```

I am able to get it to work without the `--apply_chat_template` flag:

```shell
python -m generative_data_prep pipeline \
    --input_file_path=/Users/karent/Documents/Code/data/superglue_boolq_trainchat_template.jsonl \
    --output_path=/Users/karent/Documents/Code/data/data_prep_output_trainchat_template_new \
    --shuffle on_RAM \
    --pretrained_tokenizer=meta-llama/Llama-2-7b-chat-hf \
    --max_seq_length=4096 \
    --input_packing_config greedy::drop \
    --num_training_splits=8 \
    --num_dev_splits=0 \
    --num_test_splits=0 \
    --keep_split_jsonls \
    --prompt_prefix '[INST] ' \
    --prompt_postfix '[/INST]'
```
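For anyone hitting the same `KeyError` before upgrading, the failing loop in `balance_hdf5_files` can be guarded so that split files lacking an `input_ids` dataset are skipped and reported instead of crashing. This is only my own sketch of a workaround, not the actual fix from the PR; the function name and skip-and-report behavior are invented here. Since h5py groups support `in` membership tests like dicts, plain dicts stand in for them in a quick local test:

```python
def count_sequences(hdf5_files):
    """Sum sequence counts across HDF5 split files, skipping any file
    that has no 'input_ids' dataset (e.g. an empty split left behind
    when greedy::drop discards every sequence assigned to it)."""
    tot_seqs = 0
    skipped = []
    for name, group in hdf5_files.items():
        # h5py Group objects support `in` just like dicts, so this
        # check avoids the KeyError raised by group["input_ids"].
        if "input_ids" not in group:
            skipped.append(name)
            continue
        tot_seqs += group["input_ids"].shape[0]
    return tot_seqs, skipped
```

Usage with dict stand-ins for h5py groups: `count_sequences({"train_0.h5": {"input_ids": ds}, "train_1.h5": {}})` would return the total from `train_0.h5` and list `train_1.h5` as skipped, which also helps confirm which split file came out empty.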

snova-zoltanc commented 3 months ago

Fixed by PR https://github.com/sambanova/generative_data_prep/pull/104