project-baize / baize-chatbot

Let ChatGPT teach your own chatbot in hours with a single GPU!
https://arxiv.org/abs/2304.01196
GNU General Public License v3.0

try train 25G data/quora_chat_data failed #41

Open yfq512 opened 1 year ago

yfq512 commented 1 year ago

```
CUDA SETUP: Detected CUDA version 113
CUDA SETUP: Loading binary /opt/conda/envs/py38/lib/python3.8/site-packages/bitsandbytes/libbitsandbytes_cuda113.so...
Downloading and preparing dataset json/default to /root/.cache/huggingface/datasets/json/default-e59c3670f1657ac9/0.0.0/fe5dd6ea2639a6df622901539cb550cf8797e5a6b2dd7af1cf934bed8e233e6e...
Downloading data files: 100%|██████████| 1/1 [00:00<00:00, 2349.75it/s]
Extracting data files: 100%|██████████| 1/1 [00:00<00:00, 483.88it/s]
Traceback (most recent call last):
  File "/opt/conda/envs/py38/lib/python3.8/site-packages/datasets/builder.py", line 1860, in _prepare_split_single
    for _, table in generator:
  File "/opt/conda/envs/py38/lib/python3.8/site-packages/datasets/packaged_modules/json/json.py", line 113, in _generate_tables
    io.BytesIO(batch), read_options=paj.ReadOptions(block_size=block_size)
  File "pyarrow/_json.pyx", line 55, in pyarrow._json.ReadOptions.__init__
  File "pyarrow/_json.pyx", line 80, in pyarrow._json.ReadOptions.block_size.__set__
OverflowError: value too large to convert to int32_t

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "finetune.py", line 51, in <module>
    data = load_dataset("json", data_files=DATA_PATH)
  File "/opt/conda/envs/py38/lib/python3.8/site-packages/datasets/load.py", line 1791, in load_dataset
    builder_instance.download_and_prepare(
  File "/opt/conda/envs/py38/lib/python3.8/site-packages/datasets/builder.py", line 891, in download_and_prepare
    self._download_and_prepare(
  File "/opt/conda/envs/py38/lib/python3.8/site-packages/datasets/builder.py", line 986, in _download_and_prepare
    self._prepare_split(split_generator, **prepare_split_kwargs)
  File "/opt/conda/envs/py38/lib/python3.8/site-packages/datasets/builder.py", line 1748, in _prepare_split
    for job_id, done, content in self._prepare_split_single(
  File "/opt/conda/envs/py38/lib/python3.8/site-packages/datasets/builder.py", line 1893, in _prepare_split_single
    raise DatasetGenerationError("An error occurred while generating the dataset") from e
datasets.builder.DatasetGenerationError: An error occurred while generating the dataset
```

How can I fix this problem, which is caused by the training data being too large?

JetRunner commented 1 year ago

Not completely sure but this may be helpful: https://stackoverflow.com/questions/68652157/how-do-i-debug-overflowerror-value-too-large-to-convert-to-int32-t
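One workaround that often resolves this class of error: the overflow happens when pyarrow has to buffer one enormous JSON value (e.g. a single 25G top-level array) and its block size grows past the int32 limit. Converting the file to newline-delimited JSON (JSONL) and splitting it into shards keeps each parse unit small. Below is a minimal sketch; the function name `json_array_to_jsonl_shards` and the shard naming scheme are my own, and it assumes the data file is one top-level JSON array of records. `json.load` still reads the whole source into memory, so for a truly 25G file you would want to swap in a streaming parser (e.g. `ijson`).

```python
import json
import os
import tempfile

def json_array_to_jsonl_shards(src_path, out_dir, records_per_shard=100_000):
    """Convert a JSON file containing one big array of records into
    JSONL shards, so the `datasets` json loader can parse row by row
    instead of buffering the entire array in one pyarrow block."""
    with open(src_path) as f:
        records = json.load(f)  # assumes the top level is a list of dicts
    shard_paths = []
    for i in range(0, len(records), records_per_shard):
        shard_path = os.path.join(
            out_dir, f"shard_{i // records_per_shard:05d}.jsonl"
        )
        with open(shard_path, "w") as out:
            for rec in records[i:i + records_per_shard]:
                out.write(json.dumps(rec, ensure_ascii=False) + "\n")
        shard_paths.append(shard_path)
    return shard_paths

# Tiny demo with synthetic data standing in for data/quora_chat_data
tmp = tempfile.mkdtemp()
src = os.path.join(tmp, "quora_chat_data.json")
with open(src, "w") as f:
    json.dump([{"input": f"q{i}", "output": f"a{i}"} for i in range(250)], f)

shards = json_array_to_jsonl_shards(src, tmp, records_per_shard=100)
print(len(shards))  # 250 records at 100 per shard -> 3 shards
```

The shards can then be passed straight to `load_dataset("json", data_files=shards)`, so no single file ever approaches the int32 block-size ceiling.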