universal-ie / UIE

Unified Structure Generation for Universal Information Extraction

Scale problem when loading the pre-training data #59

Open williamSYSU opened 1 year ago

williamSYSU commented 1 year ago

Hello Dr. Lu, thank you very much for releasing the code of the UIE model!

When the program loads the constructed pre-training data, it fails with the following error:


Traceback (most recent call last):
  File "/data/miniconda3/envs/env-3.8.8/lib/python3.8/site-packages/datasets/builder.py", line 1874, in _prepare_split_single
    writer.write_table(table)
  File "/data/miniconda3/envs/env-3.8.8/lib/python3.8/site-packages/datasets/arrow_writer.py", line 567, in write_table
    pa_table = pa_table.combine_chunks()
  File "pyarrow/table.pxi", line 3315, in pyarrow.lib.Table.combine_chunks
  File "pyarrow/error.pxi", line 144, in pyarrow.lib.pyarrow_internal_check_status
  File "pyarrow/error.pxi", line 100, in pyarrow.lib.check_status
pyarrow.lib.ArrowInvalid: offset overflow while concatenating arrays

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "run_uie_pretrain.py", line 509, in <module>
    main()
  File "run_uie_pretrain.py", line 148, in main
    datasets = load_dataset(
  File "/data/miniconda3/envs/env-3.8.8/lib/python3.8/site-packages/datasets/load.py", line 1782, in load_dataset
    builder_instance.download_and_prepare(
  File "/data/miniconda3/envs/env-3.8.8/lib/python3.8/site-packages/datasets/builder.py", line 872, in download_and_prepare
    self._download_and_prepare(
  File "/data/miniconda3/envs/env-3.8.8/lib/python3.8/site-packages/datasets/builder.py", line 967, in _download_and_prepare
    self._prepare_split(split_generator, **prepare_split_kwargs)
  File "/data/miniconda3/envs/env-3.8.8/lib/python3.8/site-packages/datasets/builder.py", line 1749, in _prepare_split
    for job_id, done, content in self._prepare_split_single(
  File "/data/miniconda3/envs/env-3.8.8/lib/python3.8/site-packages/datasets/builder.py", line 1892, in _prepare_split_single
    raise DatasetGenerationError("An error occurred while generating the dataset") from e
datasets.builder.DatasetGenerationError: An error occurred while generating the dataset

The error above occurs when the dataset contains 5M examples; when it is reduced to 1M examples, the program runs normally. Judging from the traceback, the failure is caused by the dataset being too large to load, even though memory is not full at that point.

So I would like to ask a few questions:

  1. Does the program load the dataset entirely into memory? The paper reports a dataset of 65M * 3 = 195M examples; how was pre-training carried out at that scale?
  2. Is there a training setup that streams the data instead of loading it all at once?