richarddwang / electra_pytorch

Pretrain and finetune ELECTRA with fastai and huggingface. (Results of the paper replicated !)

Pyarrow dataloading issue #32

Closed skaulintel closed 2 years ago

skaulintel commented 2 years ago

Hi Richard,

I get the following pyarrow issue when trying to load the openwebtext corpus dataset:

```
Traceback (most recent call last):
  File "pretrain.py", line 150, in <module>
    e_owt = ELECTRAProcessor(owt, apply_cleaning=False).map(cache_file_name=f"electra_owt_{c.max_length}.arrow", num_proc=1)
  File "/root/_utils/utils.py", line 120, in map
    return self.hf_dset.my_map(
  File "/usr/local/lib/python3.8/dist-packages/hugdatafast/transform.py", line 23, in my_map
    return self.map(*args, cache_file_name=cache_file_name, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/datasets/arrow_dataset.py", line 2102, in map
    return self._map_single(
  File "/usr/local/lib/python3.8/dist-packages/datasets/arrow_dataset.py", line 518, in wrapper
    out: Union["Dataset", "DatasetDict"] = func(self, *args, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/datasets/arrow_dataset.py", line 485, in wrapper
    out: Union["Dataset", "DatasetDict"] = func(self, *args, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/datasets/fingerprint.py", line 413, in wrapper
    out = func(self, *args, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/datasets/arrow_dataset.py", line 2498, in _map_single
    writer.write_batch(batch)
  File "/usr/local/lib/python3.8/dist-packages/datasets/arrow_writer.py", line 499, in write_batch
    self.write_table(pa_table, writer_batch_size)
  File "/usr/local/lib/python3.8/dist-packages/datasets/arrow_writer.py", line 516, in write_table
    self.pa_writer.write_batch(batch)
  File "pyarrow/ipc.pxi", line 384, in pyarrow.lib._CRecordBatchWriter.write_batch
  File "pyarrow/error.pxi", line 99, in pyarrow.lib.check_status
pyarrow.lib.ArrowInvalid: Tried to write record batch with different schema
```

Any ideas?

Best, Shiv

richarddwang commented 2 years ago

Hi, I didn't encounter an error like this. Maybe you can try:

  1. Downgrade huggingface/datasets to 1.1.0 as shown in requirements.txt
  2. Put a pdb breakpoint here and see what is happening. https://github.com/richarddwang/electra_pytorch/blob/ab29d03e69c6fb37df238e653c8d1a81240e3dd6/_utils/utils.py#L149-L150
richarddwang commented 2 years ago

Hi @skaulintel, I recently updated hf/datasets for another project and ran into the same problem in a similar setting. After some debugging, I found a potential bug in hf/datasets (https://github.com/huggingface/datasets/pull/3782), and I have modified the data processor by setting disable_nullable to False here: https://github.com/richarddwang/electra_pytorch/blob/ba35cf6b85ba1c3264c44c2f67e18d46d5e84f52/_utils/utils.py#L124 It should work now. If there is anything else I can help with, please tag me to reopen the issue.