huggingface / datasets

🤗 The largest hub of ready-to-use datasets for ML models with fast, easy-to-use and efficient data manipulation tools
https://huggingface.co/docs/datasets
Apache License 2.0

AttributeError: 'InMemoryTable' object has no attribute '_batches' #6650

Open matsuobasho opened 5 months ago

matsuobasho commented 5 months ago

Describe the bug

Traceback (most recent call last):
  File "finetune.py", line 103, in <module>
    main(args)
  File "finetune.py", line 45, in main
    data_tokenized = data.map(partial(funcs.tokenize_function, tokenizer,
  File "/opt/conda/envs/ptca/lib/python3.8/site-packages/datasets/dataset_dict.py", line 868, in map
    {
  File "/opt/conda/envs/ptca/lib/python3.8/site-packages/datasets/dataset_dict.py", line 869, in <dictcomp>
    k: dataset.map(
  File "/opt/conda/envs/ptca/lib/python3.8/site-packages/datasets/arrow_dataset.py", line 592, in wrapper
    out: Union["Dataset", "DatasetDict"] = func(self, *args, **kwargs)
  File "/opt/conda/envs/ptca/lib/python3.8/site-packages/datasets/arrow_dataset.py", line 557, in wrapper
    out: Union["Dataset", "DatasetDict"] = func(self, *args, **kwargs)
  File "/opt/conda/envs/ptca/lib/python3.8/site-packages/datasets/arrow_dataset.py", line 3093, in map
    for rank, done, content in Dataset._map_single(**dataset_kwargs):
  File "/opt/conda/envs/ptca/lib/python3.8/site-packages/datasets/arrow_dataset.py", line 3432, in _map_single
    arrow_formatted_shard = shard.with_format("arrow")
  File "/opt/conda/envs/ptca/lib/python3.8/site-packages/datasets/arrow_dataset.py", line 2667, in with_format
    dataset = copy.deepcopy(self)
  File "/opt/conda/envs/ptca/lib/python3.8/copy.py", line 172, in deepcopy
    y = _reconstruct(x, memo, *rv)
  File "/opt/conda/envs/ptca/lib/python3.8/copy.py", line 270, in _reconstruct
    state = deepcopy(state, memo)
  File "/opt/conda/envs/ptca/lib/python3.8/copy.py", line 146, in deepcopy
    y = copier(x, memo)
  File "/opt/conda/envs/ptca/lib/python3.8/copy.py", line 230, in _deepcopy_dict
    y[deepcopy(key, memo)] = deepcopy(value, memo)
  File "/opt/conda/envs/ptca/lib/python3.8/copy.py", line 153, in deepcopy
    y = copier(memo)
  File "/opt/conda/envs/ptca/lib/python3.8/site-packages/datasets/table.py", line 176, in __deepcopy__
    memo[id(self._batches)] = list(self._batches)
AttributeError: 'InMemoryTable' object has no attribute '_batches'
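
The traceback shows `InMemoryTable.__deepcopy__` reading `self._batches`, so the failure mode is an instance that somehow reached `deepcopy` without that attribute ever being assigned. A minimal, library-free sketch of the same mechanism (the `Table` class here is a toy stand-in, not the actual datasets implementation):

```python
import copy

class Table:
    """Toy stand-in: __deepcopy__ reads an attribute that __init__ sets."""

    def __init__(self, batches):
        self._batches = batches

    def __deepcopy__(self, memo):
        # Raises AttributeError if _batches was never assigned.
        memo[id(self._batches)] = list(self._batches)
        return Table(memo[id(self._batches)])

# Normal construction: deepcopy works.
ok = copy.deepcopy(Table([1, 2, 3]))

# An instance created without running __init__ (e.g. via a broken
# reconstruction/unpickling path) lacks _batches, so deepcopy raises.
broken = object.__new__(Table)
try:
    copy.deepcopy(broken)
except AttributeError as e:
    print(e)  # the same kind of AttributeError as in the traceback above
```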

Steps to reproduce the bug

I'm running an MLOps flow using AzureML.
The error appears when I run the following function in my training script:

data_tokenized = data.map(partial(funcs.tokenize_function, tokenizer, seq_length),
                          batched=True,
                          batch_size=batch_size,
                          remove_columns=['col1', 'col2'])
def tokenize_function(tok, seq_length, example):
    # Pad so that each batch has the same sequence length
    inp = tok(example['col1'], padding=True, truncation=True)
    outp = tok(example['col2'], padding="max_length", max_length=seq_length)

    res = {
        'input_ids': inp['input_ids'],
        'attention_mask': inp['attention_mask'],
        'decoder_input_ids': outp['input_ids'],
        'labels': outp['input_ids'],
        'decoder_attention_mask': outp['attention_mask']
    }
    return res
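
With `batched=True`, `Dataset.map` passes the function a dict mapping column names to lists of values rather than a single example. A self-contained sketch of that calling convention, using a dummy tokenizer in place of the real `transformers` one (`dummy_tok`, the column names, and the fixed `seq_length` are illustrative assumptions):

```python
from functools import partial

def dummy_tok(texts, seq_length):
    """Hypothetical stand-in for a tokenizer call: maps each string
    to a fixed-length list of token ids plus an attention mask."""
    ids = [[min(len(t), seq_length)] * seq_length for t in texts]
    mask = [[1] * seq_length for _ in texts]
    return {'input_ids': ids, 'attention_mask': mask}

def tokenize_function(tok, seq_length, batch):
    # With batched=True, `batch` is a dict of column name -> list of values.
    inp = tok(batch['col1'], seq_length)
    outp = tok(batch['col2'], seq_length)
    return {
        'input_ids': inp['input_ids'],
        'attention_mask': inp['attention_mask'],
        'decoder_input_ids': outp['input_ids'],
        'labels': outp['input_ids'],
        'decoder_attention_mask': outp['attention_mask'],
    }

batch = {'col1': ['hello world', 'hi'], 'col2': ['bonjour', 'salut']}
fn = partial(tokenize_function, dummy_tok, 4)  # same partial() pattern as above
out = fn(batch)
print(sorted(out))  # the five keys returned for each batch
```

Binding the tokenizer and sequence length with `partial` leaves a one-argument function of the batch, which is the shape `map` expects.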

Expected behavior

Processing proceeds without errors. I ran this same workflow 2 weeks ago without a problem. I recreated the environment since then, but the datasets version doesn't appear to have changed since Dec. '23.

Environment info

datasets 2.16.1
transformers 4.35.2
pyarrow 15.0.0
pyarrow-hotfix 0.6
torch 2.0.1

I'm not using the latest transformers version because it conflicted with Azure mlflow the last time I tried.

mariosasko commented 5 months ago

Hi! Does running the following code also return the same error on your machine?

import copy
import pyarrow as pa
from datasets.table import InMemoryTable

copy.deepcopy(InMemoryTable(pa.table({"a": [1, 2, 3], "b": ["foo", "bar", "foobar"]})))
matsuobasho commented 5 months ago

No, it doesn't; it runs fine. What's really strange is that the error went away after I reran the data prep script that converts the CSV to a datasets object. I realize that's not very helpful, since the problem isn't reproducible.

mariosasko commented 4 months ago

Feel free to close the issue then :).