goodbai-nlp / AMRBART

Code for our paper "Graph Pre-training for AMR Parsing and Generation" (ACL 2022)
MIT License

'PENMANBartTokenizer' object has no attribute 'amr_bos_token_id' #9

Closed: PhMeier closed this issue 1 year ago

PhMeier commented 1 year ago

Hello, when using the script inference_amr.sh I receive the following error:

Please answer yes or no.
Global seed set to 42
Tokenizer: 53587 PreTrainedTokenizer(name_or_path='facebook/bart-large', vocab_size=53587, model_max_len=1024, is_fast=False, padding_side='right', special_tokens={'bos_token': 'Ġ<s>', 'eos_token': 'Ġ</s>', 'unk_token': 'Ġ<unk>', 'sep_token': AddedToken("</s>", rstrip=False, lstrip=False, single_word=False, normalized=True), 'pad_token': 'Ġ<pad>', 'cls_token': AddedToken("<s>", rstrip=False, lstrip=False, single_word=False, normalized=True), 'mask_token': AddedToken("<mask>", rstrip=False, lstrip=True, single_word=False, normalized=True)})
Traceback (most recent call last):
  File "/home/students/meier/MA/AMRBART/fine-tune/inference_amr.py", line 105, in <module>
    main(args)
  File "/home/students/meier/MA/AMRBART/fine-tune/inference_amr.py", line 65, in main
    data_module = AMRParsingDataModule(amr_tokenizer, **vars(args))
  File "/home/students/meier/MA/AMRBART/fine-tune/data_interface/dataset_pl.py", line 228, in __init__
    decoder_start_token_id=self.tokenizer.amr_bos_token_id,
AttributeError: 'PENMANBartTokenizer' object has no attribute 'amr_bos_token_id'

The facebook/bart-large tokenizer is used. This error is new: when I last ran the scripts six to eight weeks ago, everything worked fine.
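
For what it's worth, the failure reproduces outside the script in a few lines (a sketch, assuming PENMANBartTokenizer is importable from spring_amr.tokenization_bart as in the bundled spring package; adjust the import to your checkout):

# Standalone reproduction sketch; import path assumed, adjust as needed.
from spring_amr.tokenization_bart import PENMANBartTokenizer

tokenizer = PENMANBartTokenizer.from_pretrained("facebook/bart-large")
for attr in ("amr_bos_token", "amr_bos_token_id"):
    # getattr with a default avoids the crash and shows which attributes are missing
    print(attr, "->", getattr(tokenizer, attr, "MISSING"))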

A similar error occurs when using inference_text.sh:

Please answer yes or no.
Global seed set to 42
Tokenizer: 53587 PreTrainedTokenizer(name_or_path='facebook/bart-large', vocab_size=53587, model_max_len=1024, is_fast=False, padding_side='right', special_tokens={'bos_token': 'Ġ<s>', 'eos_token': 'Ġ</s>', 'unk_token': 'Ġ<unk>', 'sep_token': AddedToken("</s>", rstrip=False, lstrip=False, single_word=False, normalized=True), 'pad_token': 'Ġ<pad>', 'cls_token': AddedToken("<s>", rstrip=False, lstrip=False, single_word=False, normalized=True), 'mask_token': AddedToken("<mask>", rstrip=False, lstrip=True, single_word=False, normalized=True)})
Dataset cache dir: /home/students/meier/MA/AMRBART/fine-tune/../examples/.cache/
Using custom data configuration default-288dad464b8291c3
Downloading and preparing dataset amr_data/default (download: Unknown size, generated: Unknown size, post-processed: Unknown size, total: Unknown size) to /home/students/meier/MA/AMRBART/fine-tune/../examples/.cache/amr_data/default-288dad464b8291c3/1.0.0/f0dfbe4d826478b18bc1ef4db7270a419c69c4ea4c94fbf73515b13180f43059...
0 examples [00:00, ? examples/s]
Dataset amr_data downloaded and prepared to /home/students/meier/MA/AMRBART/fine-tune/../examples/.cache/amr_data/default-288dad464b8291c3/1.0.0/f0dfbe4d826478b18bc1ef4db7270a419c69c4ea4c94fbf73515b13180f43059. Subsequent calls will reuse this data.
datasets: DatasetDict({
    train: Dataset({
        features: ['src', 'tgt'],
        num_rows: 10
    })
    validation: Dataset({
        features: ['src', 'tgt'],
        num_rows: 10
    })
    test: Dataset({
        features: ['src', 'tgt'],
        num_rows: 10
    })
})
colums: ['src', 'tgt']
Setting TOKENIZERS_PARALLELISM=false for forked processes.
Parameter 'function'=<function AMR2TextDataModule.setup.<locals>.tokenize_function at 0x154ba6915280> of the transform datasets.arrow_dataset.Dataset._map_single couldn't be hashed properly, a random hash was used instead. Make sure your transforms and parameters are serializable with pickle or dill for the dataset fingerprinting and caching to work. If you reuse this transform, the caching mechanism will consider it to be different from the previous calls and recompute everything. This warning is only showed once. Subsequent hashing failures won't be showed.
 #0:   0%|          | 0/1 [00:00<?, ?ba/s]
multiprocess.pool.RemoteTraceback:
"""
Traceback (most recent call last):
  File "/home/students/meier/amrbart_venv_new/lib/python3.8/site-packages/multiprocess/pool.py", line 125, in worker
    result = (True, func(*args, **kwds))
  File "/home/students/meier/amrbart_venv_new/lib/python3.8/site-packages/datasets/arrow_dataset.py", line 185, in wrapper
    out: Union["Dataset", "DatasetDict"] = func(self, *args, **kwargs)
  File "/home/students/meier/amrbart_venv_new/lib/python3.8/site-packages/datasets/fingerprint.py", line 397, in wrapper
    out = func(self, *args, **kwargs)
  File "/home/students/meier/amrbart_venv_new/lib/python3.8/site-packages/datasets/arrow_dataset.py", line 2016, in _map_single
    batch = apply_function_on_filtered_inputs(
  File "/home/students/meier/amrbart_venv_new/lib/python3.8/site-packages/datasets/arrow_dataset.py", line 1906, in apply_function_on_filtered_inputs
    function(*fn_args, effective_indices, **fn_kwargs) if with_indices else function(*fn_args, **fn_kwargs)
  File "/home/students/meier/MA/AMRBART/fine-tune/data_interface/dataset_pl.py", line 72, in tokenize_function
    amr_tokens = [
  File "/home/students/meier/MA/AMRBART/fine-tune/data_interface/dataset_pl.py", line 74, in <listcomp>
    + [self.tokenizer.amr_bos_token]
AttributeError: 'PENMANBartTokenizer' object has no attribute 'amr_bos_token'
"""

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/students/meier/MA/AMRBART/fine-tune/run_amr2text.py", line 154, in <module>
    main(args)
  File "/home/students/meier/MA/AMRBART/fine-tune/run_amr2text.py", line 91, in main
    data_module.setup()
  File "/home/students/meier/amrbart_venv_new/lib/python3.8/site-packages/pytorch_lightning/core/datamodule.py", line 474, in wrapped_fn
    fn(*args, **kwargs)
  File "/home/students/meier/MA/AMRBART/fine-tune/data_interface/dataset_pl.py", line 117, in setup
    self.train_dataset = datasets["train"].map(
  File "/home/students/meier/amrbart_venv_new/lib/python3.8/site-packages/datasets/arrow_dataset.py", line 1744, in map
    transformed_shards = [r.get() for r in results]
  File "/home/students/meier/amrbart_venv_new/lib/python3.8/site-packages/datasets/arrow_dataset.py", line 1744, in <listcomp>
    transformed_shards = [r.get() for r in results]
  File "/home/students/meier/amrbart_venv_new/lib/python3.8/site-packages/multiprocess/pool.py", line 771, in get
    raise self._value
AttributeError: 'PENMANBartTokenizer' object has no attribute 'amr_bos_token'
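
For context, the failing list comprehension in dataset_pl.py wraps each linearized AMR graph in the tokenizer's AMR-specific boundary tokens before converting them to ids, roughly as follows (a simplified sketch with illustrative names, not the exact source):

# Simplified sketch of the code path that raises above; names are illustrative.
def wrap_graph_tokens(tokenizer, graph_tokens):
    # amr_bos_token is registered by the spring package's tokenizer;
    # with a stale spring install the attribute is absent, so this
    # access raises the AttributeError shown in the traceback.
    return [tokenizer.amr_bos_token] + graph_tokens + [tokenizer.amr_eos_token]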
Manni-Arora commented 1 year ago

Getting the same error.

goodbai-nlp commented 1 year ago

Hi @PhMeier @Manni-Arora, please try reinstalling the spring package by running:

cd spring && pip install -e .

If you still get any errors, please post them here.
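
After reinstalling, a quick way to confirm the editable install took effect (a sketch; the spring_amr module name is assumed from the package layout, adjust to your environment):

# Post-install sanity check; module and import path assumed.
import spring_amr
print(spring_amr.__file__)  # should point into the local spring checkout

from spring_amr.tokenization_bart import PENMANBartTokenizer
tok = PENMANBartTokenizer.from_pretrained("facebook/bart-large")
print(tok.amr_bos_token, tok.amr_bos_token_id)  # should print values, not raise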

PhMeier commented 1 year ago

Thank you @muyeby, it works on my side now!