goodbai-nlp / AMRBART

Code for our paper "Graph Pre-training for AMR Parsing and Generation" (ACL 2022)
MIT License

Tokenizer for AMRBART-large-finetuned-AMR3.0-AMRParsing #13

Closed HenryCai11 closed 1 year ago

HenryCai11 commented 1 year ago

I noticed that no tokenizers are offered on the Hugging Face Hub for the fine-tuned AMRBART models, whereas the v2 models come with tokenizers that have a different vocab size (v1: 53844 vs. v2: 53228). My questions are:

  1. Where can I get the tokenizers for those finetuned models?
  2. Is there a description of the tokens used in the v2 models (I found that the newly added tokens in the v2 models differ from the tokens illustrated in the paper)?
  3. Is it OK for me to use BartTokenizer to load the pretrained AMR tokenizers? (A quick check of the vocab-size gap is sketched below.)
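
To make the vocab-size gap behind these questions concrete, here is a quick check (a sketch only; it uses just the model names that already appear in this thread and loads a plain BartTokenizer for comparison):

from transformers import AutoConfig, BartTokenizer

# Config of the fine-tuned v1 parsing model vs. the stock BART-large tokenizer.
config = AutoConfig.from_pretrained("xfbai/AMRBART-large-finetuned-AMR3.0-AMRParsing")
tokenizer = BartTokenizer.from_pretrained("facebook/bart-large")

print(config.vocab_size)  # 53844, the v1 size mentioned above
print(len(tokenizer))     # 50265 for plain BART-large, i.e. the AMR tokens are missing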

Thank you!

goodbai-nlp commented 1 year ago

Hi,

Thanks for your interest!

I hope these comments help.

HenryCai11 commented 1 year ago

Thank you so much!

HenryCai11 commented 1 year ago

@goodbai-nlp Hi, sorry to bother you again. I am still not sure how I should initialize the tokenizer with AMRBartTokenizer.

from transformers import BartForConditionalGeneration, BartTokenizer, AutoConfig
from spring_amr.tokenization_bart import AMRBartTokenizer

config = AutoConfig.from_pretrained("xfbai/AMRBART-large-finetuned-AMR3.0-AMRParsing")
model = BartForConditionalGeneration.from_pretrained("xfbai/AMRBART-large-finetuned-AMR3.0-AMRParsing")
tokenizer = AMRBartTokenizer.from_pretrained("facebook/bart-large", config=config)

I tried initializing it this way. However, the length of the tokenizer did not match the vocab_size in the config. Did I miss something in the initialization? Looking forward to your reply. Thank you!

goodbai-nlp commented 1 year ago

Hi,

I assume you are trying to initialize the tokenizer for the v1 models. You can follow the code here. Additionally, there is no need to pass the config parameter when initializing our tokenizer.
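
For reference, a minimal sketch of what that advice might look like in code (an assumption on my part, not the repository's exact recipe: it presumes the v1 AMRBartTokenizer from spring_amr adds the AMR-specific tokens itself when loaded from the base BART vocabulary):

from transformers import BartForConditionalGeneration
from spring_amr.tokenization_bart import AMRBartTokenizer

model = BartForConditionalGeneration.from_pretrained(
    "xfbai/AMRBART-large-finetuned-AMR3.0-AMRParsing"
)
# No config= argument: the tokenizer builds its own enlarged AMR vocabulary.
tokenizer = AMRBartTokenizer.from_pretrained("facebook/bart-large")

# Sanity check: compare the tokenizer length with the model's vocab size.
print(len(tokenizer), model.config.vocab_size)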

HenryCai11 commented 1 year ago

Thank you!