google-research / pegasus

Apache License 2.0

mask token id == 3 during pretraining? #231

Open whaleloops opened 1 year ago

whaleloops commented 1 year ago

I noticed that kMaskWordTokenId (MASK2 in the paper) is set to 3, as defined here: https://github.com/google-research/pegasus/blob/main/pegasus/ops/pretrain_parsing_ops.cc#L69

However, the token 'a' also has id 3 in the SentencePiece vocab at "gs://t5-data/vocabs/cc_all.32000/sentencepiece.model". Do the two ids collide during pretraining?
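
For reference, here is a minimal sketch of how the overlap could be checked. The helper name `find_collisions` and the toy vocabulary are hypothetical; the toy list only mirrors the claim above ('a' at id 3) so that the snippet runs without the model file. It inspects raw SentencePiece ids only and does not account for any id remapping the Pegasus encoder may apply.

```python
from typing import Callable, Dict

# Value quoted above for kMaskWordTokenId; other reserved ids are not listed here.
RESERVED_IDS: Dict[str, int] = {"kMaskWordTokenId": 3}


def find_collisions(
    id_to_piece: Callable[[int], str], reserved: Dict[str, int]
) -> Dict[str, str]:
    """Map each reserved name to the vocab piece that sits at the same raw id."""
    return {name: id_to_piece(token_id) for name, token_id in reserved.items()}


# Hypothetical stand-in vocabulary: placeholder pieces at ids 0-2, 'a' at id 3.
toy_vocab = ["<tok0>", "<tok1>", "<tok2>", "a"]
collisions = find_collisions(lambda i: toy_vocab[i], RESERVED_IDS)
print(collisions)

# With the real model (needs the sentencepiece package and a local copy of the file):
#   import sentencepiece as spm
#   sp = spm.SentencePieceProcessor(model_file="sentencepiece.model")
#   print(find_collisions(sp.id_to_piece, RESERVED_IDS))
```

Passing `sp.id_to_piece` instead of the lambda runs the same check against the actual vocab.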

@EKebriaei