megagonlabs / ditto

Code for the paper "Deep Entity Matching with Pre-trained Language Models"
Apache License 2.0

Adding custom tokens #29

Open ajaybabu20 opened 1 year ago

ajaybabu20 commented 1 year ago

Hey guys! I enjoyed reading the paper, and thanks for open-sourcing the model.

In the paper, you mention that [COL] and [VAL] are special tokens indicating the start of attribute names and values, respectively. I took that to mean [COL] and [VAL] should be added to the tokenizer as special tokens. However, in the repo (https://github.com/megagonlabs/ditto/blob/master/ditto_light/dataset.py#L12), they are not added as special tokens to the vocabulary of the pre-trained tokenizer.
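To make the question concrete, here is a toy sketch in plain Python of the difference I have in mind. It is not the Hugging Face tokenizer and not the Ditto code: `toy_tokenize` and the sample record are made up for illustration, and the splitting rule (lowercase, then separate words from punctuation) is only my assumption about how an uncased tokenizer would treat an unregistered `[COL]`.

```python
import re


# Toy stand-in for a sub-word tokenizer (hypothetical, for illustration only).
def toy_tokenize(text, special_tokens=()):
    """Lowercase and split on punctuation, keeping registered special tokens whole."""
    if special_tokens:
        # Try longer special tokens first so overlapping ones match greedily.
        alternatives = sorted(special_tokens, key=len, reverse=True)
        pattern = "(" + "|".join(re.escape(t) for t in alternatives) + ")"
        chunks = re.split(pattern, text)
    else:
        chunks = [text]

    tokens = []
    for chunk in chunks:
        if chunk in special_tokens:
            # Registered special token: emitted as a single unit.
            tokens.append(chunk)
        else:
            # Ordinary text: words and punctuation become separate pieces.
            tokens.extend(re.findall(r"\w+|[^\w\s]", chunk.lower()))
    return tokens


record = "[COL] title [VAL] ipad mini"  # made-up serialized record

without_specials = toy_tokenize(record)
with_specials = toy_tokenize(record, special_tokens=("[COL]", "[VAL]"))

print(without_specials)  # "[COL]" is broken into "[", "col", "]"
print(with_specials)     # "[COL]" and "[VAL]" each stay a single token
```

With Hugging Face transformers, I would have expected something along the lines of `tokenizer.add_special_tokens({"additional_special_tokens": ["[COL]", "[VAL]"]})` followed by `model.resize_token_embeddings(len(tokenizer))`, but that is my guess at the intended setup, not something I found in the repo.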

Any reason why?