Hi! I see that DeBERTa-v3 uses relative-position embeddings, so it can take in a larger context than traditional BERT. Have you tried pretraining DeBERTa-v3 with a context length of 1024 or larger?
If I need to pretrain DeBERTa-v3 from scratch with a larger context length (e.g., 1024), are there any modifications I should make besides the training script? For example, would config changes along the lines of the sketch below be enough?
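This is just my guess, written against the Hugging Face `DebertaV2Config` for illustration; the config JSON consumed by this repo's pretraining scripts may use different names, and I'm unsure whether the relative-position buckets also need to grow with the context:

```python
# Minimal sketch of what I assume needs to change for a 1024-token context,
# using the Hugging Face transformers API (may differ from this repo's configs).
from transformers import DebertaV2Config, DebertaV2Model

config = DebertaV2Config(
    max_position_embeddings=1024,  # raise the position budget from the default 512
    relative_attention=True,       # DeBERTa's disentangled relative attention
    max_relative_positions=-1,     # -1 falls back to max_position_embeddings
    position_buckets=256,          # v3-style bucketed relative positions; not sure if this must also change
)
model = DebertaV2Model(config)    # would then be pretrained from scratch at seq_len=1024
```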
Thanks for any help!