Open lsy641 opened 1 year ago
I read the paper; is there any code available that showcases the algorithm?
@RubyBit Hello. I am currently organizing the code, but I can add you to our repository in advance. If you would like to join the repository, please give me your GitHub account.
Yes, that would be great (this is my GitHub account: RubyBit).
@lsy641 I am so sorry, can you resend the invite? I didn't check my mail in time.
I found the previous discussion in Multi-word segmentation #220 really interesting, and learned that your project members have experimented with segmentation beyond the word level on MT datasets without seeing significant improvement.
I think that is because the sub-word vocabulary was already trained on MT data, so there is little room to improve effectiveness by changing granularity, although increasing granularity can bring an efficiency boost. In the era of pretrained models, however, I have been rethinking how to change the granularity and compositionality of generation in downstream domains.
Recently, our work (https://arxiv.org/abs/2310.05317) provides a solution that enables a pretrained model to adopt a task-adaptive tokenizer, which supports variable segmentation optimized on the downstream data. It then allows multiple larger-granularity segmentations (still at the sub-word level) to be sampled. This brings significant improvement in both generation effectiveness and efficiency for tasks where task-specific terminology often shows up (e.g., medical, mental health). The improvement comes from two sources: 1. the gap between the pretraining vocabulary (for example, the BERT vocabulary is optimized on the GNMT benchmark, which may be suitable for MT but not for other tasks) and the downstream language style; 2. the potential of variable segmentation for efficiency.
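For clarity (this is not code from our paper, just an illustration using the stock SentencePiece API, with a placeholder model file), the kind of variable segmentation I mean is similar to what SentencePiece already exposes as subword regularization:

```python
# Illustration only: sampling variable segmentations with the standard
# SentencePiece subword-regularization API. "downstream.model" is a placeholder.
import sentencepiece as spm

sp = spm.SentencePieceProcessor(model_file="downstream.model")

text = "the patient reports persistent insomnia"

# Deterministic (single best) segmentation.
print(sp.encode(text, out_type=str))

# Sample alternative segmentations; nbest_size=-1 samples from all candidates,
# alpha controls how flat the sampling distribution over splits is.
for _ in range(3):
    print(sp.encode(text, out_type=str, enable_sampling=True,
                    alpha=0.1, nbest_size=-1))
```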
To build a task-adaptive tokenizer, I currently sew the pretraining vocabulary and the downstream vocabulary together manually, using the ProtoBuf APIs provided by sentencepiece_model_pb2.py and sentencepiece_pb2.py, and then build a new tokenizer compatible with HuggingFace. I was wondering whether your project would be interested in providing a function for researchers to easily build a task-adaptive tokenizer.
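Roughly, my manual merging step looks like the sketch below (a simplification, assuming a SentencePiece-based pretrained tokenizer; the file names, the plain-text downstream vocabulary format, and the score given to added pieces are placeholders/assumptions):

```python
# A minimal sketch of the manual "sewing" step using the sentencepiece
# ProtoBuf API. File names and the downstream vocab format are placeholders.
from sentencepiece import sentencepiece_model_pb2 as sp_model_pb2

# Load the pretrained SentencePiece model proto.
model = sp_model_pb2.ModelProto()
with open("pretrained_spiece.model", "rb") as f:
    model.ParseFromString(f.read())

existing = {p.piece for p in model.pieces}

# Downstream vocabulary: one piece per line, e.g. from a SentencePiece model
# trained on the task data (medical / mental-health text).
with open("downstream_vocab.txt", encoding="utf-8") as f:
    downstream_pieces = [line.strip() for line in f if line.strip()]

# Append pieces that the pretrained vocabulary does not already contain.
for piece in downstream_pieces:
    if piece in existing:
        continue
    new_piece = sp_model_pb2.ModelProto.SentencePiece()
    new_piece.piece = piece
    new_piece.score = 0.0  # assumption: neutral score for added pieces
    model.pieces.append(new_piece)

with open("merged_spiece.model", "wb") as f:
    f.write(model.SerializeToString())
```

The merged `.model` file can then be handed to a SentencePiece-backed HuggingFace tokenizer class (for example `T5Tokenizer(vocab_file="merged_spiece.model")`) to obtain a tokenizer the downstream model can use.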