rajicon opened this issue 1 year ago (status: Open)
word_process, which was supposed to be there, was executed only when the spelling of a word was not yet complete (you can see the check for that in line 86). As with BERT (in the line above it), the idea is to remove the space token (which does not indicate word completion but is part of the subword spelling) so that the piece can be concatenated to the following subword.
Ok, that makes sense. Is it possible to get the file still?
Glad it made sense! Unfortunately, I no longer have access to the original server I developed this code on. The best advice I can give is to run it on BERT first, see how the regular expression is applied to the BERT tokenization, and then write a similar function for the RoBERTa tokenization. It should not be a complicated function: you just need to make sure that the RoBERTa special 'space' token is removed, and return the concatenated sequence (without spaces).
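For anyone trying to rebuild the missing function, here is a minimal sketch of what such a replacement might look like. It is not the original word_process: the signature (a list of subword strings in, one string out) and the helper name `bert_join` are assumptions made for illustration. It relies only on two tokenizer conventions: BERT WordPiece marks continuation pieces with a leading `##`, and RoBERTa's byte-level BPE marks a preceding space with the character `Ġ` (U+0120).

```python
import re

# "Ġ" (U+0120) is the marker RoBERTa's byte-level BPE uses for a leading space.
ROBERTA_SPACE = "\u0120"


def bert_join(pieces):
    """Hypothetical BERT counterpart: strip the '##' continuation prefix and join."""
    return "".join(re.sub(r"^##", "", piece) for piece in pieces)


def word_process(pieces):
    """Hypothetical stand-in for the missing function (signature is assumed).

    Removes the RoBERTa space marker from each subword piece and returns
    the pieces concatenated without spaces.
    """
    return "".join(piece.replace(ROBERTA_SPACE, "") for piece in pieces)


if __name__ == "__main__":
    print(bert_join(["token", "##ization"]))     # BERT-style pieces
    print(word_process(["Ġtoken", "ization"]))   # RoBERTa-style pieces
```

Depending on how the surrounding code calls it, the real function may have taken a single string or a partially spelled word instead of a list, so the call site near line 86 should be checked before dropping this in.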
Is word_process supposed to be a file in this directory, or an external library? I can't figure out how to get it.