shiranD / word_level_evaluation

In this repo we provide code and support for word evaluation for GPT, GPT-2, BERT, and RoBERTa, following the paper:

from word_process import process #1

Open rajicon opened 1 year ago

rajicon commented 1 year ago

Is word_process supposed to be a file in this directory, or an external library? I can't figure out how to get it.

shiranD commented 1 year ago

word_process, which was supposed to be there, was basically executed when the spelling of a word was not yet complete (you can see the check for that in line 86). Similar to BERT (in the line above it), the idea is to remove the space token (which does not indicate word completion, but is rather part of the subword system's spelling) so that the piece can be concatenated to the following subword.

rajicon commented 1 year ago

Ok, that makes sense. Is it possible to get the file still?

shiranD commented 1 year ago

Glad it made sense! Unfortunately I no longer have access to the original server I developed this code on. The best advice I can give you is to run it first on BERT, see how the regular expression is applied to the BERT tokenization, and then write a similar function based on the specific RoBERTa tokenization. It should not be a complicated function. You just need to make sure that the RoBERTa special 'space' token is removed, and return the concatenated sequence (without spaces).
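
For anyone landing here later, a minimal sketch of what such a function might look like is below. It is not the original `word_process.process`, whose signature and exact regular expression are unknown; the function names, the list-of-subwords argument, and the BERT pattern are all assumptions made for illustration. It relies on two tokenizer conventions: BERT's WordPiece marks continuation pieces with a leading `##`, and RoBERTa's byte-level BPE marks a preceding space with the character `Ġ` (U+0120).

```python
import re

# RoBERTa's byte-level BPE encodes a leading space as "Ġ" (U+0120).
ROBERTA_SPACE = "\u0120"


def process_bert(subwords):
    """Hypothetical BERT analogue: strip the '##' continuation prefix and join."""
    return "".join(re.sub(r"^##", "", piece) for piece in subwords)


def process(subwords):
    """Hypothetical stand-in for the missing word_process.process.

    Removes the RoBERTa space marker from each subword and returns the
    pieces concatenated into one string, without spaces.
    """
    return "".join(re.sub(ROBERTA_SPACE, "", piece) for piece in subwords)


if __name__ == "__main__":
    print(process_bert(["token", "##ization"]))
    print(process(["\u0120token", "ization"]))
```

If the calling code passes a single string rather than a list of pieces, the same `re.sub` call can be applied to that string directly.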