jzhoubu / vsearch

An Extensible Framework for Retrieval-Augmented LLM Applications: Learning Relevance Beyond Simple Similarity.

How to get INVALID_TOKEN_IDS and VALID_TOKEN_IDS? #3

Closed Clementine24 closed 4 months ago

Clementine24 commented 5 months ago

Hello,

I noticed that you use INVALID_TOKEN_IDS in the code to pre-filter unwanted tokens out of the vocabulary. I'm very curious about how this list is generated.

In the paper, I only found a mention of "discard the unused tokens, resulting in a vocabulary V with a size of |V|=29522," but I noticed that the actual length of VALID_TOKEN_IDS in the code is only 27623. Could you provide the specific method for generating INVALID_TOKEN_IDS?

Thank you very much for your attention to this issue.

jzhoubu commented 5 months ago

Hi, @Clementine24, thank you for your interest. Below is a function that helps filter the valid tokens from the BERT vocabulary.

import re
import string

def check_valid_token(token):
    # Keep tokens made only of lowercase letters, digits, and punctuation,
    # excluding bracketed special tokens such as [CLS], [SEP], and [unused0].
    punctuation_escaped = re.escape(string.punctuation)
    pattern = f"[a-z0-9{punctuation_escaped}]*"
    return bool(re.fullmatch(pattern, token)) and not (token.startswith('[') and token.endswith(']'))
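
For reference, here is a minimal sketch of how this check could be applied over the full vocabulary to produce VALID_TOKEN_IDS and INVALID_TOKEN_IDS. The bert-base-uncased checkpoint is an assumption on my part, not a confirmed detail of the repository:

from transformers import AutoTokenizer

# Assumption: the vocabulary comes from the standard bert-base-uncased checkpoint.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
vocab = tokenizer.get_vocab()  # dict mapping token string -> token id

# Partition the vocabulary using the filter above.
VALID_TOKEN_IDS = sorted(idx for token, idx in vocab.items() if check_valid_token(token))
INVALID_TOKEN_IDS = sorted(idx for token, idx in vocab.items() if not check_valid_token(token))

print(len(VALID_TOKEN_IDS), len(INVALID_TOKEN_IDS))

The two lists partition the full BERT vocabulary, so every token ID falls in exactly one of them.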