noamgat / lm-format-enforcer

Enforce the output format (JSON Schema, Regex etc) of a language model
MIT License
994 stars · 45 forks

Question about Source Code #117

Open Acatsama0871 opened 4 days ago

Acatsama0871 commented 4 days ago

Hello,

I would first like to thank you for open-sourcing such a well-designed, high-quality code base.

I am reading the source code, and I have a question about this part (integrations.transformers.py):

def _build_regular_tokens_list(tokenizer: PreTrainedTokenizerBase) -> List[Tuple[int, str, bool]]:
    token_0 = tokenizer.encode("0")[-1]
    regular_tokens = []
    for token_idx in range(len(tokenizer)):
        if token_idx in tokenizer.all_special_ids:
            continue
        # We prepend token 0 and skip the first letter of the result to get a space if the token is a start word.
        decoded_after_0 = tokenizer.decode([token_0, token_idx])[1:]
        decoded_regular = tokenizer.decode([token_idx])
        is_word_start_token = len(decoded_after_0) > len(decoded_regular)
        regular_tokens.append((token_idx, decoded_after_0, is_word_start_token))
    return regular_tokens

Why is the same token ID decoded twice here? And what does "word start" mean in this context? Thanks!

noamgat commented 3 days ago

We decode the token ID twice, once on its own and once after the representation of "0", to determine whether it is a word-start token. A word-start token yields an extra character (the leading space) when decoded after "0" compared to when it is decoded alone, and that is what the length comparison checks. Some tokenizers expose this information directly, but this method is tokenizer-agnostic.
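To make the trick concrete, here is a minimal sketch with a hypothetical SentencePiece-style toy tokenizer (the `ToyTokenizer` class and its vocabulary are invented for illustration, not part of lm-format-enforcer). Such tokenizers mark word-start tokens internally (e.g. with "▁") and strip the resulting leading space only at the start of a sequence, so prepending "0" preserves the space and makes the decoded text one character longer:

```python
from typing import List


class ToyTokenizer:
    """Hypothetical SentencePiece-style tokenizer: '▁' marks word starts."""
    vocab = {0: "0", 1: "▁hello", 2: "lo"}

    def decode(self, ids: List[int]) -> str:
        text = "".join(self.vocab[i] for i in ids).replace("▁", " ")
        # Like SentencePiece, drop the leading space at sequence start.
        return text[1:] if text.startswith(" ") else text


def is_word_start(tokenizer: ToyTokenizer, token_idx: int, token_0: int = 0) -> bool:
    # Prepend token "0" and skip the first character ("0") of the result;
    # a word-start token keeps its leading space this way, so its decoded
    # form comes out one character longer than when decoded alone.
    decoded_after_0 = tokenizer.decode([token_0, token_idx])[1:]
    decoded_regular = tokenizer.decode([token_idx])
    return len(decoded_after_0) > len(decoded_regular)


tok = ToyTokenizer()
print(is_word_start(tok, 1))  # True:  "▁hello" decodes to " hello" after "0"
print(is_word_start(tok, 2))  # False: "lo" decodes identically either way
```

The same length comparison is what `_build_regular_tokens_list` performs for every non-special token in the real vocabulary.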