Pretty-print tokens in `llm_attr` methods

craymichael commented 2 months ago

Summary: Convert ids to tokens without ugly unicode characters (e.g., Ġ). See: https://github.com/huggingface/transformers/issues/4786 and https://discuss.huggingface.co/t/bpe-tokenizers-and-spaces-before-words/475/2

The new function is preferred over `tokenizer.convert_ids_to_tokens()` for user-facing data.
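A minimal sketch of the idea, not necessarily how Captum implements it: decoding each id on its own gives readable text, while `convert_ids_to_tokens()` returns the raw vocabulary entries. `ToyTokenizer` and `ids_to_pretty_tokens` are hypothetical names used only for illustration; the stub imitates just the `decode` and `convert_ids_to_tokens` methods of a Hugging Face tokenizer.

```python
from typing import List, Sequence


class ToyTokenizer:
    """Hypothetical stand-in for a byte-level BPE tokenizer (not a real library class)."""

    # Raw vocabulary entries; "Ġ" marks a leading space, as in GPT-2 style vocabularies.
    vocab = ["Hello", "Ġworld", "!"]

    def convert_ids_to_tokens(self, ids: Sequence[int]) -> List[str]:
        # Return the vocabulary entries unchanged, marker characters included.
        return [self.vocab[i] for i in ids]

    def decode(self, ids: Sequence[int]) -> str:
        # Join the raw tokens and turn the space marker back into a real space.
        return "".join(self.vocab[i] for i in ids).replace("Ġ", " ")


def ids_to_pretty_tokens(ids: Sequence[int], tokenizer) -> List[str]:
    """Decode each id separately so every token is human-readable."""
    return [tokenizer.decode([i]) for i in ids]


tok = ToyTokenizer()
raw = tok.convert_ids_to_tokens([0, 1, 2])
pretty = ids_to_pretty_tokens([0, 1, 2], tok)
print(raw)     # ['Hello', 'Ġworld', '!']
print(pretty)  # ['Hello', ' world', '!']
```

One caveat with per-id decoding: when a multi-byte character is split across several tokens, decoding the pieces individually can produce replacement characters, so a real implementation may need extra handling for that case.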

Quote from links:

> Spaces are converted in a special character (the Ġ) in the tokenizer prior to BPE splitting mostly to avoid digesting spaces since the standard BPE algorithm used spaces in its process
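To illustrate where the Ġ comes from, here is a standalone sketch that assumes a GPT-2 style byte-to-unicode table: printable bytes keep their own character, and the remaining bytes (space, control characters) are shifted past U+00FF. The function names are illustrative and this is not the code added by this PR.

```python
from typing import Dict


def bytes_to_unicode() -> Dict[int, str]:
    """Map each of the 256 byte values to a printable unicode character."""
    # Bytes that are already printable keep their own character.
    printable = (
        list(range(ord("!"), ord("~") + 1))
        + list(range(ord("¡"), ord("¬") + 1))
        + list(range(ord("®"), ord("ÿ") + 1))
    )
    mapping = {b: chr(b) for b in printable}
    offset = 0
    for b in range(256):
        if b not in mapping:
            # Remaining bytes (space, control characters, ...) move past U+00FF.
            mapping[b] = chr(256 + offset)
            offset += 1
    return mapping


BYTE_TO_CHAR = bytes_to_unicode()
CHAR_TO_BYTE = {c: b for b, c in BYTE_TO_CHAR.items()}


def prettify(token: str) -> str:
    """Invert the mapping for one token and decode the bytes as UTF-8."""
    raw = bytes(CHAR_TO_BYTE[ch] for ch in token)
    return raw.decode("utf-8", errors="replace")


print(repr(BYTE_TO_CHAR[ord(" ")]))  # 'Ġ'
print(repr(prettify("Ġworld")))      # ' world'
```

Under this assumption the space byte (0x20) is the 33rd non-printable byte, so it lands on U+0120, which is Ġ; a newline lands on Ċ by the same rule.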

Differential Revision: D62672912

facebook-github-bot commented 2 months ago

This pull request was exported from Phabricator. Differential Revision: D62672912

facebook-github-bot commented 2 months ago

This pull request has been merged in pytorch/captum@6636f4da9a4bd1606111cc6118ba6d1043b202f6.

Pretty-print tokens in `llm_attr` methods #1348