pytorch / captum

Model interpretability and understanding for PyTorch
https://captum.ai
BSD 3-Clause "New" or "Revised" License

Pretty-print tokens in `llm_attr` methods #1348

Closed craymichael closed 2 months ago

craymichael commented 2 months ago

Summary: Convert token ids to tokens without the placeholder Unicode characters (e.g., Ġ) that byte-level BPE tokenizers use internally. See: https://github.com/huggingface/transformers/issues/4786 and https://discuss.huggingface.co/t/bpe-tokenizers-and-spaces-before-words/475/2

For user-facing data, prefer this function over `tokenizer.convert_ids_to_tokens()`.

Quote from links:

> Spaces are converted in a special character (the Ġ) in the tokenizer prior to BPE splitting mostly to avoid digesting spaces since the standard BPE algorithm used spaces in its process
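
To illustrate where Ġ comes from, here is a minimal, self-contained sketch of the byte-to-character table used by GPT-2 style byte-level BPE, together with its inverse. It is not the code from this PR: `prettify_token` and `prettify_tokens` are hypothetical helper names, and the sketch assumes the tokenizer follows the GPT-2 byte mapping.

```python
from typing import Dict, List


def bytes_to_unicode() -> Dict[int, str]:
    """Byte -> printable character table of GPT-2 style byte-level BPE."""
    # Bytes that are already printable map to themselves:
    # '!'..'~', U+00A1..U+00AC and U+00AE..U+00FF.
    printable = (
        list(range(0x21, 0x7F))
        + list(range(0xA1, 0xAD))
        + list(range(0xAE, 0x100))
    )
    table = {b: chr(b) for b in printable}
    # Every other byte (space, newline, control bytes, ...) is shifted to
    # code points starting at 256, in increasing byte order.
    offset = 0
    for b in range(256):
        if b not in table:
            table[b] = chr(256 + offset)
            offset += 1
    return table


# Inverse table: placeholder character -> original byte.
_UNICODE_TO_BYTE = {ch: b for b, ch in bytes_to_unicode().items()}


def prettify_token(token: str) -> str:
    """Hypothetical helper: undo the byte-level mapping for one token string."""
    try:
        raw = bytes(_UNICODE_TO_BYTE[ch] for ch in token)
    except KeyError:
        # Assumption: a token with characters outside the table is not a
        # byte-level BPE token, so it is returned unchanged.
        return token
    # A single token may hold only part of a multi-byte UTF-8 sequence,
    # so invalid bytes are replaced instead of raising.
    return raw.decode("utf-8", errors="replace")


def prettify_tokens(tokens: List[str]) -> List[str]:
    """Hypothetical helper: prettify a list of raw token strings."""
    return [prettify_token(t) for t in tokens]


if __name__ == "__main__":
    # Byte 0x20 (space) is the 33rd unmapped byte, so it lands on U+0120.
    print(repr(bytes_to_unicode()[0x20]))
    print(prettify_tokens(["Hello", "\u0120world"]))
```

With a Hugging Face tokenizer, decoding each id on its own (for example with `tokenizer.decode([i])`) should give similar display strings without reimplementing the table. Whether the merged change does exactly that is not stated in this description.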

Differential Revision: D62672912

facebook-github-bot commented 2 months ago

This pull request was exported from Phabricator. Differential Revision: D62672912

facebook-github-bot commented 2 months ago

This pull request has been merged in pytorch/captum@6636f4da9a4bd1606111cc6118ba6d1043b202f6.