pytorch / captum

Model interpretability and understanding for PyTorch
https://captum.ai
BSD 3-Clause "New" or "Revised" License

LLM tokenizer pretty decoding fix for emojis/unicode #1360

Closed: craymichael closed this 3 days ago

craymichael commented 4 days ago

Summary: Emojis are not well handled by the current decoding logic (see the example in the test plan). What happens is that emojis/unicode characters are tokenized as two symbols: one appears to indicate extended unicode (or perhaps a category of unicode, e.g., emoji), and the second encodes the specific character (e.g., smiley face, omega, ...). This fix assumes that such tokens always come in pairs and that the tokenizer returns the replacement character "�" when a symbol cannot be decoded on its own (we verify that "�" was not the intended symbol by running it back through the tokenizer). This logic will break down if a symbol is split across three or more tokens.

Example:
Input string: 😂
Output token IDs: list of length 2
Pretty decoded tokens: ['😂[1/2]', '😂[2/2]']
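A rough sketch of the pairing heuristic described above (a minimal illustration, not the Captum implementation; it assumes a Hugging Face byte-level BPE tokenizer such as GPT-2's, and the helper name `pretty_decode` is made up for this example):

```python
# Hedged sketch only: group token ids that individually decode to U+FFFD and
# label each piece as "<char>[i/n]", as in the example above.
from transformers import AutoTokenizer

REPLACEMENT = "\ufffd"  # "�", returned when a partial UTF-8 sequence is decoded

def pretty_decode(token_ids, tokenizer):
    """Decode ids one by one; when a token alone decodes to U+FFFD but was not
    literally U+FFFD in the input, merge it with its neighbor(s) and label each
    piece so every token id still gets its own display string."""
    pieces = [tokenizer.decode([tid]) for tid in token_ids]
    out = []
    i = 0
    while i < len(pieces):
        # A genuine U+FFFD in the input would round-trip through the tokenizer;
        # a partial multi-byte character would not.
        is_partial = (
            REPLACEMENT in pieces[i]
            and tokenizer.encode(pieces[i], add_special_tokens=False) != [token_ids[i]]
        )
        if not is_partial:
            out.append(pieces[i])
            i += 1
            continue
        # Greedily extend the group until the combined ids decode cleanly.
        j = i + 1
        combined = pieces[i]
        while REPLACEMENT in combined and j < len(token_ids):
            j += 1
            combined = tokenizer.decode(token_ids[i:j])
        n = j - i
        if n == 1:
            out.append(pieces[i])  # could not complete the character; keep as-is
        else:
            out.extend(f"{combined}[{k + 1}/{n}]" for k in range(n))
        i = j
    return out

tok = AutoTokenizer.from_pretrained("gpt2")
ids = tok.encode("😂")          # typically two ids for this emoji
print(pretty_decode(ids, tok))  # e.g. ['😂[1/2]', '😂[2/2]']
```

The [i/n] labels keep a one-to-one mapping between display strings and token ids, which is what the attribution code needs.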

Note that we cannot simply output a single token here, because we will be providing attributions for each of the token IDs. For attribution, all such split tokens should really be grouped together so that the inputs are valid and the attributions make sense.

Differential Revision: D63435671

facebook-github-bot commented 4 days ago

This pull request was exported from Phabricator. Differential Revision: D63435671

facebook-github-bot commented 3 days ago

This pull request has been merged in pytorch/captum@bacac27bf384f03eaa21f2d77042593cb4c1b7f5.