pytorch / captum

Model interpretability and understanding for PyTorch
https://captum.ai
BSD 3-Clause "New" or "Revised" License

LLM tokenizer pretty decoding fix for emojis/unicode #1360

Closed: craymichael closed this 3 days ago

craymichael commented 4 days ago

Summary: Emojis are not well handled by the current decoding logic (see the example in the test plan). What happens is that emojis/unicode characters are tokenized as two symbols: one appears to indicate extended unicode (or perhaps a category of unicode, e.g., emoji), and the second encodes the specific character (e.g., smiley face, omega, ...). This fix assumes that such tokens always come in pairs and that the tokenizer returns the replacement character "�" when a symbol cannot be decoded on its own (we verify that "�" was not the intended symbol by running it back through the tokenizer). This logic will break down if a symbol is split across three or more tokens.

Example:
Input string: 😂
Output token IDs: list of length 2
Pretty decoded tokens: ['😂[1/2]', '😂[2/2]']
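A rough sketch of the pairing heuristic described above (a minimal illustration, not the Captum implementation; it assumes a Hugging Face byte-level BPE tokenizer such as GPT-2's, and the helper name `pretty_decode` is made up for this example):

```python
# Hedged sketch only: group token ids that individually decode to U+FFFD and
# label each piece as "<char>[i/n]", as in the example above.
from transformers import AutoTokenizer

REPLACEMENT = "\ufffd"  # "�", returned when a partial UTF-8 sequence is decoded

def pretty_decode(token_ids, tokenizer):
    """Decode ids one by one; when a token alone decodes to U+FFFD but was not
    literally U+FFFD in the input, merge it with its neighbor(s) and label each
    piece so every token id still gets its own display string."""
    pieces = [tokenizer.decode([tid]) for tid in token_ids]
    out = []
    i = 0
    while i < len(pieces):
        # A genuine U+FFFD in the input would round-trip through the tokenizer;
        # a partial multi-byte character would not.
        is_partial = (
            REPLACEMENT in pieces[i]
            and tokenizer.encode(pieces[i], add_special_tokens=False) != [token_ids[i]]
        )
        if not is_partial:
            out.append(pieces[i])
            i += 1
            continue
        # Greedily extend the group until the combined ids decode cleanly.
        j = i + 1
        combined = pieces[i]
        while REPLACEMENT in combined and j < len(token_ids):
            j += 1
            combined = tokenizer.decode(token_ids[i:j])
        n = j - i
        if n == 1:
            out.append(pieces[i])  # could not complete the character; keep as-is
        else:
            out.extend(f"{combined}[{k + 1}/{n}]" for k in range(n))
        i = j
    return out

tok = AutoTokenizer.from_pretrained("gpt2")
ids = tok.encode("😂")          # typically two ids for this emoji
print(pretty_decode(ids, tok))  # e.g. ['😂[1/2]', '😂[2/2]']
```

The [i/n] labels keep a one-to-one mapping between display strings and token ids, which is what the attribution code needs.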

Note that we cannot simply output a single token here, because we will be providing attributions for each of the token IDs. For attribution, all such split tokens should really be grouped together so that the inputs are valid and the attributions make sense.

Differential Revision: D63435671

facebook-github-bot commented 4 days ago

This pull request was exported from Phabricator. Differential Revision: D63435671

facebook-github-bot commented 3 days ago

This pull request has been merged in pytorch/captum@bacac27bf384f03eaa21f2d77042593cb4c1b7f5.