huggingface / tokenizers

💥 Fast State-of-the-Art Tokenizers optimized for Research and Production
https://huggingface.co/docs/tokenizers
Apache License 2.0

Fix: fixing the inconsistency in byte-level tokenization when using pre_tokenizer.sequence. #1394

Closed junrae6454 closed 7 months ago

junrae6454 commented 7 months ago

Description

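from tokenizers import pre_tokenizers
from tokenizers.pre_tokenizers import ByteLevel, Digits

# Digits first, then ByteLevel: this order works
tokenizer.pre_tokenizer = pre_tokenizers.Sequence([Digits(individual_digits=True), ByteLevel(add_prefix_space=False)])

# ByteLevel first, then Digits: this order does not work
tokenizer.pre_tokenizer = pre_tokenizers.Sequence([ByteLevel(add_prefix_space=False), Digits(individual_digits=True)])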

Splitting numbers into individual digits can improve the arithmetic ability of GPT-style models such as ChatGPT. With a pre_tokenizer like the second one above, however, the digit split can go wrong.

With pre_tokenizers.Sequence, the result of byte-level tokenization depended on the order of ByteLevel and Digits. Some of the characters that ByteLevel uses to represent bytes are themselves classified as digits, so Digits splits on them when it runs second. This PR resolves the issue by removing those characters from the set used by ByteLevel.
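The order dependence described above can be reproduced without the library. The following is a minimal pure-Python sketch, not the library's implementation: it assumes the byte table follows the GPT-2 layout shown in the "Before" listing, and that the digit check behaves like Python's str.isnumeric(). The helper names (bytes_to_unicode, byte_level, split_digits) are illustrative only, and the real ByteLevel pre-tokenizer does more than this (for example, regex splitting).

```python
def bytes_to_unicode():
    """GPT-2 style table: every byte gets a printable stand-in character."""
    # Bytes in these ranges keep their own Latin-1 character.
    keep = set(range(0x21, 0x7F)) | set(range(0xA1, 0xAD)) | set(range(0xAE, 0x100))
    table, shift = {}, 0
    for b in range(256):
        if b in keep:
            table[b] = chr(b)
        else:
            # Every other byte is shifted to U+0100 and upwards.
            table[b] = chr(256 + shift)
            shift += 1
    return table


TABLE = bytes_to_unicode()


def byte_level(piece):
    """Simplified stand-in for ByteLevel: map each UTF-8 byte to its character."""
    return "".join(TABLE[b] for b in piece.encode("utf-8"))


def split_digits(piece):
    """Simplified stand-in for Digits(individual_digits=True)."""
    out, run = [], ""
    for ch in piece:
        if ch.isnumeric():
            if run:
                out.append(run)
                run = ""
            out.append(ch)  # each numeric character becomes its own piece
        else:
            run += ch
    if run:
        out.append(run)
    return out


text = "ò"  # UTF-8 bytes C3 B2; byte B2 is represented by '²'

digits_then_bytes = [byte_level(p) for p in split_digits(text)]
bytes_then_digits = split_digits(byte_level(text))

print(digits_then_bytes)  # ['Ã²']
print(bytes_then_digits)  # ['Ã', '²']
```

In this sketch the input contains no digit at all, yet the second order splits the two bytes of one character apart, because the stand-in for byte 0xB2 is itself numeric.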

Changes

vocab changes

Before

```python
# Before
vocab -> Representing byte -> ascii / is_number
Ā -> 0 -> '\x00' / False
ā -> 1 -> '\x01' / False
Ă -> 2 -> '\x02' / False
ă -> 3 -> '\x03' / False
Ą -> 4 -> '\x04' / False
ą -> 5 -> '\x05' / False
Ć -> 6 -> '\x06' / False
ć -> 7 -> '\x07' / False
Ĉ -> 8 -> '\x08' / False
ĉ -> 9 -> '\t' / False
Ċ -> 10 -> '\n' / False
ċ -> 11 -> '\x0b' / False
Č -> 12 -> '\x0c' / False
č -> 13 -> '\r' / False
Ď -> 14 -> '\x0e' / False
ď -> 15 -> '\x0f' / False
Đ -> 16 -> '\x10' / False
đ -> 17 -> '\x11' / False
Ē -> 18 -> '\x12' / False
ē -> 19 -> '\x13' / False
Ĕ -> 20 -> '\x14' / False
ĕ -> 21 -> '\x15' / False
Ė -> 22 -> '\x16' / False
ė -> 23 -> '\x17' / False
Ę -> 24 -> '\x18' / False
ę -> 25 -> '\x19' / False
Ě -> 26 -> '\x1a' / False
ě -> 27 -> '\x1b' / False
Ĝ -> 28 -> '\x1c' / False
ĝ -> 29 -> '\x1d' / False
Ğ -> 30 -> '\x1e' / False
ğ -> 31 -> '\x1f' / False
Ġ -> 32 -> ' ' / False
! -> 33 -> '!' / False
" -> 34 -> '"' / False
# -> 35 -> '#' / False
$ -> 36 -> '$' / False
% -> 37 -> '%' / False
& -> 38 -> '&' / False
' -> 39 -> "'" / False
( -> 40 -> '(' / False
) -> 41 -> ')' / False
* -> 42 -> '*' / False
+ -> 43 -> '+' / False
, -> 44 -> ',' / False
- -> 45 -> '-' / False
. -> 46 -> '.' / False
/ -> 47 -> '/' / False
0 -> 48 -> '0' / True
1 -> 49 -> '1' / True
2 -> 50 -> '2' / True
3 -> 51 -> '3' / True
4 -> 52 -> '4' / True
5 -> 53 -> '5' / True
6 -> 54 -> '6' / True
7 -> 55 -> '7' / True
8 -> 56 -> '8' / True
9 -> 57 -> '9' / True
: -> 58 -> ':' / False
; -> 59 -> ';' / False
< -> 60 -> '<' / False
= -> 61 -> '=' / False
> -> 62 -> '>' / False
? -> 63 -> '?' / False
@ -> 64 -> '@' / False
A -> 65 -> 'A' / False
B -> 66 -> 'B' / False
C -> 67 -> 'C' / False
D -> 68 -> 'D' / False
E -> 69 -> 'E' / False
F -> 70 -> 'F' / False
G -> 71 -> 'G' / False
H -> 72 -> 'H' / False
I -> 73 -> 'I' / False
J -> 74 -> 'J' / False
K -> 75 -> 'K' / False
L -> 76 -> 'L' / False
M -> 77 -> 'M' / False
N -> 78 -> 'N' / False
O -> 79 -> 'O' / False
P -> 80 -> 'P' / False
Q -> 81 -> 'Q' / False
R -> 82 -> 'R' / False
S -> 83 -> 'S' / False
T -> 84 -> 'T' / False
U -> 85 -> 'U' / False
V -> 86 -> 'V' / False
W -> 87 -> 'W' / False
X -> 88 -> 'X' / False
Y -> 89 -> 'Y' / False
Z -> 90 -> 'Z' / False
[ -> 91 -> '[' / False
\ -> 92 -> '\\' / False
] -> 93 -> ']' / False
^ -> 94 -> '^' / False
_ -> 95 -> '_' / False
` -> 96 -> '`' / False
a -> 97 -> 'a' / False
b -> 98 -> 'b' / False
c -> 99 -> 'c' / False
d -> 100 -> 'd' / False
e -> 101 -> 'e' / False
f -> 102 -> 'f' / False
g -> 103 -> 'g' / False
h -> 104 -> 'h' / False
i -> 105 -> 'i' / False
j -> 106 -> 'j' / False
k -> 107 -> 'k' / False
l -> 108 -> 'l' / False
m -> 109 -> 'm' / False
n -> 110 -> 'n' / False
o -> 111 -> 'o' / False
p -> 112 -> 'p' / False
q -> 113 -> 'q' / False
r -> 114 -> 'r' / False
s -> 115 -> 's' / False
t -> 116 -> 't' / False
u -> 117 -> 'u' / False
v -> 118 -> 'v' / False
w -> 119 -> 'w' / False
x -> 120 -> 'x' / False
y -> 121 -> 'y' / False
z -> 122 -> 'z' / False
{ -> 123 -> '{' / False
| -> 124 -> '|' / False
} -> 125 -> '}' / False
~ -> 126 -> '~' / False
ġ -> 127 -> '\x7f' / False
Ģ -> 128 -> '\x80' / False
ģ -> 129 -> '\x81' / False
Ĥ -> 130 -> '\x82' / False
ĥ -> 131 -> '\x83' / False
Ħ -> 132 -> '\x84' / False
ħ -> 133 -> '\x85' / False
Ĩ -> 134 -> '\x86' / False
ĩ -> 135 -> '\x87' / False
Ī -> 136 -> '\x88' / False
ī -> 137 -> '\x89' / False
Ĭ -> 138 -> '\x8a' / False
ĭ -> 139 -> '\x8b' / False
Į -> 140 -> '\x8c' / False
į -> 141 -> '\x8d' / False
İ -> 142 -> '\x8e' / False
ı -> 143 -> '\x8f' / False
Ĳ -> 144 -> '\x90' / False
ĳ -> 145 -> '\x91' / False
Ĵ -> 146 -> '\x92' / False
ĵ -> 147 -> '\x93' / False
Ķ -> 148 -> '\x94' / False
ķ -> 149 -> '\x95' / False
ĸ -> 150 -> '\x96' / False
Ĺ -> 151 -> '\x97' / False
ĺ -> 152 -> '\x98' / False
Ļ -> 153 -> '\x99' / False
ļ -> 154 -> '\x9a' / False
Ľ -> 155 -> '\x9b' / False
ľ -> 156 -> '\x9c' / False
Ŀ -> 157 -> '\x9d' / False
ŀ -> 158 -> '\x9e' / False
Ł -> 159 -> '\x9f' / False
ł -> 160 -> '\xa0' / False
¡ -> 161 -> '¡' / False
¢ -> 162 -> '¢' / False
£ -> 163 -> '£' / False
¤ -> 164 -> '¤' / False
¥ -> 165 -> '¥' / False
¦ -> 166 -> '¦' / False
§ -> 167 -> '§' / False
¨ -> 168 -> '¨' / False
© -> 169 -> '©' / False
ª -> 170 -> 'ª' / False
« -> 171 -> '«' / False
¬ -> 172 -> '¬' / False
Ń -> 173 -> '\xad' / False
® -> 174 -> '®' / False
¯ -> 175 -> '¯' / False
° -> 176 -> '°' / False
± -> 177 -> '±' / False
² -> 178 -> '²' / True
³ -> 179 -> '³' / True
´ -> 180 -> '´' / False
µ -> 181 -> 'µ' / False
¶ -> 182 -> '¶' / False
· -> 183 -> '·' / False
¸ -> 184 -> '¸' / False
¹ -> 185 -> '¹' / True
º -> 186 -> 'º' / False
» -> 187 -> '»' / False
¼ -> 188 -> '¼' / True
½ -> 189 -> '½' / True
¾ -> 190 -> '¾' / True
¿ -> 191 -> '¿' / False
À -> 192 -> 'À' / False
Á -> 193 -> 'Á' / False
Â -> 194 -> 'Â' / False
Ã -> 195 -> 'Ã' / False
Ä -> 196 -> 'Ä' / False
Å -> 197 -> 'Å' / False
Æ -> 198 -> 'Æ' / False
Ç -> 199 -> 'Ç' / False
È -> 200 -> 'È' / False
É -> 201 -> 'É' / False
Ê -> 202 -> 'Ê' / False
Ë -> 203 -> 'Ë' / False
Ì -> 204 -> 'Ì' / False
Í -> 205 -> 'Í' / False
Î -> 206 -> 'Î' / False
Ï -> 207 -> 'Ï' / False
Ð -> 208 -> 'Ð' / False
Ñ -> 209 -> 'Ñ' / False
Ò -> 210 -> 'Ò' / False
Ó -> 211 -> 'Ó' / False
Ô -> 212 -> 'Ô' / False
Õ -> 213 -> 'Õ' / False
Ö -> 214 -> 'Ö' / False
× -> 215 -> '×' / False
Ø -> 216 -> 'Ø' / False
Ù -> 217 -> 'Ù' / False
Ú -> 218 -> 'Ú' / False
Û -> 219 -> 'Û' / False
Ü -> 220 -> 'Ü' / False
Ý -> 221 -> 'Ý' / False
Þ -> 222 -> 'Þ' / False
ß -> 223 -> 'ß' / False
à -> 224 -> 'à' / False
á -> 225 -> 'á' / False
â -> 226 -> 'â' / False
ã -> 227 -> 'ã' / False
ä -> 228 -> 'ä' / False
å -> 229 -> 'å' / False
æ -> 230 -> 'æ' / False
ç -> 231 -> 'ç' / False
è -> 232 -> 'è' / False
é -> 233 -> 'é' / False
ê -> 234 -> 'ê' / False
ë -> 235 -> 'ë' / False
ì -> 236 -> 'ì' / False
í -> 237 -> 'í' / False
î -> 238 -> 'î' / False
ï -> 239 -> 'ï' / False
ð -> 240 -> 'ð' / False
ñ -> 241 -> 'ñ' / False
ò -> 242 -> 'ò' / False
ó -> 243 -> 'ó' / False
ô -> 244 -> 'ô' / False
õ -> 245 -> 'õ' / False
ö -> 246 -> 'ö' / False
÷ -> 247 -> '÷' / False
ø -> 248 -> 'ø' / False
ù -> 249 -> 'ù' / False
ú -> 250 -> 'ú' / False
û -> 251 -> 'û' / False
ü -> 252 -> 'ü' / False
ý -> 253 -> 'ý' / False
þ -> 254 -> 'þ' / False
ÿ -> 255 -> 'ÿ' / False
```

After

```python
# After
vocab -> Representing byte -> ascii / is_number
Ā -> 0 -> '\x00' / False
ā -> 1 -> '\x01' / False
Ă -> 2 -> '\x02' / False
ă -> 3 -> '\x03' / False
Ą -> 4 -> '\x04' / False
ą -> 5 -> '\x05' / False
Ć -> 6 -> '\x06' / False
ć -> 7 -> '\x07' / False
Ĉ -> 8 -> '\x08' / False
ĉ -> 9 -> '\t' / False
Ċ -> 10 -> '\n' / False
ċ -> 11 -> '\x0b' / False
Č -> 12 -> '\x0c' / False
č -> 13 -> '\r' / False
Ď -> 14 -> '\x0e' / False
ď -> 15 -> '\x0f' / False
Đ -> 16 -> '\x10' / False
đ -> 17 -> '\x11' / False
Ē -> 18 -> '\x12' / False
ē -> 19 -> '\x13' / False
Ĕ -> 20 -> '\x14' / False
ĕ -> 21 -> '\x15' / False
Ė -> 22 -> '\x16' / False
ė -> 23 -> '\x17' / False
Ę -> 24 -> '\x18' / False
ę -> 25 -> '\x19' / False
Ě -> 26 -> '\x1a' / False
ě -> 27 -> '\x1b' / False
Ĝ -> 28 -> '\x1c' / False
ĝ -> 29 -> '\x1d' / False
Ğ -> 30 -> '\x1e' / False
ğ -> 31 -> '\x1f' / False
Ġ -> 32 -> ' ' / False
! -> 33 -> '!' / False
" -> 34 -> '"' / False
# -> 35 -> '#' / False
$ -> 36 -> '$' / False
% -> 37 -> '%' / False
& -> 38 -> '&' / False
' -> 39 -> "'" / False
( -> 40 -> '(' / False
) -> 41 -> ')' / False
* -> 42 -> '*' / False
+ -> 43 -> '+' / False
, -> 44 -> ',' / False
- -> 45 -> '-' / False
. -> 46 -> '.' / False
/ -> 47 -> '/' / False
0 -> 48 -> '0' / True
1 -> 49 -> '1' / True
2 -> 50 -> '2' / True
3 -> 51 -> '3' / True
4 -> 52 -> '4' / True
5 -> 53 -> '5' / True
6 -> 54 -> '6' / True
7 -> 55 -> '7' / True
8 -> 56 -> '8' / True
9 -> 57 -> '9' / True
: -> 58 -> ':' / False
; -> 59 -> ';' / False
< -> 60 -> '<' / False
= -> 61 -> '=' / False
> -> 62 -> '>' / False
? -> 63 -> '?' / False
@ -> 64 -> '@' / False
A -> 65 -> 'A' / False
B -> 66 -> 'B' / False
C -> 67 -> 'C' / False
D -> 68 -> 'D' / False
E -> 69 -> 'E' / False
F -> 70 -> 'F' / False
G -> 71 -> 'G' / False
H -> 72 -> 'H' / False
I -> 73 -> 'I' / False
J -> 74 -> 'J' / False
K -> 75 -> 'K' / False
L -> 76 -> 'L' / False
M -> 77 -> 'M' / False
N -> 78 -> 'N' / False
O -> 79 -> 'O' / False
P -> 80 -> 'P' / False
Q -> 81 -> 'Q' / False
R -> 82 -> 'R' / False
S -> 83 -> 'S' / False
T -> 84 -> 'T' / False
U -> 85 -> 'U' / False
V -> 86 -> 'V' / False
W -> 87 -> 'W' / False
X -> 88 -> 'X' / False
Y -> 89 -> 'Y' / False
Z -> 90 -> 'Z' / False
[ -> 91 -> '[' / False
\ -> 92 -> '\\' / False
] -> 93 -> ']' / False
^ -> 94 -> '^' / False
_ -> 95 -> '_' / False
` -> 96 -> '`' / False
a -> 97 -> 'a' / False
b -> 98 -> 'b' / False
c -> 99 -> 'c' / False
d -> 100 -> 'd' / False
e -> 101 -> 'e' / False
f -> 102 -> 'f' / False
g -> 103 -> 'g' / False
h -> 104 -> 'h' / False
i -> 105 -> 'i' / False
j -> 106 -> 'j' / False
k -> 107 -> 'k' / False
l -> 108 -> 'l' / False
m -> 109 -> 'm' / False
n -> 110 -> 'n' / False
o -> 111 -> 'o' / False
p -> 112 -> 'p' / False
q -> 113 -> 'q' / False
r -> 114 -> 'r' / False
s -> 115 -> 's' / False
t -> 116 -> 't' / False
u -> 117 -> 'u' / False
v -> 118 -> 'v' / False
w -> 119 -> 'w' / False
x -> 120 -> 'x' / False
y -> 121 -> 'y' / False
z -> 122 -> 'z' / False
{ -> 123 -> '{' / False
| -> 124 -> '|' / False
} -> 125 -> '}' / False
~ -> 126 -> '~' / False
ġ -> 127 -> '\x7f' / False
Ģ -> 128 -> '\x80' / False
ģ -> 129 -> '\x81' / False
Ĥ -> 130 -> '\x82' / False
ĥ -> 131 -> '\x83' / False
Ħ -> 132 -> '\x84' / False
ħ -> 133 -> '\x85' / False
Ĩ -> 134 -> '\x86' / False
ĩ -> 135 -> '\x87' / False
Ī -> 136 -> '\x88' / False
ī -> 137 -> '\x89' / False
Ĭ -> 138 -> '\x8a' / False
ĭ -> 139 -> '\x8b' / False
Į -> 140 -> '\x8c' / False
į -> 141 -> '\x8d' / False
İ -> 142 -> '\x8e' / False
ı -> 143 -> '\x8f' / False
Ĳ -> 144 -> '\x90' / False
ĳ -> 145 -> '\x91' / False
Ĵ -> 146 -> '\x92' / False
ĵ -> 147 -> '\x93' / False
Ķ -> 148 -> '\x94' / False
ķ -> 149 -> '\x95' / False
ĸ -> 150 -> '\x96' / False
Ĺ -> 151 -> '\x97' / False
ĺ -> 152 -> '\x98' / False
Ļ -> 153 -> '\x99' / False
ļ -> 154 -> '\x9a' / False
Ľ -> 155 -> '\x9b' / False
ľ -> 156 -> '\x9c' / False
Ŀ -> 157 -> '\x9d' / False
ŀ -> 158 -> '\x9e' / False
Ł -> 159 -> '\x9f' / False
ł -> 160 -> '\xa0' / False
¡ -> 161 -> '¡' / False
¢ -> 162 -> '¢' / False
£ -> 163 -> '£' / False
¤ -> 164 -> '¤' / False
¥ -> 165 -> '¥' / False
¦ -> 166 -> '¦' / False
§ -> 167 -> '§' / False
¨ -> 168 -> '¨' / False
© -> 169 -> '©' / False
ª -> 170 -> 'ª' / False
« -> 171 -> '«' / False
¬ -> 172 -> '¬' / False
Ń -> 173 -> '\xad' / False
® -> 174 -> '®' / False
¯ -> 175 -> '¯' / False
° -> 176 -> '°' / False
± -> 177 -> '±' / False
ń -> 178 -> '²' / False
Ņ -> 179 -> '³' / False
´ -> 180 -> '´' / False
µ -> 181 -> 'µ' / False
¶ -> 182 -> '¶' / False
· -> 183 -> '·' / False
¸ -> 184 -> '¸' / False
ņ -> 185 -> '¹' / False
º -> 186 -> 'º' / False
» -> 187 -> '»' / False
Ň -> 188 -> '¼' / False
ň -> 189 -> '½' / False
ŉ -> 190 -> '¾' / False
¿ -> 191 -> '¿' / False
À -> 192 -> 'À' / False
Á -> 193 -> 'Á' / False
Â -> 194 -> 'Â' / False
Ã -> 195 -> 'Ã' / False
Ä -> 196 -> 'Ä' / False
Å -> 197 -> 'Å' / False
Æ -> 198 -> 'Æ' / False
Ç -> 199 -> 'Ç' / False
È -> 200 -> 'È' / False
É -> 201 -> 'É' / False
Ê -> 202 -> 'Ê' / False
Ë -> 203 -> 'Ë' / False
Ì -> 204 -> 'Ì' / False
Í -> 205 -> 'Í' / False
Î -> 206 -> 'Î' / False
Ï -> 207 -> 'Ï' / False
Ð -> 208 -> 'Ð' / False
Ñ -> 209 -> 'Ñ' / False
Ò -> 210 -> 'Ò' / False
Ó -> 211 -> 'Ó' / False
Ô -> 212 -> 'Ô' / False
Õ -> 213 -> 'Õ' / False
Ö -> 214 -> 'Ö' / False
× -> 215 -> '×' / False
Ø -> 216 -> 'Ø' / False
Ù -> 217 -> 'Ù' / False
Ú -> 218 -> 'Ú' / False
Û -> 219 -> 'Û' / False
Ü -> 220 -> 'Ü' / False
Ý -> 221 -> 'Ý' / False
Þ -> 222 -> 'Þ' / False
ß -> 223 -> 'ß' / False
à -> 224 -> 'à' / False
á -> 225 -> 'á' / False
â -> 226 -> 'â' / False
ã -> 227 -> 'ã' / False
ä -> 228 -> 'ä' / False
å -> 229 -> 'å' / False
æ -> 230 -> 'æ' / False
ç -> 231 -> 'ç' / False
è -> 232 -> 'è' / False
é -> 233 -> 'é' / False
ê -> 234 -> 'ê' / False
ë -> 235 -> 'ë' / False
ì -> 236 -> 'ì' / False
í -> 237 -> 'í' / False
î -> 238 -> 'î' / False
ï -> 239 -> 'ï' / False
ð -> 240 -> 'ð' / False
ñ -> 241 -> 'ñ' / False
ò -> 242 -> 'ò' / False
ó -> 243 -> 'ó' / False
ô -> 244 -> 'ô' / False
õ -> 245 -> 'õ' / False
ö -> 246 -> 'ö' / False
÷ -> 247 -> '÷' / False
ø -> 248 -> 'ø' / False
ù -> 249 -> 'ù' / False
ú -> 250 -> 'ú' / False
û -> 251 -> 'û' / False
ü -> 252 -> 'ü' / False
ý -> 253 -> 'ý' / False
þ -> 254 -> 'þ' / False
ÿ -> 255 -> 'ÿ' / False
```

Diff

```python
# Diff
vocab -> Representing byte -> ascii / is_number
² -> ń -> 178 -> '²' / False
³ -> Ņ -> 179 -> '³' / False
¹ -> ņ -> 185 -> '¹' / False
¼ -> Ň -> 188 -> '¼' / False
½ -> ň -> 189 -> '½' / False
¾ -> ŉ -> 190 -> '¾' / False
```
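
The six changed entries in the diff can be derived mechanically. The sketch below is an illustration in Python, not the code of this PR: it assumes the table is built in the GPT-2 style and that "is_number" matches Python's str.isnumeric(). The skip_numeric flag is a hypothetical parameter introduced only for this example.

```python
def bytes_to_unicode(skip_numeric=False):
    """Byte -> stand-in character table in the GPT-2 style.

    skip_numeric is a hypothetical flag for this sketch: when set, non-ASCII
    bytes whose own character is numeric get a shifted stand-in as well.
    """
    keep = set(range(0x21, 0x7F)) | set(range(0xA1, 0xAD)) | set(range(0xAE, 0x100))
    table, shift = {}, 0
    for b in range(256):
        # ASCII digits 0-9 are left alone; only non-ASCII numerics are remapped.
        numeric = skip_numeric and b > 0x7F and chr(b).isnumeric()
        if b in keep and not numeric:
            table[b] = chr(b)
        else:
            table[b] = chr(256 + shift)
            shift += 1
    return table


before = bytes_to_unicode()
after = bytes_to_unicode(skip_numeric=True)

# Collect every byte whose stand-in character changed.
diff = {b: (before[b], after[b]) for b in range(256) if before[b] != after[b]}
for b, (old, new) in diff.items():
    print(f"{old} -> {new} -> {b}")
```

Under these assumptions only bytes 178, 179, 185, 188, 189 and 190 change, and no stand-in for a byte above 127 is numeric any more, so a digit split that runs after the byte-level step only ever sees real ASCII digits.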
ArthurZucker commented 6 months ago

Hey, sorry for the delay, and thanks for opening this. You might be right on this, but it would be a bit breaking, don't you think? 🤗