I want to start of by saying thanks so much for providing this library and doing all the hard work of converting the GPT-2 byte pair encoder from Python to JavaScript.
When going through the code and comparing it to the original GPT-2 implementation I noticed a difference between the RegEx used in this repository vs the original. The original GPT-2 regex contains the substring |'ll| in their regex:
I want to start of by saying thanks so much for providing this library and doing all the hard work of converting the GPT-2 byte pair encoder from Python to JavaScript.
When going through the code and comparing it to the original GPT-2 implementation I noticed a difference between the RegEx used in this repository vs the original. The original GPT-2 regex contains the substring
|'ll|
in their regex:https://github.com/openai/gpt-2/blob/a74da5d99abaaba920de8131d64da2862a8f213b/src/encoder.py#L53
However in this repo there is a space in between the l's:
https://github.com/latitudegames/GPT-3-Encoder/blob/79387f4643d44117b67e55006acdeb758edd9e5d/encoder.py#L51
It also exists in the the JavaScript version:
https://github.com/latitudegames/GPT-3-Encoder/blob/79387f4643d44117b67e55006acdeb758edd9e5d/Encoder.js#L68
We also looked at how hugging face has implemented it and it also does not contain a space:
https://github.com/huggingface/transformers/blob/ac588594e29f39a235433f00108d1416fb7c7fe5/src/transformers/models/gpt2/tokenization_gpt2.py#L193
If the space is intentional I'd be very curious to understand why it exists :)
Thanks!