latitudegames / GPT-3-Encoder

Javascript BPE Encoder Decoder for GPT-2 / GPT-3
MIT License
716 stars 196 forks source link

Possible bug with BPE RegEx #5

Closed caleb-artifact closed 3 years ago

caleb-artifact commented 3 years ago

I want to start of by saying thanks so much for providing this library and doing all the hard work of converting the GPT-2 byte pair encoder from Python to JavaScript.

When going through the code and comparing it to the original GPT-2 implementation I noticed a difference between the RegEx used in this repository vs the original. The original GPT-2 regex contains the substring |'ll| in their regex:

https://github.com/openai/gpt-2/blob/a74da5d99abaaba920de8131d64da2862a8f213b/src/encoder.py#L53

However in this repo there is a space in between the l's:

https://github.com/latitudegames/GPT-3-Encoder/blob/79387f4643d44117b67e55006acdeb758edd9e5d/encoder.py#L51

It also exists in the the JavaScript version:

https://github.com/latitudegames/GPT-3-Encoder/blob/79387f4643d44117b67e55006acdeb758edd9e5d/Encoder.js#L68

We also looked at how hugging face has implemented it and it also does not contain a space:

https://github.com/huggingface/transformers/blob/ac588594e29f39a235433f00108d1416fb7c7fe5/src/transformers/models/gpt2/tokenization_gpt2.py#L193

If the space is intentional I'd be very curious to understand why it exists :)

Thanks!