Closed: Kreevoz closed this issue 2 years ago.
Strange, this is what I get from the tokenizer for that prompt:
tokenizing text
['Ġa']
['Ġcomfy']
['Ġchair']
['Ġthat']
['Ġlooks']
['Ġlike']
['Ġan']
['Ġavocado']
text tokens [0, 58, 29872, 2408, 766, 4126, 1572, 101, 16632, 2]
Strange indeed... I can't explain this behavior either. It doesn't seem to be an issue with parsing the text from the command line. I'll try setting up a few virtual environments with different Python versions. 🤔
Exact same result using Python 3.9 in a fresh virtual environment. Edit: Tested 3.9.13 and 3.9.7, same behavior, so it is not a recent bug I suppose?
Do any packages here look out of order? (Other than the jaxlib which I installed manually from https://github.com/cloudhan/jax-windows-builder )
(vENV39) PS Y:\min-dalle> pip freeze
absl-py==1.1.0
certifi==2022.6.15
charset-normalizer==2.0.12
chex==0.1.3
cycler==0.11.0
dm-tree==0.1.7
etils==0.6.0
flatbuffers==2.0
flax==0.4.2
fonttools==4.33.3
idna==3.3
importlib-resources==5.8.0
jax==0.3.14
jaxlib @ file:///Y:/jaxlib-0.3.14%2Bcuda11.cudnn82-cp39-none-win_amd64.whl
kiwisolver==1.4.3
matplotlib==3.5.2
msgpack==1.0.4
numpy==1.23.0
opt-einsum==3.3.0
optax==0.1.2
packaging==21.3
Pillow==9.1.1
pyparsing==3.0.9
python-dateutil==2.8.2
requests==2.28.0
scipy==1.8.1
six==1.16.0
toolz==0.11.2
torch==1.12.0+cu116
torchaudio==0.12.0+cu116
torchvision==0.13.0+cu116
typing_extensions==4.2.0
urllib3==1.26.9
zipp==3.8.0
I've added a couple more print statements to see what the tokenizer is up to:
tokenizing text
['Ġ', 'a']
['Ġ', 'a']
['Ġ', 'a']
['Ġ', 'c', 'o', 'm', 'f', 'y']
['Ġ', 'c', 'o', 'm', 'f', 'y']
['Ġ', 'c', 'om', 'f', 'y']
['Ġ', 'com', 'f', 'y']
['Ġ', 'com', 'fy']
['Ġ', 'com', 'fy']
['Ġ', 'c', 'h', 'a', 'i', 'r']
['Ġ', 'c', 'h', 'a', 'i', 'r']
['Ġ', 'ch', 'a', 'i', 'r']
['Ġ', 'ch', 'a', 'ir']
['Ġ', 'ch', 'air']
['Ġ', 'chair']
['Ġ', 'chair']
['Ġ', 't', 'h', 'a', 't']
['Ġ', 't', 'h', 'a', 't']
['Ġ', 't', 'h', 'at']
['Ġ', 'th', 'at']
['Ġ', 'th', 'at']
['Ġ', 'l', 'o', 'o', 'k', 's']
['Ġ', 'l', 'o', 'o', 'k', 's']
['Ġ', 'l', 'o', 'ok', 's']
['Ġ', 'l', 'ook', 's']
['Ġ', 'look', 's']
['Ġ', 'look', 's']
['Ġ', 'l', 'i', 'k', 'e']
['Ġ', 'l', 'i', 'k', 'e']
['Ġ', 'l', 'ik', 'e']
['Ġ', 'l', 'ike']
['Ġ', 'like']
['Ġ', 'like']
['Ġ', 'a', 'n']
['Ġ', 'a', 'n']
['Ġ', 'an']
['Ġ', 'an']
['Ġ', 'a', 'v', 'o', 'c', 'a', 'd', 'o']
['Ġ', 'a', 'v', 'o', 'c', 'a', 'd', 'o']
['Ġ', 'a', 'v', 'o', 'c', 'ad', 'o']
['Ġ', 'a', 'v', 'oc', 'ad', 'o']
['Ġ', 'av', 'oc', 'ad', 'o']
['Ġ', 'av', 'oc', 'ado']
['Ġ', 'av', 'oc', 'ado']
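In other words, the tokenizer starts each word from single characters and repeatedly merges the adjacent pair with the best rank from the merge table. A rough sketch of that loop (not the repo's actual code; the merge ranks below are made up purely to reproduce the 'avocado' steps shown above):

def bpe_sketch(word, merge_ranks):
    tokens = ["Ġ"] + list(word)  # Ġ marks a leading space, GPT-2 style
    while True:
        pairs = [(tokens[i], tokens[i + 1]) for i in range(len(tokens) - 1)]
        ranked = [(merge_ranks[p], i) for i, p in enumerate(pairs) if p in merge_ranks]
        if not ranked:
            break  # no mergeable pair left
        _, i = min(ranked)  # lowest rank = highest merge priority
        tokens = tokens[:i] + [tokens[i] + tokens[i + 1]] + tokens[i + 2:]
        print(tokens)  # same kind of step-by-step output as above
    return tokens

# made-up ranks; the real ones come from merges.txt
ranks = {("a", "d"): 0, ("o", "c"): 1, ("a", "v"): 2, ("ad", "o"): 3,
         ("av", "oc"): 4, ("avoc", "ado"): 5, ("Ġ", "avocado"): 6}
print(bpe_sketch("avocado", ranks))  # ends at ['Ġavocado'] when the Ġ entries load correctly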
Why would your tokenizer fail to complete the long words when executing on my hardware? 😵
Not sure. It works properly in colab too: https://colab.research.google.com/github/kuprel/min-dalle/blob/main/min_dalle.ipynb
I've got the same issue, also on Windows (running on CPU only, as I have an AMD GPU)... I'll check my packages to see if there are any other similarities between your setup and mine!
I found a pattern for this bug!
Vocabulary entries that begin with Ġ are not being matched by the tokenizer.
Tokens that do not start with that symbol assemble successfully into long words. (I don't know why the vocabulary contains both types of entries.)
For instance: "project", "record", "management" will assemble into valid tokens. But: "projections", "recordings", "manage" will not, because they are only listed as "Ġprojections", "Ġrecordings", "Ġmanage" in the json files.
That is why it breaks up words into such odd chunks. It can only pick the ones that start without that special character!
So there must be platform differences between Linux and Windows in how that accented Ġ is parsed. Can you account for this in your tokenizer? Can we strip that out?
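A quick check from a Python prompt should make the pattern obvious (illustrative only; vocab being the dict that min_dalle.py loads from vocab.json):
print("project" in vocab)   # expected True: the bare entry exists, so it still assembles
print("Ġmanage" in vocab)   # expected False here, even though vocab.json clearly contains "Ġmanage"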
Do you get that Ġ character when you run this in Python?
print(chr(ord(" ") + 256))
Affirmative.
>>> print(chr(ord(" ") + 256))
Ġ
If you can figure out what will make it work on Windows, let me know. I don't have any Windows machines.
Yes, I got a fix. It was one of those annoying OS-specific things indeed.
You need to explicitly specify that the JSON files get parsed as UTF-8.
On Windows, files are read using the system locale encoding unless one is specified (usually cp1252 or similar for English installs, different code pages for other languages). This causes the Ġ character to get lost or garbled.
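To see the mismatch directly (illustration only, not part of the repo): Ġ is U+0120, i.e. chr(ord(" ") + 256), which is two bytes in UTF-8, and those bytes decode to something different under the locale code page that open() falls back to by default:

import locale
print(locale.getpreferredencoding(False))  # open()'s default; 'cp1252' on many English Windows installs

raw = "Ġprojections".encode("utf-8")
print(raw.decode("utf-8"))    # 'Ġprojections' -> matches the key in vocab.json
print(raw.decode("cp1252"))   # 'Ä projections' -> matches nothing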
The fix is easily added in lines 16, 18 and 20 of the ./min_dalle/min_dalle.py file:
with open(os.path.join(model_path, 'config.json'), 'r', encoding='utf8') as f:
self.config = json.load(f)
with open(os.path.join(model_path, 'vocab.json'), 'r', encoding='utf8') as f:
vocab = json.load(f)
with open(os.path.join(model_path, 'merges.txt'), 'r', encoding='utf8') as f:
This should not negatively impact how the code executes under Linux. The output now conforms to your examples when executing on Windows, and the tokens are correct.
Awesome thanks. I just updated it. Does it work now?
Yep, just freshly cloned your repo to make sure it's all okay, and it is. Windows users may rejoice now!
I'm running into unexpected behavior of the text tokenizer, running this on Windows, Python 3.7, in a virtual environment, using the supplied image_from_text.py script.
The input text is tokenized in a way that breaks up the words, thus preventing the output from actually depicting what was requested:
'a comfy chair that looks like an avocado'
'alien life'
Since the wrong tokens were chosen, the model returns a generic gamer chair for the first prompt and some petri dish for the second, which is expected given the garbled tokens.
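For reference, I'm invoking the script roughly like this (the flag name is an assumption based on the README and may differ):
python image_from_text.py --text='a comfy chair that looks like an avocado'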
I checked that the tokenizer.json files were downloaded correctly for both the mini and mega models, and they were: manually searching for the words in them finds them without any issue. Is there a specific dependency for the text tokenizer that I'm unaware of, or is this simply a bug?