ludwig-ai / ludwig

Low-code framework for building custom LLMs, neural networks, and other AI models
http://ludwig.ai
Apache License 2.0

UnicodeDecodeError when loading pretrained GloVe embeddings #336

Closed vitojph closed 5 years ago

vitojph commented 5 years ago

Hi all,

I'd like to use GloVe vectors as pretrained embeddings when training a text classifier. I downloaded the glove.840B.300d.zip vectors, unzipped them, and added the following lines to my model definition file:

input_features:
    -
        name: text
        type: text
        encoder: rnn
        cell_type: lstm_cudnn
        pretrained_embeddings: /home/victor/data/glove.840B.300d.txt

This is the error I get as soon as I launch the training:

embeddings = load_glove(embeddings_path)
  File "/home/victor/ludwig/lib/python3.6/site-packages/ludwig/utils/data_utils.py", line 178, in load_glove
    for line in f:
  File "/usr/local/lib/python3.6/encodings/ascii.py", line 26, in decode
    return codecs.ascii_decode(input, self.errors)[0]
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 1387: ordinal not in range(128)

All my files are encoded in UTF-8. I'm using Debian stable (Stretch), Python 3.6.8, ludwig 0.1.2 and TF 1.13.1.

Any ideas? Thanks in advance,

w4nderlust commented 5 years ago

Just pushed a quick fix that should solve the problem. Please install Ludwig from the latest commit on master with pip install git+https://github.com/uber/ludwig.git and let me know if the problem is solved.
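
For reference, the error came from opening the embeddings file with the platform default encoding (ASCII on this system). A minimal sketch of the kind of change such a fix involves, assuming the loader just needs an explicit encoding (not necessarily the exact patch):

embeddings_path = '/home/victor/data/glove.840B.300d.txt'

# Opening with the default encoding fails on non-ASCII bytes when the
# locale default is ASCII:
#     with open(embeddings_path, 'r') as f: ...
# Forcing UTF-8 makes the read independent of the locale:
with open(embeddings_path, 'r', encoding='utf-8') as f:
    for line in f:
        split = line.split()  # token followed by its 300 weights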

vitojph commented 5 years ago

Hi,

Great, thanks for the fix! It works now!

However, I just found out that some GloVe files don't have a regular format. For instance, glove.840B.300d.zip is supposed to contain 301-item lines (the token and its 300 weights), but some lines contain more than one string token. As a consequence, when load_glove tries to convert these unexpected strings into floats in this line, it raises an exception.

Maybe we can safely ignore those lines not matching the expected embedding size.

Thanks again!

w4nderlust commented 5 years ago

@vitojph could you please post here some example lines that raise exceptions, so that I can figure out how to account for them?

vitojph commented 5 years ago

Hi @w4nderlust,

When loading, e.g., the Common Crawl vectors (840B tokens, 2.2M vocab, cased, 300d vectors, 2.03 GB download, glove.840B.300d.zip), some lines seem to be ill-formed.

Take a look:

PATH_TO_EMBEDDINGS = '/home/victor/data/glove.840B.300d.txt'

with open(PATH_TO_EMBEDDINGS, 'r', encoding='utf-8') as f:
    for n, line in enumerate(f):
        split = line.split()
        # a well-formed line is the token plus its 300 weights
        if len(split) != 301:
            print("line no: {} items per line: {} sample: {}".format(n, len(split), split[:10]))
line no: 52343 items per line: 303 sample: ['.', '.', '.', '-0.1573', '-0.29517', '0.30453', '-0.54773', '0.098293', '-0.1776', '0.21662']
line no: 128261 items per line: 302 sample: ['at', 'name@domain.com', '0.0061218', '0.39595', '-0.22079', '0.78149', '0.38759', '0.28888', '0.18495', '-0.37328']
line no: 142318 items per line: 300 sample: ['0.20785', '0.2703', '0.93632', '-0.50861', '-0.36674', '-0.042177', '-0.37699', '0.051295', '0.61275', '-0.42422']
line no: 151102 items per line: 305 sample: ['.', '.', '.', '.', '.', '-0.23773', '-0.82788', '0.82326', '-0.91878', '0.35868']
line no: 200668 items per line: 302 sample: ['to', 'name@domain.com', '0.33865', '0.12698', '-0.16885', '0.55476', '0.48296', '0.45018', '0.0094233', '-0.36575']
line no: 209833 items per line: 302 sample: ['.', '.', '0.035974', '-0.024421', '0.71402', '-0.61127', '0.012771', '-0.11201', '0.16847', '-0.14069']
line no: 220779 items per line: 304 sample: ['.', '.', '.', '.', '0.033459', '-0.085658', '0.27155', '-0.56132', '0.60419', '-0.027276']
line no: 253461 items per line: 302 sample: ['email', 'name@domain.com', '0.33529', '0.32949', '0.2646', '0.64219', '0.70701', '-0.074487', '-0.066128', '-0.30804']
line no: 263028 items per line: 300 sample: ['0.39511', '0.37458', '0.24418', '-0.11774', '-0.22022', '-0.14198', '0.22348', '0.66478', '-0.055946', '-0.77057']
line no: 365745 items per line: 302 sample: ['or', 'name@domain.com', '0.48374', '0.49669', '-0.25089', '0.90389', '0.60307', '0.11141', '-0.021157', '0.10037']
line no: 484922 items per line: 300 sample: ['0.13211', '0.19999', '0.37907', '-1.0064', '-0.40911', '-0.51834', '-0.0023625', '0.72729', '0.32459', '-0.92157']
line no: 532048 items per line: 302 sample: ['contact', 'name@domain.com', '0.016426', '0.13728', '0.18781', '0.75784', '0.44012', '0.096794', '0.060987', '0.31293']
line no: 538123 items per line: 300 sample: ['-0.38024', '0.61431', '0.81146', '-0.76394', '-0.19657', '0.11078', '-0.48388', '0.20633', '0.29338', '-1.1915']
line no: 557081 items per line: 300 sample: ['-0.0033421', '0.4899', '1.119', '-1.1039', '-0.43012', '-0.10575', '-0.41147', '0.41198', '0.4217', '-1.1474']
line no: 717302 items per line: 302 sample: ['Email', 'name@domain.com', '0.37344', '0.024573', '-0.12583', '0.36009', '0.25605', '0.07326', '0.3292', '-0.0037022']

The load_glove function fails when converting some of the strings to floats on those lines that don't have exactly 301 items.
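
For example, on line 128261 above the second field is 'name@domain.com', so the parse fails as soon as it reaches the supposed weights:

float('name@domain.com')
# ValueError: could not convert string to float: 'name@domain.com'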

w4nderlust commented 5 years ago

I see. Thanks for inspecting this. I will do two things: 1) reproduce your result and try to figure out whether those are actually malformed lines or whether there are weird UTF-8 characters that the split function doesn't recognize (or uses for splitting); 2) add more checks, discarding lines that are malformed. Stay tuned.

w4nderlust commented 5 years ago

So it turned out there are some weird UTF-8 space characters in that file. The reason is likely that the GloVe C code splits using only the ASCII space. I initially implemented a solution that dealt with it by joining all the elements of the split that were not part of the embedding weights, but that was slow and pointless, because those characters would never appear in tokens in Ludwig anyway, given how Ludwig's tokenizers work. So in the end I just skip those "malformed" lines and print a warning. @vitojph please let me know if everything now works as expected.
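
A minimal sketch of that skip-and-warn loading loop (hypothetical names and signature, not the actual Ludwig code):

import logging

logger = logging.getLogger(__name__)

def load_glove(embeddings_path, embedding_size=300):
    """Load GloVe vectors, skipping lines with an unexpected field count."""
    embeddings = {}
    with open(embeddings_path, 'r', encoding='utf-8') as f:
        for line_number, line in enumerate(f):
            split = line.split()
            # a well-formed line is one token plus embedding_size weights
            if len(split) != embedding_size + 1:
                logger.warning(
                    'Line %d has %d fields instead of %d, skipping it',
                    line_number, len(split), embedding_size + 1
                )
                continue
            embeddings[split[0]] = [float(v) for v in split[1:]]
    return embeddings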

abc1110 commented 5 years ago

@w4nderlust I am trying to load embeddings for my text input features, but as soon as I start to train the model I get the following error: UnicodeDecodeError: 'charmap' codec can't decode byte 0x9d in position 2438: character maps to <undefined>

I already tried it again by installing from pip install git+https://github.com/uber/ludwig.git, but it still doesn't work.

The Common Crawl embeddings file is from https://nlp.stanford.edu/projects/glove/

w4nderlust commented 5 years ago

@abc1110 use the code from master: pip uninstall ludwig && pip install git+http://github.com/uber/ludwig.git. The issue has been solved in master; I was able to load all the pretrained GloVe embeddings files.

abc1110 commented 5 years ago

@w4nderlust I have tried that solution, but I am using the programmatic API, where this issue persists.

w4nderlust commented 5 years ago

@abc1110 the command line and the API use exactly the same code. Please make sure you are running the updated code. If you are, and still get an error, please make a reproducible example, because I tested with all those pretrained embeddings files and they all work.