Closed vitojph closed 5 years ago
Just pushed a quick fix that should solve the problem. Please install Ludwig from the latest commit on master with pip install git+https://github.com/uber/ludwig.git
and let me know if the problem is solved.
Hi,
Great, thanks for the fix! It works now!
However, I just found out that some GloVe files don't have a regular format. For instance, glove.840B.300d.zip
is supposed to contain 301-item lines (the token and its 300 weights), but some lines contain more than one string
token. As a consequence, converting some of these unexpected strings into floats in this line raises an exception.
Maybe we can safely ignore the lines that don't match the expected embedding size.
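One way to do that (a hedged sketch, not Ludwig's actual load_glove implementation; the function name and signature here are made up for illustration) is to skip any line whose field count differs from embedding_size + 1:

```python
import numpy as np

def load_glove_tolerant(path, embedding_size=300):
    """Load GloVe vectors, skipping lines with an unexpected field count."""
    embeddings = {}
    with open(path, 'r', encoding='utf-8') as f:
        for n, line in enumerate(f):
            split = line.split()
            # a well-formed line is: token followed by embedding_size floats
            if len(split) != embedding_size + 1:
                print("skipping malformed line {}: {} fields".format(n, len(split)))
                continue
            token, weights = split[0], split[1:]
            embeddings[token] = np.array(weights, dtype=np.float32)
    return embeddings
```

The downside of silently skipping is that a few real tokens are lost, so printing a warning (as above) keeps the behavior visible.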
Thanks again!
@vitojph could you please post here some example lines that raise exceptions, so that I can figure out how to account for them?
hi @w4nderlust,
When loading e.g. Common Crawl (840B tokens, 2.2M vocab, cased, 300d vectors, 2.03 GB download): glove.840B.300d.zip, some lines seem to be ill-formed.
Take a look:
with open(PATH_TO_EMBEDDINGS, 'r', encoding='utf-8') as f:
    for n, line in enumerate(f):
        split = line.split()
        if len(split) != 301:
            print("line no: {} items per line: {} sample: {}".format(n, len(split), split[:10]))
line no: 52343 items per line: 303 sample: ['.', '.', '.', '-0.1573', '-0.29517', '0.30453', '-0.54773', '0.098293', '-0.1776', '0.21662']
line no: 128261 items per line: 302 sample: ['at', 'name@domain.com', '0.0061218', '0.39595', '-0.22079', '0.78149', '0.38759', '0.28888', '0.18495', '-0.37328']
line no: 142318 items per line: 300 sample: ['0.20785', '0.2703', '0.93632', '-0.50861', '-0.36674', '-0.042177', '-0.37699', '0.051295', '0.61275', '-0.42422']
line no: 151102 items per line: 305 sample: ['.', '.', '.', '.', '.', '-0.23773', '-0.82788', '0.82326', '-0.91878', '0.35868']
line no: 200668 items per line: 302 sample: ['to', 'name@domain.com', '0.33865', '0.12698', '-0.16885', '0.55476', '0.48296', '0.45018', '0.0094233', '-0.36575']
line no: 209833 items per line: 302 sample: ['.', '.', '0.035974', '-0.024421', '0.71402', '-0.61127', '0.012771', '-0.11201', '0.16847', '-0.14069']
line no: 220779 items per line: 304 sample: ['.', '.', '.', '.', '0.033459', '-0.085658', '0.27155', '-0.56132', '0.60419', '-0.027276']
line no: 253461 items per line: 302 sample: ['email', 'name@domain.com', '0.33529', '0.32949', '0.2646', '0.64219', '0.70701', '-0.074487', '-0.066128', '-0.30804']
line no: 263028 items per line: 300 sample: ['0.39511', '0.37458', '0.24418', '-0.11774', '-0.22022', '-0.14198', '0.22348', '0.66478', '-0.055946', '-0.77057']
line no: 365745 items per line: 302 sample: ['or', 'name@domain.com', '0.48374', '0.49669', '-0.25089', '0.90389', '0.60307', '0.11141', '-0.021157', '0.10037']
line no: 484922 items per line: 300 sample: ['0.13211', '0.19999', '0.37907', '-1.0064', '-0.40911', '-0.51834', '-0.0023625', '0.72729', '0.32459', '-0.92157']
line no: 532048 items per line: 302 sample: ['contact', 'name@domain.com', '0.016426', '0.13728', '0.18781', '0.75784', '0.44012', '0.096794', '0.060987', '0.31293']
line no: 538123 items per line: 300 sample: ['-0.38024', '0.61431', '0.81146', '-0.76394', '-0.19657', '0.11078', '-0.48388', '0.20633', '0.29338', '-1.1915']
line no: 557081 items per line: 300 sample: ['-0.0033421', '0.4899', '1.119', '-1.1039', '-0.43012', '-0.10575', '-0.41147', '0.41198', '0.4217', '-1.1474']
line no: 717302 items per line: 302 sample: ['Email', 'name@domain.com', '0.37344', '0.024573', '-0.12583', '0.36009', '0.25605', '0.07326', '0.3292', '-0.0037022']
The load_glove function fails when converting to float some of the strings in the lines that don't have exactly 301 items.
I see. Thanks for inspecting this. I will do two things: 1) reproduce your result and try to figure out whether those are actually malformed lines or whether there are weird UTF-8 characters that the split function doesn't recognize (or uses for splitting); 2) add more checks, discarding lines that are malformed. Stay tuned.
So it turned out there are some weird UTF-8 space characters in that file. The likely reason is that the GloVe C code splits using only the ASCII space. I initially implemented a solution that dealt with this by joining all the elements of the split that were not part of the embedding, but that was slow and unnecessary, because those characters would never appear in tokens in Ludwig given how Ludwig's tokenizers work. So in the end I just skip those "malformed" lines and print a warning. @vitojph please let me know if everything now works as expected.
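For reference, the root cause is easy to demonstrate: Python's str.split() with no argument splits on any Unicode whitespace, while split(' ') splits only on the ASCII space. The NO-BREAK SPACE (U+00A0) below is an illustrative stand-in for whatever whitespace characters actually appear in the file:

```python
# A token containing a no-break space (U+00A0). GloVe's C tokenizer splits
# only on ASCII spaces, so such a token can survive intact in the file.
line = ".\u00a0. 0.1 0.2 0.3"

# split() treats any Unicode whitespace as a separator, fragmenting the token
print(line.split())      # ['.', '.', '0.1', '0.2', '0.3']

# split(' ') splits only on the ASCII space, so the token stays whole
print(line.split(' '))   # ['.\xa0.', '0.1', '0.2', '0.3']
```

This is why a line that GloVe wrote as 301 ASCII-space-separated fields can show up as 302 or more items after a plain split() in Python.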
@w4nderlust I am trying to load embeddings for my text input features, but as soon as I start to train the model I get the following error:
UnicodeDecodeError: 'charmap' codec can't decode byte 0x9d in position 2438: character maps to <undefined>
I already tried installing again with pip install git+https://github.com/uber/ludwig.git
but it still doesn't work.
The commoncrawl embeddings file is from https://nlp.stanford.edu/projects/glove/
@abc1110 use the code from master: pip uninstall ludwig && pip install git+https://github.com/uber/ludwig.git
The issue has been solved in master; I was able to load all the pretrained GloVe embeddings files.
@w4nderlust I have tried that solution, but I am using the programmatic API, where this issue persists.
@abc1110 the command line and the API use exactly the same code. Please make sure you are running the updated code. If you are and still get an error, please provide a reproducible example, because I tested with all those pretrained embeddings files and they all work.
Hi all,
I'd like to use GloVe vectors as pretrained embeddings when training a text classifier. I downloaded the glove.840B.300d.zip vectors, unzipped them, and added the following lines to my model definition file:
This is the error I get as soon as I launch the training:
All my files are encoded in UTF-8. I'm using Debian stable (Stretch), Python 3.6.8, ludwig 0.1.2 and TF 1.13.1.
Any ideas? Thanks in advance,