Open repodiac opened 5 years ago
Hi @repodiac, it seems you are using the older scripts: python -m ulmfit lm doesn't look like the new framework. You might want to try running the current framework and see if the issue is solved there.
I think I've seen "'ascii' codec can't encode characters" before, when loading tokenized datasets. The solution was to remove the cache files made by an older fastai version and recreate them. The new framework does this automatically.
Let me know if that helped.
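For the older scripts, the cache-clearing step could be sketched as follows. This is a minimal sketch and not part of MultiFiT: clear_cache is a hypothetical helper, and it assumes the stale tokenization cache lives under the path that the training log prints as "Cache dir" (data/wiki/en-100/models/f60k in the log of this issue). In that log the model dir sits inside the cache dir, so any saved model there would be removed as well; the helper therefore defaults to a dry run that only lists what it would delete.

```python
import shutil
from pathlib import Path


def clear_cache(cache_dir, dry_run=True):
    """List (and optionally delete) everything directly under cache_dir.

    Hypothetical helper: pass whatever the training log prints as
    "Cache dir". The directory layout is an assumption, not MultiFiT API.
    """
    cache_path = Path(cache_dir)
    if not cache_path.is_dir():
        return []  # nothing has been cached yet
    entries = sorted(cache_path.iterdir())
    if not dry_run:
        for entry in entries:
            if entry.is_dir() and not entry.is_symlink():
                shutil.rmtree(entry)  # remove a cached sub-directory
            else:
                entry.unlink()  # remove a cached file or symlink
    return [str(entry) for entry in entries]
```

Calling clear_cache("data/wiki/en-100/models/f60k") only reports the entries; passing dry_run=False deletes them, after which the next training run has to rebuild the tokenized data.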
I cannot get MultiFiT to work in my environment :-(
What I did was...
pytest .
or training according to the example:
python -m ulmfit lm --dataset-path data/wiki/${LANG}-100 --tokenizer='f' --nl 3 --name 'orig' --max-vocab 60000 \
  --lang ${LANG} --qrnn=False - train 10 --bs=50 --drop_mult=0 --label-smoothing-eps=0.0
RESULT: I always get a Unicode error (UnicodeEncodeError or UnicodeDecodeError), e.g. with the training command:
Max vocab: 60000
Cache dir: data/wiki/en-100/models/f60k
Model dir: data/wiki/en-100/models/f60k/lstm_orig.m
Wiki text was split to 28476 articles
Wiki text was split to 60 articles
Running tokenization lm...
Traceback (most recent call last):
  File "/home/user/miniconda/envs/py36/lib/python3.6/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/home/user/miniconda/envs/py36/lib/python3.6/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/app/work/ulmfit/__main__.py", line 188, in <module>
    fire.Fire(ULMFiT())
  File "/home/user/miniconda/envs/py36/lib/python3.6/site-packages/fire/core.py", line 138, in Fire
    component_trace = _Fire(component, args, parsed_flag_args, context, name)
  File "/home/user/miniconda/envs/py36/lib/python3.6/site-packages/fire/core.py", line 471, in _Fire
    target=component.__name__)
  File "/home/user/miniconda/envs/py36/lib/python3.6/site-packages/fire/core.py", line 675, in _CallAndUpdateTrace
    component = fn(*varargs, **kwargs)
  File "/app/work/ulmfit/pretrain_lm.py", line 164, in train_lm
    data_lm = self.load_wiki_data(bs=bs) if data_lm is None else data_lm
  File "/app/work/ulmfit/pretrain_lm.py", line 246, in load_wiki_data
    **args)
  File "/app/work/ulmfit/pretrain_lm.py", line 254, in lm_databunch
    return self.databunch(name, bunch_class=TextLMDataBunch, *args, **kwargs)
  File "/app/work/ulmfit/pretrain_lm.py", line 279, in databunch
    **args)
  File "/home/user/miniconda/envs/py36/lib/python3.6/site-packages/fastai/text/data.py", line 202, in from_df
    if cls==TextLMDataBunch: src = src.label_for_lm()
  File "/home/user/miniconda/envs/py36/lib/python3.6/site-packages/fastai/data_block.py", line 480, in _inner
    self.process()
  File "/home/user/miniconda/envs/py36/lib/python3.6/site-packages/fastai/data_block.py", line 534, in process
    for ds,n in zip(self.lists, ['train','valid','test']): ds.process(xp, yp, name=n)
  File "/home/user/miniconda/envs/py36/lib/python3.6/site-packages/fastai/data_block.py", line 714, in process
    self.x.process(xp)
  File "/home/user/miniconda/envs/py36/lib/python3.6/site-packages/fastai/data_block.py", line 84, in process
    for p in self.processor: p.process(self)
  File "/home/user/miniconda/envs/py36/lib/python3.6/site-packages/fastai/text/data.py", line 296, in process
    for i in progress_bar(range(0,len(ds),self.chunksize), leave=False):
  File "/home/user/miniconda/envs/py36/lib/python3.6/site-packages/fastprogress/fastprogress.py", line 75, in __iter__
    if self.auto_update: self.update(i+1)
  File "/home/user/miniconda/envs/py36/lib/python3.6/site-packages/fastprogress/fastprogress.py", line 92, in update
    self.update_bar(val)
  File "/home/user/miniconda/envs/py36/lib/python3.6/site-packages/fastprogress/fastprogress.py", line 104, in update_bar
    else: self.on_update(val, f'{100 * val/self.total:.2f}% [{val}/{self.total} {elapsed_t}<{remaining_t}{end}]')
  File "/home/user/miniconda/envs/py36/lib/python3.6/site-packages/fastprogress/fastprogress.py", line 274, in on_update
    if printing(): WRITER_FN(to_write, end = '\r')
UnicodeEncodeError: 'ascii' codec can't encode characters in position 3-35: ordinal not in range(128)
or with the tests (any!):
self = <encodings.ascii.IncrementalDecoder object at 0x7fdcdf958e10>
input = b' \n = Valkyria Chronicles III = \n \n Senj\xc5\x8d no Valkyria 3 : <unk> Chronicles ( Japanese : \xe6\x88\xa6\xe5\xa...n force invading the Empire just following the two nations \' cease @-@ fire would certainly wreck their newfound peac'
final = False
def decode(self, input, final=False):
> return codecs.ascii_decode(input, self.errors)[0]
E UnicodeDecodeError: 'ascii' codec can't decode byte 0xc5 in position 39: ordinal not in range(128)
/home/user/miniconda/envs/py36/lib/python3.6/encodings/ascii.py:26: UnicodeDecodeError
Does anyone have a clue here? Thanks a lot in advance!
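Both tracebacks name the 'ascii' codec: one when fastprogress writes to standard output, the other when a test reads a file. One plausible cause (an assumption, nothing in this issue confirms it) is that the Python 3.6 process runs under a C/POSIX locale, so that standard output and files opened without an explicit encoding both default to ASCII. The following is a minimal sketch for checking the defaults and for reproducing the decode error with the first bytes of the test input shown in the pytest output; report_encodings is a hypothetical helper name.

```python
import locale
import sys


def report_encodings():
    """Return the encodings Python uses by default in this process."""
    return {
        # Used by print(); an ASCII value here would match the UnicodeEncodeError.
        "stdout": getattr(sys.stdout, "encoding", None),
        # Used by open() when no encoding= is passed; an ASCII value here
        # would match the UnicodeDecodeError raised in the tests.
        "preferred": locale.getpreferredencoding(False),
        "filesystem": sys.getfilesystemencoding(),
    }


# First bytes of the test input shown in the pytest output.
sample = b" \n = Valkyria Chronicles III = \n \n Senj\xc5\x8d no Valkyria 3"

try:
    sample.decode("ascii")
    ascii_error = None
except UnicodeDecodeError as exc:
    ascii_error = exc  # byte 0xc5 starts a two-byte UTF-8 sequence

# The same bytes decode cleanly as UTF-8, so the data itself is not corrupt.
decoded = sample.decode("utf-8")

if __name__ == "__main__":
    print(report_encodings())
    print(ascii_error)
```

If the reported encodings are ASCII (on Linux often shown as ANSI_X3.4-1968), selecting a UTF-8 locale before starting Python (for example LC_ALL=C.UTF-8 where that locale exists) or setting PYTHONIOENCODING=utf-8 for standard output is a common remedy. Whether that resolves this particular setup has not been verified here.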