n-waves / multifit

Code to reproduce the results from the paper "MultiFiT: Efficient Multi-lingual Language Model Fine-tuning" https://arxiv.org/abs/1909.04761
MIT License

Cannot run examples / pytest tests: #54

Open repodiac opened 4 years ago

repodiac commented 4 years ago

I cannot get MultiFiT to work in my environment :-(

What I did was...

RESULT: I always get a UnicodeEncodeError or UnicodeDecodeError

e.g. with the training command:

Max vocab: 60000
Cache dir: data/wiki/en-100/models/f60k
Model dir: data/wiki/en-100/models/f60k/lstm_orig.m
Wiki text was split to 28476 articles
Wiki text was split to 60 articles
Running tokenization lm...
Traceback (most recent call last):
  File "/home/user/miniconda/envs/py36/lib/python3.6/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/home/user/miniconda/envs/py36/lib/python3.6/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/app/work/ulmfit/__main__.py", line 188, in <module>
    fire.Fire(ULMFiT())
  File "/home/user/miniconda/envs/py36/lib/python3.6/site-packages/fire/core.py", line 138, in Fire
    component_trace = _Fire(component, args, parsed_flag_args, context, name)
  File "/home/user/miniconda/envs/py36/lib/python3.6/site-packages/fire/core.py", line 471, in _Fire
    target=component.__name__)
  File "/home/user/miniconda/envs/py36/lib/python3.6/site-packages/fire/core.py", line 675, in _CallAndUpdateTrace
    component = fn(*varargs, **kwargs)
  File "/app/work/ulmfit/pretrain_lm.py", line 164, in train_lm
    data_lm = self.load_wiki_data(bs=bs) if data_lm is None else data_lm
  File "/app/work/ulmfit/pretrain_lm.py", line 246, in load_wiki_data
    **args)
  File "/app/work/ulmfit/pretrain_lm.py", line 254, in lm_databunch
    return self.databunch(name, bunch_class=TextLMDataBunch, *args, **kwargs)
  File "/app/work/ulmfit/pretrain_lm.py", line 279, in databunch
    **args)
  File "/home/user/miniconda/envs/py36/lib/python3.6/site-packages/fastai/text/data.py", line 202, in from_df
    if cls==TextLMDataBunch: src = src.label_for_lm()
  File "/home/user/miniconda/envs/py36/lib/python3.6/site-packages/fastai/data_block.py", line 480, in _inner
    self.process()
  File "/home/user/miniconda/envs/py36/lib/python3.6/site-packages/fastai/data_block.py", line 534, in process
    for ds,n in zip(self.lists, ['train','valid','test']): ds.process(xp, yp, name=n)
  File "/home/user/miniconda/envs/py36/lib/python3.6/site-packages/fastai/data_block.py", line 714, in process
    self.x.process(xp)
  File "/home/user/miniconda/envs/py36/lib/python3.6/site-packages/fastai/data_block.py", line 84, in process
    for p in self.processor: p.process(self)
  File "/home/user/miniconda/envs/py36/lib/python3.6/site-packages/fastai/text/data.py", line 296, in process
    for i in progress_bar(range(0,len(ds),self.chunksize), leave=False):
  File "/home/user/miniconda/envs/py36/lib/python3.6/site-packages/fastprogress/fastprogress.py", line 75, in __iter__
    if self.auto_update: self.update(i+1)
  File "/home/user/miniconda/envs/py36/lib/python3.6/site-packages/fastprogress/fastprogress.py", line 92, in update
    self.update_bar(val)
  File "/home/user/miniconda/envs/py36/lib/python3.6/site-packages/fastprogress/fastprogress.py", line 104, in update_bar
    else: self.on_update(val, f'{100 * val/self.total:.2f}% [{val}/{self.total} {elapsed_t}<{remaining_t}{end}]')
  File "/home/user/miniconda/envs/py36/lib/python3.6/site-packages/fastprogress/fastprogress.py", line 274, in on_update
    if printing(): WRITER_FN(to_write, end = '\r')
UnicodeEncodeError: 'ascii' codec can't encode characters in position 3-35: ordinal not in range(128)

or with the tests (any!):

self = <encodings.ascii.IncrementalDecoder object at 0x7fdcdf958e10>
input = b' \n = Valkyria Chronicles III = \n \n Senj\xc5\x8d no Valkyria 3 : <unk> Chronicles ( Japanese : \xe6\x88\xa6\xe5\xa...n force invading the Empire just following the two nations \' cease @-@ fire would certainly wreck their newfound peac'
final = False

    def decode(self, input, final=False):
>       return codecs.ascii_decode(input, self.errors)[0]
E       UnicodeDecodeError: 'ascii' codec can't decode byte 0xc5 in position 39: ordinal not in range(128)

/home/user/miniconda/envs/py36/lib/python3.6/encodings/ascii.py:26: UnicodeDecodeError

Does anyone have a clue here? Thanks a lot in advance!
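Both errors above name the 'ascii' codec, so Python appears to be falling back to ASCII as its default encoding. That often happens on Python 3.6 when the locale is unset or set to C/POSIX (for example in a minimal container), but the thread does not confirm this for the reporter's setup, so treat it as an assumption. The sketch below only inspects the current process; the helper names `default_encodings` and `can_decode` are made up for illustration and are not part of MultiFiT or fastai.

```python
import locale
import sys

# Bytes copied from the pytest output above: the UTF-8 encoding of
# "Senj" followed by U+014D (o with macron). 0xc5 is the byte that failed.
SAMPLE = b"Senj\xc5\x8d"


def default_encodings():
    """Return the encodings this Python process uses by default."""
    return {
        # Encoding used by print(); this is where the progress bar failed.
        "stdout": getattr(sys.stdout, "encoding", None),
        # Encoding used by open() when no encoding= argument is given.
        "preferred": locale.getpreferredencoding(False),
    }


def can_decode(data, encoding):
    """Return True if the bytes decode cleanly with the given encoding."""
    try:
        data.decode(encoding)
        return True
    except UnicodeDecodeError:
        return False


if __name__ == "__main__":
    for name, value in default_encodings().items():
        print(f"{name}: {value}")
    print("decodes as ascii:", can_decode(SAMPLE, "ascii"))
    print("decodes as utf-8:", can_decode(SAMPLE, "utf-8"))
```

If either reported encoding is an ASCII variant (for example "ANSI_X3.4-1968"), exporting a UTF-8 locale such as LC_ALL=C.UTF-8 before starting Python is a common workaround. PYTHONIOENCODING=utf-8 affects only stdin/stdout/stderr, so it would address the progress bar error but not files opened without an explicit encoding.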

PiotrCzapla commented 4 years ago

Hi @repodiac, it seems you are using the older scripts: python -m ulmfit lm doesn't look like the new framework. You might want to try running the current framework and see whether the issue is solved there. I think I've seen "'ascii' codec can't encode characters" before, when loading tokenized datasets. The solution was to remove the cache files made by an older fastai version and recreate them. The new framework does this automatically.

Let me know if that helped.
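For anyone staying on the older scripts, the cache cleanup suggested above could be sketched as follows. The thread does not say which files the older fastai wrote, so the glob patterns here are placeholders and the helper names are hypothetical; check what your cache directory actually contains before deleting anything. Note that in the log above the model directory sits inside the cache directory, which is why this sketch selects files by pattern and defaults to a dry run instead of removing the whole directory.

```python
from pathlib import Path


def find_cache_files(cache_dir, patterns=("*.pkl", "*.npy")):
    """List files under cache_dir matching any pattern.

    The default patterns are guesses, not the documented fastai cache layout.
    """
    cache_dir = Path(cache_dir)
    if not cache_dir.is_dir():
        return []
    found = set()
    for pattern in patterns:
        found.update(p for p in cache_dir.rglob(pattern) if p.is_file())
    return sorted(found)


def remove_cache_files(paths, dry_run=True):
    """Delete the given files unless dry_run is set; return the affected paths."""
    affected = []
    for path in paths:
        if not dry_run:
            path.unlink()
        affected.append(path)
    return affected


if __name__ == "__main__":
    # Path taken from the "Cache dir:" line of the reporter's log.
    candidates = find_cache_files("data/wiki/en-100/models/f60k")
    for path in remove_cache_files(candidates, dry_run=True):
        print("would remove:", path)
```

Once the listed files look right, calling `remove_cache_files(candidates, dry_run=False)` deletes them, and the next training run should rebuild the tokenized data.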