blazejdolicki opened 4 years ago
My package versions differ slightly from those in requirements.txt; maybe sacremoses is related:

```
fire          0.3.0
sacremoses    0.0.38
sentencepiece 0.1.85
fastai        1.0.47
```
What I did: I used the `pretrain-lm` branch because it has clear instructions on how to pretrain the LM (#57).

```bash
bash prepare_wiki.sh de
python -W ignore -m multifit new multifit_paper_version replace_ --name my_lm - train_ --pretrain-dataset data/wiki/de-100
```
```
Setting LM weights seed seed to 0
Running tokenization: 'lm-notst' ...
Wiki text was split to 1 articles
Wiki text was split to 1 articles
Wiki text was split to 1 articles
Traceback (most recent call last):
  File "/home/ubuntu/anaconda3/lib/python3.6/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/home/ubuntu/anaconda3/lib/python3.6/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/home/ubuntu/multifit/multifit/__main__.py", line 16, in <module>
    fire.Fire(Experiment())
  File "/home/ubuntu/anaconda3/lib/python3.6/site-packages/fire/core.py", line 138, in Fire
    component_trace = _Fire(component, args, parsed_flag_args, context, name)
  File "/home/ubuntu/anaconda3/lib/python3.6/site-packages/fire/core.py", line 468, in _Fire
    target=component.__name__)
  File "/home/ubuntu/anaconda3/lib/python3.6/site-packages/fire/core.py", line 672, in _CallAndUpdateTrace
    component = fn(*varargs, **kwargs)
  File "/home/ubuntu/multifit/multifit/training.py", line 587, in train_
    self.pretrain_lm.train_(pretrain_dataset)
  File "/home/ubuntu/multifit/multifit/training.py", line 275, in train_
    learn = self.get_learner(data_lm=dataset.load_lm_databunch(bs=self.bs, bptt=self.bptt, limit=self.limit))
  File "/home/ubuntu/multifit/multifit/datasets/dataset.py", line 208, in load_lm_databunch
    limit=limit)
  File "/home/ubuntu/multifit/multifit/datasets/dataset.py", line 258, in load_n_cache_databunch
    databunch = self.databunch_from_df(bunch_class, train_df, valid_df, **args)
  File "/home/ubuntu/multifit/multifit/datasets/dataset.py", line 271, in databunch_from_df
    **args)
  File "/home/ubuntu/multifit/fastai_contrib/text_data.py", line 147, in make_data_bunch_from_df
    TextList.from_df(valid_df, path, cols=text_cols, processor=processor))
  File "/home/ubuntu/anaconda3/lib/python3.6/site-packages/fastai/data_block.py", line 434, in __init__
    if not self.train.ignore_empty and len(self.train.items) == 0:
TypeError: len() of unsized object
```
From initial debugging, `train.items` is an ndarray with shape `()`. When I print it, it returns articles in German. I suppose this line in the log points at the problem:

```
Wiki text was split to 1 articles
```

I reckon the wiki text should be split into more than 1 article, so maybe something goes wrong in `read_wiki_articles()` in `dataset.py`... This is my educated guess, but I don't know where to go from here.
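For what it's worth, the TypeError is consistent with that guess: if the whole wiki text ends up as one string instead of a list of articles, NumPy produces a 0-d array (shape `()`), which is "unsized", so the `len()` call in fastai's `data_block.py` fails. A minimal sketch in plain NumPy (independent of multifit, strings are made up):

```python
import numpy as np

# A single string collapses to a 0-d array -- shape () -- which has no length
items = np.array("Ein einziger langer Wikipedia-Text ...")
print(items.shape)  # ()

try:
    len(items)  # same call fastai makes on train.items
except TypeError as e:
    print(e)  # len() of unsized object

# A proper split into multiple articles gives a sized 1-d array instead
articles = np.array(["Artikel eins ...", "Artikel zwei ..."])
print(len(articles))  # 2
```

So if `read_wiki_articles()` returned more than one article, `train.items` would be 1-d and `len()` would work.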