While running `cls_dataset.load_clas_databunch(bs=exp.finetune_lm.bs).show_batch()` I'm getting this output:
```
Running tokenization: 'lm-notst' ...
Validation set not found using 10% of trn
Data lm-notst, trn: 26925, val: 2991
Size of vocabulary: 15000
First 20 words in vocab: ['xxunk', 'xxpad', 'xxbos', 'xxfld', 'xxmaj', 'xxup', 'xxrep', 'xxwrep', '', '▁', '▁,', '▁.', '▁в', 'а', 'и', 'е', '▁и', 'й', '▁на', 'х']
Running tokenization: 'cls' ...
Data cls, trn: 26925, val: 2991
Running tokenization: 'tst' ...
/home/explorer/miniconda3/envs/fast/lib/python3.6/site-packages/fastai/data_block.py:537: UserWarning: You are labelling your items with CategoryList.
Your valid set contained the following unknown labels, the corresponding items have been discarded.
201, 119, 192, 162, 168...
  if getattr(ds, 'warn', False): warn(ds.warn)
Data tst, trn: 2991, val: 7448
```
I assume the problem is that the automatically inferred validation set contains labels that never appear in the training set, so those items are discarded. Is there a way to explicitly pass a validation set?
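For context, my reading of the warning is that fastai drops validation items whose labels were never seen during training. A minimal, framework-free sketch of that behaviour (the variable names and toy data are mine, not fastai's internals):

```python
# Illustrative only: mimics how validation items with unknown labels
# get discarded. The real logic lives in fastai's CategoryList.
train_labels = ["0", "1", "2", "1", "0"]
valid_items = [("doc a", "1"), ("doc b", "2"), ("doc c", "201")]  # "201" unseen in train

known = set(train_labels)
kept = [(text, label) for text, label in valid_items if label in known]
dropped = [label for _, label in valid_items if label not in known]

print(f"kept {len(kept)} items, dropped labels: {dropped}")
```

If that reading is right, the counts in my log would shrink exactly by the number of items carrying those unseen labels.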
I'm training a language model similar to the one shown here: https://github.com/n-waves/multifit/blob/master/notebooks/CLS-JA.ipynb
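To frame what I'm after: instead of the automatic "10% of trn" split, I'd like to control the split myself. A generic sketch of a deterministic split is below; nothing in it is multifit API, and I don't know whether `load_clas_databunch` can accept such a split directly, which is really my question:

```python
import random

def explicit_split(items, valid_fraction=0.1, seed=42):
    """Deterministically split items into (train, valid) lists.
    Purely illustrative -- not part of multifit or fastai."""
    rng = random.Random(seed)
    shuffled = items[:]          # copy so the caller's list is untouched
    rng.shuffle(shuffled)
    n_valid = int(len(shuffled) * valid_fraction)
    return shuffled[n_valid:], shuffled[:n_valid]

train, valid = explicit_split(list(range(100)))
print(len(train), len(valid))  # 90 10
```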