n-waves / multifit

The code to reproduce results from paper "MultiFiT: Efficient Multi-lingual Language Model Fine-tuning" https://arxiv.org/abs/1909.04761
MIT License

Specifying a validation set #66

Open FOX111 opened 4 years ago

FOX111 commented 4 years ago

I'm training a language model similar to what has been shown here https://github.com/n-waves/multifit/blob/master/notebooks/CLS-JA.ipynb

While running `cls_dataset.load_clas_databunch(bs=exp.finetune_lm.bs).show_batch()` I'm getting this output:

```
Running tokenization: 'lm-notst' ...
Validation set not found using 10% of trn
Data lm-notst, trn: 26925, val: 2991
Size of vocabulary: 15000
First 20 words in vocab: ['xxunk', 'xxpad', 'xxbos', 'xxfld', 'xxmaj', 'xxup', 'xxrep', 'xxwrep', '', '▁', '▁,', '▁.', '▁в', 'а', 'и', 'е', '▁и', 'й', '▁на', 'х']
Running tokenization: 'cls' ...
Data cls, trn: 26925, val: 2991
Running tokenization: 'tst' ...
/home/explorer/miniconda3/envs/fast/lib/python3.6/site-packages/fastai/data_block.py:537: UserWarning: You are labelling your items with CategoryList. Your valid set contained the following unknown labels, the corresponding items have been discarded.
201, 119, 192, 162, 168...
  if getattr(ds, 'warn', False): warn(ds.warn)
Data tst, trn: 2991, val: 7448
```

I assume the problem is that the labels in the automatically inferred validation set don't match the training labels. Is there a way to explicitly pass a validation set?

Qe42 commented 4 years ago

Name your files train.csv, dev.csv, test.csv and unsup.csv, or read the from_df options.
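
For the second option, a minimal sketch using the generic fastai v1 `from_df` API (not the multifit-specific `load_clas_databunch` helper); the file names and column names below are placeholders you would adapt to your own data:

```python
import pandas as pd
from fastai.text import TextClasDataBunch

# Hypothetical files/columns: adjust to your dataset layout.
train_df = pd.read_csv('train.csv')  # e.g. columns: 'label', 'text'
valid_df = pd.read_csv('dev.csv')    # explicit validation split

# Build the classification databunch from explicit train/validation frames,
# so no validation set has to be carved out of the training data automatically.
data_clas = TextClasDataBunch.from_df(
    path='.',
    train_df=train_df,
    valid_df=valid_df,
    text_cols='text',
    label_cols='label',
    bs=64,
)
data_clas.show_batch()
```

With an explicit valid_df whose labels all occur in train_df, the "unknown labels ... discarded" warning from CategoryList should not be triggered.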