I set "--preprocess-for-deep-nets True",but I just get a vocabulary with 14117 tokens,What should I do?
{'automatic_spell_check': True,
'group_gt_anno': True,
'min_word_freq': 5,
'n_train_examples': None,
'preprocess_for_deep_nets': True,
'random_seed': 2021,
'raw_artemis_data_csv': 'D:/ArtEmis/artemis-master/DataSet/ArtEmis/artemis_official_data/official_data/artemis_dataset_release_v0.csv',
'save_out_dir': 'step1_processed_data',
'split_loads': [0.85, 0.05, 0.1],
'too_high_repetition': 41,
'too_long_utter_prc': 95,
'too_short_len': 5}
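For reference, the printed dict above mirrors the script's command-line flags. Below is a minimal argparse sketch of how such flags are typically parsed (this is my own illustration, not the ArtEmis source; the `str2bool` helper and the defaults are assumptions). It also shows why boolean flags like `--preprocess-for-deep-nets` are usually parsed from strings explicitly, since `type=bool` would treat any non-empty string, even "False", as True:

```python
# Hypothetical sketch of the CLI, with flag names taken from the dict above.
import argparse

def str2bool(v):
    # argparse gotcha: type=bool would turn the string "False" into True,
    # so boolean flags are commonly converted explicitly like this.
    return str(v).lower() in ("yes", "true", "t", "1")

parser = argparse.ArgumentParser()
parser.add_argument("--preprocess-for-deep-nets", type=str2bool, default=False)
parser.add_argument("--min-word-freq", type=int, default=5)
parser.add_argument("--too-short-len", type=int, default=5)
parser.add_argument("--too-long-utter-prc", type=int, default=95)
args = parser.parse_args()
print(vars(args))  # e.g. {'preprocess_for_deep_nets': True, ...}
```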
454684 annotations were loaded
Using a 0.85,0.05,0.1 for train/val/test purposes
SymSpell spell-checker loaded: True
Loading glove word embeddings.
Done. 400000 words loaded.
Updating Glove vocabulary with valid ArtEmis words that are missing from it.
3057 annotations will be dropped as they contain less than 5 tokens
Too-long token length at 95-percentile is 30.0. 22196 annotations will be dropped
Using a vocabulary with 14117 tokens
n-utterances kept: 429431
vocab size: 14117
tokens not in Glove/Manual vocabulary: 1148
Done. Check saved results in provided save-out-dir: step1_processed_data
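For context on where the 14117 figure comes from: the vocabulary is thresholded by `min_word_freq` (5 here), so its size depends on how many annotations survive the too-short/too-long filters logged above. A minimal sketch of this kind of frequency-thresholded vocabulary building (my own illustration, not the ArtEmis code; the special tokens are assumed):

```python
# Sketch: keep only tokens appearing at least min_word_freq times;
# everything rarer falls back to <unk>. The final size therefore varies
# with the data that remains after the length/repetition filters.
from collections import Counter

def build_vocab(tokenized_utterances, min_word_freq=5):
    counts = Counter(tok for utt in tokenized_utterances for tok in utt)
    special = ["<pad>", "<sos>", "<eos>", "<unk>"]  # assumed special tokens
    kept = [w for w, c in counts.items() if c >= min_word_freq]
    return {w: i for i, w in enumerate(special + sorted(kept))}

# Example: with min_word_freq=2, "rare" is dropped from the vocabulary.
vocab = build_vocab([["a", "calm", "scene"], ["a", "calm", "rare"]],
                    min_word_freq=2)
print(len(vocab), "rare" in vocab)  # -> 6 False
```

Under this scheme a different corpus snapshot or different filter settings gives a different vocabulary size, which may explain why a locally computed size differs from a published one.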
I set "--preprocess-for-deep-nets True",but I just get a vocabulary with 14117 tokens,What should I do? {'automatic_spell_check': True, 'group_gt_anno': True, 'min_word_freq': 5, 'n_train_examples': None, 'preprocess_for_deep_nets': True, 'random_seed': 2021, 'raw_artemis_data_csv': 'D:/ArtEmis/artemis-master/DataSet/ArtEmis/artemis_official_data/official_data/artemis_dataset_release_v0.csv', 'save_out_dir': 'step1_processed_data', 'split_loads': [0.85, 0.05, 0.1], 'too_high_repetition': 41, 'too_long_utter_prc': 95, 'too_short_len': 5} 454684 annotations were loaded Using a 0.85,0.05,0.1 for train/val/test purposes SymSpell spell-checker loaded: True Loading glove word embeddings. Done. 400000 words loaded. Updating Glove vocabulary with valid ArtEmis words that are missing from it. 3057 annotations will be dropped as they contain less than 5 tokens Too-long token length at 95-percentile is 30.0. 22196 annotations will be dropped Using a vocabulary with 14117 tokens n-utterances kept: 429431 vocab size: 14117 tokens not in Glove/Manual vocabulary: 1148 Done. Check saved results in provided save-out-dir: step1_processed_data