laekov / fastmoe

A fast MoE impl for PyTorch
https://fastmoe.ai
Apache License 2.0
1.52k stars 184 forks

Only 204 unique tokens (vocabulary size) in enwik8 (transformer-XL example) #163

Open chenwydj opened 1 year ago

chenwydj commented 1 year ago

Describe the bug
When running the transformer-XL example on enwik8, the log shows only 204 unique tokens (vocabulary size) in the enwik8 training set.

To Reproduce
Steps to reproduce the behavior: bash ./scripts/run_enwik8_base.sh train

Expected behavior
I am not sure what the vocabulary size (number of unique tokens) should be for enwik8, but I expect it to be much larger than 204.

Logs
Run training...
Experiment dir : LM-TFM-enwik8/20230706-192048
Producing dataset enwik8...
building vocab with min_freq=0, max_size=None
final vocab size 204 from 204 unique tokens

> /home/username/fastmoe/examples/transformer-xl/train.py(194)<module>()
-> ntokens = len(corpus.vocab)
(Pdb) len(corpus.vocab)
204
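The same check can be reproduced outside of pdb with a short script like the sketch below. It assumes the example's get_lm_corpus helper in data_utils.py and the ../data/enwik8/ data path used by the run script (both taken from the upstream Transformer-XL example layout); if the upstream caching behavior is unchanged, an existing cache.pt will be reused.

```python
# Sketch: inspect the enwik8 vocabulary without launching training.
# Assumes it is run from examples/transformer-xl after the data has been prepared.
from data_utils import get_lm_corpus  # import path is an assumption

corpus = get_lm_corpus('../data/enwik8/', 'enwik8')
print(len(corpus.vocab))                       # the reported run prints 204
print(list(corpus.vocab.sym2idx.items())[:5])  # first few (symbol, index) pairs
```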

Platform

Additional context

laekov commented 1 year ago

@xptree any ideas on this?

chenwydj commented 1 year ago

I printed corpus.vocab.sym2idx, and it looks wrong: the keys should be words, but they are numbers instead.
OrderedDict([('32', 0), ('101', 1), ('116', 2), ('97', 3), ('105', 4), ('111', 5), ('110', 6), ('114', 7), ('115', 8), ('108', 9), ('104', 10), ('100', 11), ('99', 12), ('117', 13), ('93', 14), ('91', 15), ('109', 16), ('112', 17), ('103', 18), ('102', 19), ('121', 20), ('98', 21), ('39', 22), ('119', 23), ('46', 24), ('44', 25), ('118', 26), ('59', 27), ('38', 28), ('124', 29), ('47', 30), ('49', 31), ('107', 32), ('61', 33), ('48', 34), ('67', 35), ('65', 36), ('58', 37), ('45', 38), ('84', 39), ('83', 40), ('60', 41), ('62', 42), ('50', 43), ('113', 44), ('73', 45), ('57', 46), ('42', 47), ('120', 48), ('41', 49), ('40', 50), ('66', 51), ('77', 52), ('80', 53), ('69', 54), ('68', 55), ('53', 56), ('51', 57), ('72', 58), ('70', 59), ('56', 60), ('52', 61), ('71', 62), ('82', 63), ('54', 64), ('76', 65), ('55', 66), ('78', 67), ('87', 68), ('122', 69), ('125', 70), ('123', 71), ('79', 72), ('106', 73), ('85', 74), ('74', 75), ('75', 76), ('208', 77), ('95', 78), ('195', 79), ('35', 80), ('86', 81), ('215', 82), ('90', 83), ('34', 84), ('89', 85), ('209', 86), ('128', 87), ('224', 88), ('184', 89), ('131', 90), ('92', 91), ('227', 92), ('37', 93), ('33', 94), ('176', 95), ('169', 96), ('206', 97), ('226', 98), ('130', 99), ('63', 100), ('88', 101), ('81', 102), ('161', 103), ('153', 104), ('43', 105), ('129', 106), ('188', 107), ('179', 108), ('216', 109), ('164', 110), ('181', 111), ('189', 112), ('148', 113), ('190', 114), ('173', 115), ('187', 116), ('186', 117), ('229', 118), ('225', 119), ('167', 120), ('217', 121), ('177', 122), ('178', 123), ('168', 124), ('149', 125), ('185', 126), ('197', 127), ('144', 128), ('147', 129), ('196', 130), ('207', 131), ('194', 132), ('180', 133), ('156', 134), ('132', 135), ('170', 136), ('166', 137), ('136', 138), ('182', 139), ('191', 140), ('9', 141), ('230', 142), ('141', 143), ('160', 144), ('175', 145), ('36', 146), ('152', 147), ('140', 148), ('165', 149), ('145', 150), ('94', 151), ('133', 152), ('163', 153), ('183', 154), ('171', 155), ('157', 156), ('137', 157), ('174', 158), ('134', 159), ('135', 160), ('236', 161), ('151', 162), ('231', 163), ('155', 164), ('201', 165), ('158', 166), ('138', 167), ('143', 168), ('150', 169), ('162', 170), ('159', 171), ('139', 172), ('172', 173), ('154', 174), ('126', 175), ('232', 176), ('235', 177), ('146', 178), ('233', 179), ('228', 180), ('202', 181), ('203', 182), ('142', 183), ('214', 184), ('237', 185), ('204', 186), ('219', 187), ('234', 188), ('213', 189), ('96', 190), ('218', 191), ('199', 192), ('64', 193), ('210', 194), ('239', 195), ('198', 196), ('211', 197), ('205', 198), ('212', 199), ('240', 200), ('222', 201), ('220', 202), ('200', 203)])
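For what it's worth, the keys decode cleanly as decimal byte values, which can be checked with a small sketch like the one below (run in the same pdb session or with any loaded corpus object; values of 128 and above are individual UTF-8 lead/continuation bytes, so chr() only shows their Latin-1 rendering):

```python
# Sketch: interpret the vocab symbols as decimal byte values.
# '32' -> ' ', '101' -> 'e', '116' -> 't', and so on.
decoded = {sym: chr(int(sym)) for sym in corpus.vocab.sym2idx}
print([decoded[s] for s in list(corpus.vocab.sym2idx)[:10]])
# [' ', 'e', 't', 'a', 'i', 'o', 'n', 'r', 's', 'l']
```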

chenwydj commented 1 year ago

@laekov @xptree The problem is that, for enwik8, vocabulary.py should use train.txt.raw instead of train.txt.
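If it helps to confirm the difference, here is a quick standalone sketch comparing the symbol sets of the two files. The paths are assumptions based on the example's data layout after running getdata.sh; train.txt appears to hold whitespace-separated decimal byte values (matching the sym2idx dump above), while train.txt.raw holds the raw text.

```python
# Sketch: count the unique symbols each candidate file would contribute to the vocab.
def unique_symbols(path, whitespace_delimited):
    with open(path, encoding='utf-8', errors='replace') as f:
        text = f.read()
    # train.txt: whitespace-separated decimal byte values; train.txt.raw: raw characters
    return set(text.split()) if whitespace_delimited else set(text)

print(len(unique_symbols('../data/enwik8/train.txt', True)))       # 204 in the reported run
print(len(unique_symbols('../data/enwik8/train.txt.raw', False)))  # unique raw characters
```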