Open mariovas3 opened 7 months ago
Account for y labels being offset by NUM_SPECIAL_TOKENS when calling np.bincount in emnist balance subsampling.
The offsetting is found here: https://github.com/the-full-stack/fsdl-text-recognizer-2022/blob/ac59bfe43ea3e1ef1e03e4fb3b1bcf715a973063/text_recognizer/data/emnist.py#L104
and here: https://github.com/the-full-stack/fsdl-text-recognizer-2022/blob/ac59bfe43ea3e1ef1e03e4fb3b1bcf715a973063/text_recognizer/data/emnist.py#L106
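np.bincount returns a count for every integer from 0 up to the largest label, so once the labels are offset, the result starts with zero counts for the unused values 0 through y_min_element-1. If these leading zeros are not excluded, they pull the mean class count down, and the balanced dataset ends up with fewer samples than intended.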
Example bug:
>>> import numpy as np
>>> y = np.array([0, 1, 0, 2, 1])
>>> np.bincount(y)
array([2, 2, 1])
>>> NUM_SPECIAL_TOKENS = 4
>>> np.bincount(y + NUM_SPECIAL_TOKENS)
array([0, 0, 0, 0, 2, 2, 1])
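One possible way to control for the offset, sketched here under assumptions rather than taken from the repository: drop the first NUM_SPECIAL_TOKENS bins before averaging. The helper name `mean_class_count` is hypothetical, and NUM_SPECIAL_TOKENS = 4 simply mirrors the example above.

```python
import numpy as np

NUM_SPECIAL_TOKENS = 4  # assumed value, mirroring the example above


def mean_class_count(y_offset, num_special_tokens=NUM_SPECIAL_TOKENS):
    """Mean samples per class, skipping the bins reserved for special tokens.

    Hypothetical helper: `y_offset` holds labels already shifted by
    `num_special_tokens`, so the first `num_special_tokens` bins are dropped.
    """
    counts = np.bincount(y_offset)[num_special_tokens:]
    return counts.mean()


y = np.array([0, 1, 0, 2, 1]) + NUM_SPECIAL_TOKENS

biased = np.bincount(y).mean()   # averages over 7 bins, 4 of them empty
corrected = mean_class_count(y)  # averages over the 3 real classes only
print(biased, corrected)
```

Slicing the counts, rather than subtracting the offset from the labels, also behaves sensibly if a label below NUM_SPECIAL_TOKENS ever appears, since np.bincount rejects negative inputs.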