the-full-stack / fsdl-text-recognizer-2022

Source of the FSDL 2022 labs, which are at https://github.com/full-stack-deep-learning/fsdl-text-recognizer-2022-labs
https://fullstackdeeplearning.com/course
MIT License

[fix]: fixes balanced subsampling bug in data/emnist.py #84

Open mariovas3 opened 7 months ago

mariovas3 commented 7 months ago

Account for the y labels being offset by NUM_SPECIAL_TOKENS when calling np.bincount in the EMNIST balanced subsampling.

The offsetting is found here: https://github.com/the-full-stack/fsdl-text-recognizer-2022/blob/ac59bfe43ea3e1ef1e03e4fb3b1bcf715a973063/text_recognizer/data/emnist.py#L104

and here: https://github.com/the-full-stack/fsdl-text-recognizer-2022/blob/ac59bfe43ea3e1ef1e03e4fb3b1bcf715a973063/text_recognizer/data/emnist.py#L106

np.bincount returns a count for every integer from 0 up to the largest label, so after the offset the values 0 through min(y) - 1, which never occur, show up as leading zeros. If not controlled, these zeros bias the mean count downward, which results in fewer samples in the balanced dataset.

Example bug:

>>> import numpy as np
>>> y = np.array([0, 1, 0, 2, 1])
>>> np.bincount(y)
array([2, 2, 1])
>>> NUM_SPECIAL_TOKENS = 4
>>> np.bincount(y + NUM_SPECIAL_TOKENS)
array([0, 0, 0, 0, 2, 2, 1])
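
To make the effect on the mean concrete, here is a minimal sketch that continues the example above. Slicing off the first NUM_SPECIAL_TOKENS bins is one possible way to control for the offset and is not necessarily the exact change made in this PR; the variable names are made up for illustration.

```python
import numpy as np

NUM_SPECIAL_TOKENS = 4  # same offset as in the example above

# Toy labels, shifted the same way the dataset labels are shifted.
y = np.array([0, 1, 0, 2, 1]) + NUM_SPECIAL_TOKENS

# Counting over all bins includes the empty bins 0..NUM_SPECIAL_TOKENS-1,
# which drags the mean count down.
counts_with_padding = np.bincount(y)
biased_mean = counts_with_padding.mean()  # 5 samples spread over 7 bins

# Assumed fix for illustration: drop the bins reserved for special tokens
# so that only real classes contribute to the mean.
counts = np.bincount(y)[NUM_SPECIAL_TOKENS:]
unbiased_mean = counts.mean()  # 5 samples spread over 3 bins

print(biased_mean, unbiased_mean)
```

With the padding bins included, the mean count is 5/7, which truncates to 0 if cast to an integer; with them dropped it is 5/3, which truncates to 1. Subtracting NUM_SPECIAL_TOKENS from the labels before calling np.bincount would give the same counts.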