[bug]: np.bincount prepends zeros in data/emnist.py

I checked this issue has not been duplicated.

Hi @charlesfrye, I'm not sure whether this repo is accepting PRs, but I spotted a bug in data/emnist.py. It concerns the _sample_to_balance function and its use of np.bincount here.

Because you offset the labels by NUM_SPECIAL_TOKENS here and here before calling the subsampling function, np.bincount returns zero counts for the unused labels 0 through y.min() - 1 inclusive. Those leading zeros are included when the mean count per class is taken, which biases the mean towards zero. This could lead to a smaller dataset than intended after subsampling.

Example behaviour of np.bincount:

>>> import numpy as np
>>> y = np.array([0, 1, 0, 2, 1])
>>> np.bincount(y)
array([2, 2, 1])
>>> NUM_SPECIAL_TOKENS = 4
>>> np.bincount(y + NUM_SPECIAL_TOKENS)
array([0, 0, 0, 0, 2, 2, 1])

I have proposed a solution to the described bug in this PR.
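
To make the effect on the mean concrete, here is a minimal sketch on the same toy labels as above. It is an illustration only: it does not reproduce the repo's _sample_to_balance code, and the two workarounds shown are assumptions about how the empty bins could be avoided, not necessarily what the PR does.

```python
import numpy as np

NUM_SPECIAL_TOKENS = 4  # same value as in the example above

# Toy labels, already offset the way the issue describes.
y = np.array([0, 1, 0, 2, 1]) + NUM_SPECIAL_TOKENS

# Mean over every bin, including the four empty bins for labels 0..3.
biased_mean = np.bincount(y).mean()

# Possible workaround 1: count only the labels that actually occur.
_, counts = np.unique(y, return_counts=True)
mean_present = counts.mean()

# Possible workaround 2: shift labels so the smallest one lands in bin 0.
# Note: labels missing *between* y.min() and y.max() would still count as 0 here.
mean_shifted = np.bincount(y - y.min()).mean()

print(biased_mean, mean_present, mean_shifted)
```

On this toy input the biased mean is 5/7, while both workarounds give 5/3. If the mean is truncated to an integer, that is the difference between 0 and 1 samples per class.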

the-full-stack / fsdl-text-recognizer-2022
