the-full-stack / fsdl-text-recognizer-2022

Source of the FSDL 2022 labs, which are at https://github.com/full-stack-deep-learning/fsdl-text-recognizer-2022-labs
https://fullstackdeeplearning.com/course
MIT License

[bug]: np.bincount prepends zeros in data/emnist.py #85

Open mariovas3 opened 5 months ago

mariovas3 commented 5 months ago

I checked that this issue is not a duplicate.

Hi @charlesfrye, I'm not sure whether this repo is accepting PRs, but I spotted a bug in data/emnist.py. It concerns the _sample_to_balance function and its use of np.bincount.

Because you offset the labels by NUM_SPECIAL_TOKENS before calling the subsampling function, np.bincount returns zero counts for the absent labels 0 through min(y) - 1 inclusive. Those leading zeros are included when the counts are averaged, which biases the mean towards zero. This could lead to a smaller dataset.

Example behaviour of np.bincount:

>>> import numpy as np
>>> y = np.array([0, 1, 0, 2, 1])
>>> np.bincount(y)
array([2, 2, 1])
>>> NUM_SPECIAL_TOKENS = 4
>>> np.bincount(y + NUM_SPECIAL_TOKENS)
array([0, 0, 0, 0, 2, 2, 1])
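
To make the effect on the mean concrete, here is a small sketch using the same toy labels. It is not the code from data/emnist.py, and counting with np.unique(..., return_counts=True) is only one possible way to ignore the empty bins; it is an assumption for illustration, not necessarily what the PR does.

```python
import numpy as np

NUM_SPECIAL_TOKENS = 4  # same illustrative value as in the session above

# Toy labels after the offset has been applied.
y = np.array([0, 1, 0, 2, 1]) + NUM_SPECIAL_TOKENS

# Mean over every bin, including the four empty bins for labels 0..3.
mean_with_padding = np.bincount(y).mean()

# Mean over only the labels that actually occur in y.
_, counts = np.unique(y, return_counts=True)
mean_present_only = counts.mean()

print(mean_with_padding)  # 5 samples spread over 7 bins
print(mean_present_only)  # 5 samples spread over 3 bins
```

With these toy labels the padded mean is 5/7 and the mean over present labels is 5/3, so truncating to an integer would give 0 in the first case and 1 in the second.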

I have proposed a solution to the described bug in this PR.