Hi @charlesfrye, I'm not sure if this repo is accepting PRs, but I spotted a bug in the data/emnist.py file. It concerns the _sample_to_balance function and its use of np.bincount here.
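Because you offset the labels by NUM_SPECIAL_TOKENS here and here before calling the subsampling function, the smallest label is no longer 0. np.bincount returns a count for every integer from 0 to the largest label, so the unused values from 0 to y_min_element-1 inclusive appear as leading zeros in its output. These zeros bias the mean of the counts towards zero, which could lead to a smaller dataset than intended.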
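The effect can be reproduced with a minimal sketch. The label array and the offset value of 4 below are made up for illustration and are not taken from the repo, and trimming the leading entries is only one possible way to avoid the bias, not necessarily what the PR does.

```python
import numpy as np

# Hypothetical offset for illustration; the real value lives in the repo.
NUM_SPECIAL_TOKENS = 4

# Toy labels: three classes with 100 samples each, shifted by the offset,
# so the labels are 4, 5 and 6 instead of 0, 1 and 2.
y = np.repeat(np.arange(3), 100) + NUM_SPECIAL_TOKENS

# np.bincount returns one count per integer in [0, y.max()],
# so the unused labels 0..3 each contribute a count of 0.
counts = np.bincount(y)
print(counts)  # [  0   0   0   0 100 100 100]

# The leading zeros drag the mean down: 300 / 7 instead of 300 / 3.
biased_mean = counts.mean()

# One possible workaround (an assumption, not the repo's fix):
# ignore the entries below the smallest label actually present.
trimmed_mean = counts[y.min():].mean()

print(biased_mean, trimmed_mean)
```

With these toy labels the mean count drops from 100 to roughly 43, so any sample size derived from that mean would be less than half of what the class counts suggest.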
I checked that this issue is not a duplicate.
Example behaviour of np.bincount: [example not preserved]
I have proposed a solution to the described bug in this PR.