karpathy / makemore

An autoregressive character-level language model for making more things
MIT License

remove duplicate words from the dataset #6

Open iamdoron opened 1 year ago

iamdoron commented 1 year ago

hi

Thanks for your videos, I just finished watching the first part.

When I tried to intersect the test and train datasets, I noticed that some names repeat in the dataset:

len(words) - len(set(words))  # 2539

Because a repeated name can land in both the train and test sets, this might bias the test results and add a small bias during training.
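A minimal sketch of one possible fix, assuming `words` is a plain Python list of names: drop the repeats before splitting, so the same name cannot appear in both sets. The helper names `dedupe` and `split_train_test`, the 10% test fraction, the seed, and the sample names are all hypothetical and not taken from the repo.

```python
import random

def dedupe(words):
    """Drop repeated entries, keeping the first occurrence of each word."""
    # dict preserves insertion order, so the original ordering is kept
    return list(dict.fromkeys(words))

def split_train_test(words, test_fraction=0.1, seed=42):
    """Shuffle the unique words and split them into (train, test)."""
    unique = dedupe(words)
    rng = random.Random(seed)  # local RNG so the split is reproducible
    rng.shuffle(unique)
    n_test = int(len(unique) * test_fraction)
    return unique[n_test:], unique[:n_test]

# Illustrative sample data; the real list would come from the names file
words = ["emma", "olivia", "ava", "emma", "mia", "ava",
         "noah", "liam", "zoe", "eli", "ivy", "max"]

train, test = split_train_test(words)
print(len(words) - len(set(words)))  # number of repeated entries
print(set(train) & set(test))        # overlap between the two splits
```

Whether removing duplicates is actually desirable is a separate question: repeats may reflect how common a name is, so this changes the distribution the model sees during training.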