Improve preprocessing steps

Removing broken or nonsensical fonts turned out to have a major impact on performance. Here's a TensorBoard chart showing all of the previous runs that I've done with various algorithms, maxing out at just shy of 95% accuracy:

The green line at the top is the first run with cleaned data, using a relatively simple CNN. Here's a closer look at performance on the regular and cleaned datasets, with an identical CNN:

Some of that increase is probably because there's just less data now (cleaning removed about 5% of the records, leaving 502,809 of 529,119), but some of it is almost certainly due to the algorithm no longer having to fit images like this, which are simply the logo of a font website:

[image: a sample from a font that renders a font website's logo]

There's also a fair number of fonts that have the digits 0-9 or 1-10 in place of actual letters, so the algorithm was trying to fit an image of a 1 or 2 to the label B.
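Both kinds of bad font could in principle be flagged automatically. What follows is only a rough sketch, not the cleaning that was actually done here. It assumes each font has already been rendered into a dict mapping a character to its raw bitmap bytes, and the function names and the 0.5 threshold are made up for illustration. A logo font tends to draw the same picture for every character, and a digit-substituted font may reuse its digit bitmaps for letters:

```python
from typing import Dict, List

DIGITS = "0123456789"


def distinct_glyph_fraction(glyphs: Dict[str, bytes]) -> float:
    """Number of distinct bitmaps divided by the number of characters."""
    if not glyphs:
        return 0.0
    return len(set(glyphs.values())) / len(glyphs)


def is_placeholder_font(glyphs: Dict[str, bytes], min_fraction: float = 0.5) -> bool:
    """Flag fonts where most characters share one bitmap (e.g. a logo).

    The 0.5 threshold is an arbitrary starting point, not a tuned value.
    """
    return distinct_glyph_fraction(glyphs) < min_fraction


def letters_matching_digits(glyphs: Dict[str, bytes]) -> List[str]:
    """Return letters whose bitmap is byte-identical to one of the font's digits."""
    digit_bitmaps = {glyphs[d] for d in DIGITS if d in glyphs}
    return sorted(
        ch for ch, bitmap in glyphs.items()
        if ch.isalpha() and bitmap in digit_bitmaps
    )


# Toy fonts with two-byte "bitmaps" standing in for real rendered images.
logo_font = {"A": b"\x01\x02", "B": b"\x01\x02", "C": b"\x01\x02"}
digit_font = {"1": b"\x10\x11", "2": b"\x20\x22", "A": b"\x10\x11", "B": b"\x20\x22"}
normal_font = {"A": b"\x0a\x0a", "B": b"\x0b\x0b", "1": b"\x01\x01"}

print(is_placeholder_font(logo_font))       # True
print(letters_matching_digits(digit_font))  # ['A', 'B']
print(is_placeholder_font(normal_font))     # False
```

Exact byte equality only catches fonts that reuse the identical bitmap. Anything subtler, such as a digit drawn in a slightly different position, would need a fuzzier comparison or a manual look.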

Next steps are to run the trained model on the font images, determine where the learning algorithm is having difficulty, and see if there's any way of fixing that, such as generating more similar examples.
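One way to find where the model struggles is to count which (true label, predicted label) pairs it gets wrong most often. This is a minimal sketch that assumes the model's predictions are already available as a plain list aligned with the true labels. The function name and the toy labels are hypothetical, not taken from this repo:

```python
from collections import Counter
from typing import List, Sequence, Tuple


def most_confused_pairs(
    true_labels: Sequence[str],
    predicted_labels: Sequence[str],
    top_n: int = 5,
) -> List[Tuple[Tuple[str, str], int]]:
    """Return the most frequent (true, predicted) pairs among the errors."""
    errors = Counter(
        (t, p) for t, p in zip(true_labels, predicted_labels) if t != p
    )
    return errors.most_common(top_n)


# Toy example: the model mistakes B for 8 twice.
true_labels = list("BBBOOQ")
predicted_labels = list("B88O0O")

print(most_confused_pairs(true_labels, predicted_labels, top_n=1))
# [(('B', '8'), 2)]
```

The pairs at the top of that list are the natural candidates for generating more examples, or for checking whether more bad fonts slipped through cleaning.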

knkski / atai
