knkski / atai

Analyze This! AI competition

Improve preprocessing steps #6

Open knkski opened 6 years ago

knkski commented 6 years ago

Right now our preprocessing is limited to simply extracting the files and converting them to numpy arrays:

https://github.com/knkski/atai/blob/master/preprocess.py

There's some room for improvement here. Some options we could try out are:

knkski commented 6 years ago

Removing crazy/nonsensical fonts turned out to have a major impact on performance. Here's a TensorBoard chart showing all of the previous runs that I've done with various algorithms, maxing out at just shy of 95% accuracy:

[Image: cleaned]

The green line at the top is the first run with cleaned data, using a relatively simple CNN. Here's a closer look at performance on the regular vs. cleaned dataset, with an identical CNN:

[Image: cleaned2]

Some of that increase is probably because there's just less data now (cleaning removed about 5% of the records, leaving 502809 of the original 529119), but some of it is almost certainly because the algorithm no longer has to fit images like this, which are simply the logo of a font website:
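For reference, here's the arithmetic behind that "about 5%" figure, using only the two record counts quoted above (the variable names are just for illustration):

```python
# Record counts quoted in this comment.
total_records = 529119    # before cleaning
cleaned_records = 502809  # after cleaning

removed = total_records - cleaned_records
removed_fraction = removed / total_records

print(f"removed {removed} records ({removed_fraction:.1%} of the original set)")
```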

[Image]

There are also a fair number of fonts that have the digits 0-9 or 1-10 in place of actual letters, so the algorithm was trying to fit a 1 or 2 to a label of B.

Next steps are to run the trained model on the font images, determine where the learning algorithm is having difficulty, and see if there's any way of fixing that, such as generating more, similar examples.
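One way to find where the model is struggling is a confusion matrix over its predictions. This is only a rough sketch, not code from the repo: `confusion_matrix` and `hardest_classes` are hypothetical helpers, the labels are assumed to be integer class indices, and the arrays at the bottom are toy data standing in for real model output.

```python
import numpy as np


def confusion_matrix(true_labels, predicted_labels, num_classes):
    """Count how often each true class (row) is predicted as each class (column)."""
    matrix = np.zeros((num_classes, num_classes), dtype=np.int64)
    # np.add.at accumulates correctly even when the same (row, col) pair repeats.
    np.add.at(matrix, (true_labels, predicted_labels), 1)
    return matrix


def hardest_classes(matrix, top_k=3):
    """Return (class index, accuracy) pairs for the classes with the lowest accuracy."""
    totals = matrix.sum(axis=1)
    # Classes with no examples get NaN, which argsort places last.
    accuracy = np.divide(
        np.diag(matrix),
        totals,
        out=np.full(len(totals), np.nan),
        where=totals > 0,
    )
    order = np.argsort(accuracy)[:top_k]
    return [(int(i), float(accuracy[i])) for i in order]


# Toy data: three classes, seven examples (an assumption for illustration only).
true_labels = np.array([0, 0, 1, 1, 2, 2, 2])
predicted_labels = np.array([0, 0, 1, 2, 2, 2, 0])

matrix = confusion_matrix(true_labels, predicted_labels, num_classes=3)
hardest = hardest_classes(matrix, top_k=2)
print(matrix)
print(hardest)
```

The off-diagonal cells show which pairs of classes get mixed up, which should hint at which kinds of examples are worth generating more of.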