riffatahmad / Data-Science-Project


Add embedding layers #8

Open wetherc opened 5 years ago

wetherc commented 5 years ago

Estimated time: 3 hours

Since we have some categorical variables in our dataset, we've had to one-hot encode them. This produces sparse matrices in our data. E.g., if we have a column for vehicle make, it might take values such as "Honda", "Toyota", "Chevrolet", etc.

After one-hot encoding, we now have a dataset that looks like

| OBS | CHEVY | TOYOTA | HONDA | ... |
|-----|-------|--------|-------|-----|
|  1  |   0   |    1   |   0   | ... |
|  2  |   1   |    0   |   0   | ... |
|  3  |   0   |    0   |   1   | ... |
|  4  |   0   |    0   |   1   | ... |

Basically, for most of the columns across a single observation, the values will overwhelmingly be 0. These sparse, high-dimensional inputs generally hurt model accuracy, so we will use embedding columns as a dimensionality reduction technique to turn these sparse one-hot vectors into low-dimensional dense ones. Google offers a good overview of these at https://developers.googleblog.com/2017/11/introducing-tensorflow-feature-columns.html
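To make the idea concrete, here's a minimal NumPy sketch (the vocabulary, indices, and embedding dimension are all hypothetical, just for illustration). An embedding layer is essentially a learned lookup table: a dense `(vocab_size, embedding_dim)` weight matrix, where multiplying a one-hot row by it selects one row of the table:

```python
import numpy as np

# Hypothetical vocabulary of vehicle makes (indices are arbitrary)
vocab = {"chevrolet": 0, "toyota": 1, "honda": 2}

# One-hot encoding: each observation becomes a mostly-zero row
makes = ["toyota", "chevrolet", "honda", "honda"]
one_hot = np.zeros((len(makes), len(vocab)))
one_hot[np.arange(len(makes)), [vocab[m] for m in makes]] = 1.0

# The embedding table: dense, low-dimensional (here 2-D instead of 3-D)
embedding_dim = 2
rng = np.random.default_rng(0)
embeddings = rng.normal(size=(len(vocab), embedding_dim))

# Two equivalent views of the embedding operation:
dense = one_hot @ embeddings                     # matrix-multiply view
lookup = embeddings[[vocab[m] for m in makes]]   # direct index-lookup view
assert np.allclose(dense, lookup)
```

In TensorFlow the table isn't random like this sketch; it's trained jointly with the rest of the network (e.g. via the embedding feature columns described in the linked post), so makes that behave similarly end up with similar dense vectors.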

wetherc commented 5 years ago

The model architecture should now look like

Input -> embedding layer -> hidden layer -> hidden layer -> output
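A rough NumPy sketch of that forward pass (all layer sizes here are hypothetical placeholders; in practice these would be trainable Embedding/Dense layers in TensorFlow, not fixed random matrices):

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical sizes for illustration
vocab_size, emb_dim, hidden_dim, out_dim = 10, 4, 8, 1

# Input -> embedding layer -> hidden layer -> hidden layer -> output
E  = rng.normal(size=(vocab_size, emb_dim))    # embedding table
W1 = rng.normal(size=(emb_dim, hidden_dim))    # first hidden layer weights
W2 = rng.normal(size=(hidden_dim, hidden_dim)) # second hidden layer weights
W3 = rng.normal(size=(hidden_dim, out_dim))    # output layer weights

def relu(x):
    return np.maximum(0.0, x)

def forward(category_ids):
    x = E[category_ids]   # embedding lookup: integer ids -> dense vectors
    h1 = relu(x @ W1)     # first hidden layer
    h2 = relu(h1 @ W2)    # second hidden layer
    return h2 @ W3        # output layer (e.g. a single regression score)

out = forward(np.array([0, 3, 7]))
print(out.shape)  # (3, 1)
```

Biases and the training loop are omitted for brevity; the point is only the shape of the data flow from sparse categorical ids through the dense embedding into the hidden layers.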