Open taimir opened 8 years ago
@taimir It looks helpful. I also found this one, but I doubt that any of them will improve our predictions, since the results (the latent variables) will also be biased by the contradictory and insufficient data. So it might be a good idea to present it on Friday (as a step we have thought about and implemented), but we should not spend too long tweaking this algorithm.
It is not supposed to improve our accuracy, but to represent an attempt to recover latent factors. We're unsure which features are beneficial for the overall predictions, so we turn to a solution that could automatically extract the relevant ground truth via a low-dimensional approximation of the data.
Ok, but how do you interpret the results? Each game is assigned a 2-D, 3-D, ..., small_n-D point, but do you know which features contributed the most? If there's a cluster of, say, DE-FR and SPN-POR, you know that these are the strong teams, but if you cannot infer the actual features that led to this conclusion, it is still useless from a probabilistic-interpretation point of view, isn't it?
Not necessarily. You can either cluster the matches or cluster the separate teams, and then do the classification / regression in the reduced domain.
You are not guaranteed that there will be an easier way to classify the data in the new domain. But there might be; you do not know the exact transformation. It is latent factor recovery, similar to PCA.
From a probabilistic perspective the approach is not useless. You are transforming your data, and you do not lose the ability to define confidence scores in the new space.
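To make the "confidence scores in the new space" point concrete, here is a minimal sketch, not what we actually run. It assumes a linear projection onto the leading principal direction (found by power iteration) as a stand-in for the embedding, and a nearest-centroid classifier whose softmax over negative squared distances serves as the confidence score. The feature values, the labels `home_win` / `away_win`, and all function names are made up for illustration.

```python
import math
import random

def center(rows):
    """Subtract the per-feature mean; return centered rows and the means."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    return [[r[j] - means[j] for j in range(d)] for r in rows], means

def top_component(centered, iters=200, seed=0):
    """Leading principal direction via power iteration on X^T X."""
    rng = random.Random(seed)
    d = len(centered[0])
    v = [rng.uniform(-1, 1) for _ in range(d)]
    for _ in range(iters):
        scores = [sum(r[j] * v[j] for j in range(d)) for r in centered]   # X v
        w = [sum(s * r[j] for s, r in zip(scores, centered)) for j in range(d)]  # X^T (X v)
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return v

def project(row, means, v):
    """1-D coordinate of a raw row in the reduced space."""
    return sum((row[j] - means[j]) * v[j] for j in range(len(v)))

def fit_centroids(zs, labels):
    """Mean reduced coordinate per class."""
    return {lab: sum(z for z, l in zip(zs, labels) if l == lab) / labels.count(lab)
            for lab in set(labels)}

def confidence(z, centroids):
    """Softmax over negative squared distances to each class centroid."""
    logits = {lab: -(z - c) ** 2 for lab, c in centroids.items()}
    m = max(logits.values())
    exps = {lab: math.exp(l - m) for lab, l in logits.items()}
    total = sum(exps.values())
    return {lab: e / total for lab, e in exps.items()}

# Hypothetical toy data: three made-up features per match.
X = [[2.0, 1.8, 0.1], [1.9, 2.1, -0.1], [2.2, 2.0, 0.0],
     [-2.0, -1.9, 0.1], [-2.1, -2.0, -0.1], [-1.8, -2.2, 0.0]]
labels = ["home_win"] * 3 + ["away_win"] * 3

centered, means = center(X)
v = top_component(centered)
zs = [project(r, means, v) for r in X]
centroids = fit_centroids(zs, labels)

new_match = [2.0, 2.0, 0.0]
probs = confidence(project(new_match, means, v), centroids)
print(probs)
```

The sign of the principal direction is arbitrary, but that does not matter here because the centroids are computed in the same projected coordinates. With a non-linear embedding like t-SNE there is no projection function for unseen matches, so in that case the new point would have to be embedded together with the training data.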
t-SNE would be useless if you necessarily want to be able to interpret original features as more or less important.
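For contrast, a linear method such as PCA does expose per-feature weights: the entries of a principal direction (the loadings) say how strongly each original feature contributes to it. The sketch below is only an illustration under assumptions, with hypothetical feature names (`goal_diff`, `rank_diff`, `noise`) and made-up values; t-SNE has no equivalent of these loadings.

```python
import math

def covariance(rows):
    """Sample covariance matrix of the rows (features in columns)."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    cov = [[0.0] * d for _ in range(d)]
    for r in rows:
        for a in range(d):
            for b in range(d):
                cov[a][b] += (r[a] - means[a]) * (r[b] - means[b]) / (n - 1)
    return cov

def leading_eigenvector(mat, iters=500):
    """Power iteration: unit vector along the largest-variance direction."""
    d = len(mat)
    v = [1.0 / math.sqrt(d)] * d
    for _ in range(iters):
        w = [sum(mat[a][b] * v[b] for b in range(d)) for a in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return v

# Hypothetical features and values, for illustration only.
feature_names = ["goal_diff", "rank_diff", "noise"]
matches = [[3.0, 1.0, 0.1], [-2.0, -1.0, -0.1], [2.5, 0.8, 0.0],
           [-3.0, -1.2, 0.1], [1.0, 0.5, -0.1], [-1.5, -0.6, 0.0]]

loadings = leading_eigenvector(covariance(matches))

# Rank features by absolute loading on the first principal direction.
ranked = sorted(zip(feature_names, loadings), key=lambda p: abs(p[1]), reverse=True)
for name, weight in ranked:
    print(f"{name}: {weight:+.3f}")
```

Loadings depend on feature scale, so on real data the features would normally be standardized first.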
@gdikov Read this R package here and also here