src-d / style-analyzer

Lookout Style Analyzer: fixing code formatting and typos during code reviews
GNU Affero General Public License v3.0
32 stars 21 forks

[style-analyzer] Find right balance to one hot encode roles #92

Closed m09 closed 6 years ago

m09 commented 6 years ago

So at this point roles are handled as ordinal data, which does not allow for good splits on roles. We should find a way to one-hot encode them. Straight-up one-hot encoding may not be optimal (130+ classes) but is an option. It should be compared with a smaller number of buckets (~10, ~50) if the performance is not great.

EgorBu commented 6 years ago

One possible way that could be simpler than this: use a sparse matrix, which is supported by sklearn (link). It would only have the small overhead of an additional array of ones for each feature, which is not a problem.
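A minimal sketch of the sparse encoding suggested here, using sklearn's OneHotEncoder; the role IDs below are invented for illustration:

```python
# Sketch: one-hot encode integer role IDs into a scipy sparse matrix.
# The role IDs are made-up stand-ins for the real ~130 role classes.
import numpy as np
from sklearn.preprocessing import OneHotEncoder

roles = np.array([[3], [17], [3], [128], [42]])  # one role ID per node

encoder = OneHotEncoder(handle_unknown="ignore")
X = encoder.fit_transform(roles)  # sparse by default, not a dense ndarray

# exactly one non-zero entry (the "array of ones") per sample
print(X.shape, X.nnz)
```

The sparse output keeps memory proportional to the number of samples rather than samples times classes.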

m09 commented 6 years ago

I thought a bit about it, and even with sparse matrices we'll have to take care not to end up with too many features for the tree. At this point there are around ~130 possible roles, and this gets multiplied by all the siblings (left, parents, right), i.e., with default parameters, by 12.

This means that if we don't bucket, we have around 1500 features just for the roles. In my experience this is way too many.

zurk commented 6 years ago

It is a lot, I agree. I think at some point we should definitely make embeddings for Roles.

m09 commented 6 years ago

Yes, if we are OK with having an external training phase, it would be great to replace those with small embeddings (dimension 10 max?) to keep the feature space reasonable.

vmarkovtsev commented 6 years ago

Once we have embeddings we will not be able to explicitly express the learned rules, no?

Otherwise it is very easy to train - they are graph embeddings of each node. Not for Q3 though - too little time.

m09 commented 6 years ago

It's true that it's not interpretable.

zurk commented 6 years ago

I think we should have a pretrained model for each language we analyze, so there is no additional time cost for one more training. So it is something like: take PGA --> get the role co-occurrence matrix --> run swivel.

To get the co-occurrence matrix we can do the same thing we do for identifiers: add +1 co-occurrence for Roles in one scope. It is easy to code, but it takes a lot of time to get the model. So I think for now it is more of an enhancement direction.
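A hedged sketch of the "+1 per scope" counting described above; the scopes and role names are invented, and the real pipeline would run over PGA:

```python
# Sketch: count role co-occurrences, +1 for every pair of distinct roles
# that appear in the same scope. Scopes here are toy data.
from collections import Counter
from itertools import combinations

scopes = [
    ["Identifier", "Operator", "Identifier"],
    ["Operator", "Literal"],
]

cooc = Counter()
for scope in scopes:
    # sorted(set(...)) deduplicates roles and canonicalizes the pair order
    for a, b in combinations(sorted(set(scope)), 2):
        cooc[a, b] += 1

print(cooc)
```

The resulting counts would feed a swivel-style embedding training, as sketched in the pipeline above.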

zurk commented 6 years ago

Actually, it is interpretable. At each leaf in the tree we will have a subset of Roles. The same thing we have now, I believe.

m09 commented 6 years ago

It is not interpretable in the sense that you can't modify the rules, since they operate in embedded space.

vmarkovtsev commented 6 years ago

We can design an optimal bucketing scheme by clustering the roles embedding space.

zurk commented 6 years ago

I ran experiments with one-hot encoding preprocessing, and here are the first results:

  1. BayesSearchCV gives a weird error if I use a sparse matrix, so it just cannot handle it. To test everything I convert it to a regular ndarray (yes, memory inefficient). I set n_iter=10.
  2. I use the freecodecamp data from the tests to measure performance.
  3. Training without one-hot encoding gives 34 features and a 0.989 score while taking ~19.3 seconds to train (FormatAnalyzer.train(ptr, config, datastub)).
  4. Training with one-hot encoding gives 2902 features and a 0.992 score while taking ~182 seconds to train.

So, we can see some quality improvement, but the cost is a large increase in training time and memory consumption.

Now I am going to implement some kind of Role bucketing.

P.S.: Related PR https://github.com/src-d/style-analyzer/pull/116

vmarkovtsev commented 6 years ago

@zurk BayesSearchCV is just a wrapper, if the underlying model supports sparse input then it should work. What is the error you get?

m09 commented 6 years ago

@zurk it's weird, their documentation does mention `X : array-like or sparse matrix, shape = [n_samples, n_features] The training input samples.`

zurk commented 6 years ago

I also think it is strange. OK, let me look a little deeper.

zurk commented 6 years ago

yeap, I was able to fix it.

m09 commented 6 years ago

Are the running times any better?

zurk commented 6 years ago

OK, here are my results.

  1. I implemented several ways of preprocessing the Roles features: a. one-hot encoding, b. bucketing (splitting a categorical feature into several categorical features, see here: https://github.com/src-d/style-analyzer/pull/116/files#diff-f778fb4562b4ded91223ccd22dd95b08R199), c. feature hashing.
  2. I measured time and performance on our test freecodecamp data for two dimensions, 10 and 40. One-hot encoding takes so much time that I dropped it without waiting for it to finish. Here are the results for 5 runs: identity is no Roles preprocessing, hashing 10 hashes the features into a 10-dimensional space, buckets 10 buckets the role features into a 10-dim space, and the same goes for hashing 40 and buckets 40.
INFO:time:Identity:
INFO:FeaturesExtractor:Features number: 34
DEBUG:FormatAnalyzer:score of the best estimator found: 0.989
36 rules, avg.len. 4.8
INFO:time:Training time: 19.6189

INFO:time:hashing 10:
INFO:FeaturesExtractor:Features number: 142
DEBUG:FormatAnalyzer:score of the best estimator found: 0.990
28 rules, avg.len. 4.4
INFO:time:Training time: 34.8752

INFO:time:buckets 10:
INFO:FeaturesExtractor:Features number: 142
DEBUG:FormatAnalyzer:score of the best estimator found: 0.991
35 rules, avg.len. 4.9
INFO:time:Training time: 33.4389

INFO:time:hashing 40:
INFO:FeaturesExtractor:Features number: 502
DEBUG:FormatAnalyzer:score of the best estimator found: 0.991
26 rules, avg.len. 4.7
INFO:time:Training time: 61.5031

INFO:time:buckets 40:
INFO:FeaturesExtractor:Features number: 502
DEBUG:FormatAnalyzer:score of the best estimator found: 0.991
28 rules, avg.len. 4.6
INFO:time:Training time: 49.0307

The results seem OK, but I set n_iter=10 and get the same params for every preprocessing:

DEBUG:FormatAnalyzer:params of the best estimator found: 
{'base_model_name': 'sklearn.tree.DecisionTreeClassifier', 
'max_depth': 10, 'max_features': None, 'min_samples_leaf': 5, 'min_samples_split': 16}

It is weird. Any idea why? Maybe it is some bug in the code.

As for the results themselves, it is hard to generalize looking at only one repo, but it looks like a 0.1% performance gain is not worth a +100% increase in training time.
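For reference, the two preprocessings compared above can be sketched roughly as follows; the role names and the bucket assignment are invented for illustration, not taken from PR #116:

```python
# Sketch of the two Role preprocessings benchmarked above.
from sklearn.feature_extraction import FeatureHasher

roles = ["SimpleStatement", "BinaryExpression", "Identifier"]

# a) feature hashing: each role string is hashed into a 10-dim space
hasher = FeatureHasher(n_features=10, input_type="string")
X_hash = hasher.transform([[r] for r in roles])
print(X_hash.shape)  # (3, 10)

# b) bucketing: map each role to one of a few buckets (hand-chosen here),
# then the bucket ids can be one-hot encoded downstream
buckets = {"SimpleStatement": 0, "BinaryExpression": 1, "Identifier": 1}
print([buckets[r] for r in roles])  # [0, 1, 1]
```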

zurk commented 6 years ago

My PR #116 is ready for review, but I am not sure we should include this code if we have no plans to use it.

m09 commented 6 years ago

Thanks @zurk for the investigation + thorough reporting :+1:

I think we're going in the wrong direction here. "Direct bucketing" and "direct hashing" cannot help in our case. Bucketing as done in sklearn only reproduces the problem we have a certain number of times at smaller scales. It won't prevent the theoretical problem that the algorithm still can't find an entropy-minimizing split if we have random features in each bucket. Hashing is even worse in a sense, since murmur (the lib used by sklearn) is known to produce really good random hashes (see this excellent SO answer), which means that two related operators (== and ===) are as likely to have the same hash as == and ClassDeclaration.
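To illustrate the point about murmur's uniformity, here is a tiny check using the MurmurHash3 helper that sklearn exposes; the bucket count is arbitrary:

```python
# Sketch: MurmurHash3 spreads role strings uniformly over buckets, so
# related roles are no more likely to collide than unrelated ones.
from sklearn.utils import murmurhash3_32

n_buckets = 10
for role in ("==", "===", "ClassDeclaration"):
    print(role, murmurhash3_32(role, seed=0, positive=True) % n_buckets)
```

Whatever the bucket assignment turns out to be, it carries no semantic relationship between the roles.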

I'm not surprised at all by the fact that the hyperparameters stay the same across runs: the roles are still not used, and max_features is set to None, which allows the model to choose non-role features at every step. If we study the rules produced, I would actually be very surprised to find a significant proportion of splits done on roles.

I think we have to go back to the clustering alternatives we discussed. We can, for example, use embeddings of the operators to produce a hierarchical clustering and decide experimentally at which level of the hierarchy to cut. Those clusters could be visualized by the user, which would still fit our interpretability requirement.
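A rough sketch of that idea, assuming we already had role embeddings (random stand-ins here; the real ones would come from e.g. swivel), using scipy's hierarchical clustering:

```python
# Sketch: hierarchically cluster role embeddings and cut the tree at a
# chosen level. The embeddings are random stand-ins for illustration.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.RandomState(0)
embeddings = rng.randn(130, 10)  # ~130 roles, 10-dim embeddings

tree = linkage(embeddings, method="ward")
# cut the hierarchy into at most 10 clusters; the level is the tunable knob
labels = fcluster(tree, t=10, criterion="maxclust")
print(labels.min(), labels.max())
```

The cut level would be chosen experimentally, and the resulting clusters could be shown to the user for inspection.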

Another option would be to decide the clusters with our domain knowledge, but obviously that would be painful, both this time and for every new language, so I'd be opposed to it.

vmarkovtsev commented 6 years ago

@m09 Here are my thoughts on this. Our problem is that we have too few splits on the role feature, correct? Two at most, actually. If we hash the roles into n dimensions, then we will be able to split 2n times, which is way better. Can this simple hack work?

I am afraid that any other co-occurrence heuristic will not help us, because each role is unique and the co-occurrence does not mean anything for the rules. E.g., the fact that == and === are in the same cluster/bucket/embedding does not have any direct consequence for the spacing rules, etc. (this particular example is bad, though). So we need to group by the similarity of the resulting rules or of the neighboring predicted class patterns, not by co-occurrence.

Let's remember that Babelfish gives us awesome role annotations. Each internal type is actually a set of universal roles, and their actual number is much lower. Let's conduct a quick experiment: use them instead of the precise internal types and see what happens.

Another idea: we have never plotted the frequency histogram of internal type usage. I suspect it is very unbalanced. Thus we can one-hot encode the core N types and hash/bucket/whatever our crazy genius suggests for the rest.

m09 commented 6 years ago

Oh, I read too fast about the hash method. I thought we were just hashing to n_features values, which would not make any difference compared to what we have now; that explains the difference in performance.

Still, bucketing gives us a guaranteed n_buckets * 2 separable values, while hashing gives us n_features * 2 separable values that can collide (if the extrema in several dimensions are the same role). Neither is really what we want, but bucketing seems preferable to me if we have to choose.

I am afraid that any other co-occurrence heuristic will not help us because each role is unique and the co-occurrence does not mean anything for the rules.

Co-occurrences do encode syntax though and syntax is extremely relevant to formatting so to me it looks like a good way to anchor roles in our problem, even though not perfect I agree.

Each internal type is actually a set of universal roles, and their factual number is much lower.

This seems like a great direction for future experiments. A lattice of universal roles is much simpler for a decision tree to split than the internal types we currently have :+1:

zurk commented 6 years ago

  1. I did not think about collisions -- good point, @m09. But I am not sure we have them in our case when we embed ~200 features into a ~10-dim space. Anyway, I also prefer bucketing to hashing; I coded hashing just for the experiments.
  2. I like all the ideas @vmarkovtsev suggested: a. encode roles with respect to their frequencies, b. use the bblfsh universal Roles for encoding (however, there are still a lot of them -- 119, so only about twice fewer), c. encode a Role by its neighbor tokens' Roles, which exactly reflects what we are trying to predict.

So the first things I want to try are:

  1. Plot the distribution of internal Roles in JS
  2. Plot the distribution of Roles in JS
  3. Measure the time for an embedding dimension of 119 (equal to the number of universal Roles).

vmarkovtsev commented 6 years ago

@zurk Regarding the 119 universal roles - not all of them are used in every language. We can further coerce the rare roles into one.

zurk commented 6 years ago

Yup, you are right. I calculated the frequencies for the JS dataset that Warren collected (images are clickable):

  1. Number of JS internal roles: 151.
  2. Number of JS UAST roles: 88.
  3. Internal roles distribution (the y scale is logarithmic here and everywhere): image
  4. UAST roles distribution: image
  5. Joining internal roles while the summed frequency is less than 2 * mean frequency: 25 buckets, the last bucket contains 118 roles: image
  6. Joining UAST roles while the summed frequency is less than 2 * mean frequency: 29 buckets, the last bucket contains 41 roles: image

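The merging rule used in points 5-6 can be sketched like this; the frequencies below are invented, while the real counts come from the JS dataset:

```python
# Sketch: greedily merge roles (rarest first) into a bucket while the
# bucket's summed frequency stays below 2 * mean frequency.
freqs = {"Identifier": 5000, "Operator": 3000, "Literal": 900,
         "Comment": 60, "Decorator": 25, "WithStatement": 15}
threshold = 2 * sum(freqs.values()) / len(freqs)

buckets, current, total = [], [], 0
for role, f in sorted(freqs.items(), key=lambda kv: kv[1]):
    if current and total + f >= threshold:
        buckets.append(current)  # bucket is full, start a new one
        current, total = [], 0
    current.append(role)
    total += f
if current:
    buckets.append(current)

print(buckets)
```

With these toy counts the rare roles end up merged into a single bucket while the frequent roles keep their own, mirroring the "last bucket contains many roles" effect seen above.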
zurk commented 6 years ago

So, the last two options look promising. I will implement both.

vmarkovtsev commented 6 years ago

There is a clear tradeoff between the granularity of the roles and the model's accuracy. You are using 2 * mean, but this is actually a hyperparameter which should be pre-tuned.

zurk commented 6 years ago

Of course.

zurk commented 6 years ago

Related PR: https://github.com/src-d/style-analyzer/pull/124 -- I added code to calculate the internal and universal role frequencies for the bucketing implementation discussed above.

vmarkovtsev commented 6 years ago

Done

vmarkovtsev commented 6 years ago

We will file separate issues for specific problems.