Closed: m09 closed this issue 6 years ago.
One possible way that could be simpler than this one: use a sparse matrix, which is supported by sklearn (link). It will only add the small overhead of an additional array of ones for each feature, which is not a problem.
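As a minimal sketch of the sparse-matrix idea (the role ids below are made up), sklearn's `OneHotEncoder` already produces a scipy sparse matrix by default, so the memory cost stays proportional to the number of nodes, not to the number of roles:

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Hypothetical role ids for 5 nodes (one categorical column).
roles = np.array([[3], [1], [3], [0], [2]])

enc = OneHotEncoder()             # sparse output by default
X = enc.fit_transform(roles)      # scipy sparse matrix

print(X.shape)  # (5, 4): one column per distinct role
print(X.nnz)    # 5: a single 1 per row, the rest is implicit zeros
```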
I thought a bit about it, and even with sparse matrices we will have to take care not to end up with too many features for the tree. At this point there are around 130 possible roles, and this gets multiplied by all the siblings (left, parents, right), i.e., with default parameters, by 12. This means that if we don't bucket, we have around 1,500 features just for the roles. In my experience this is way too many.
It is a lot, I agree. I think at some point we should definitely make embeddings for Roles.
Yes, if we are OK with having an external training phase, it would be great to replace those with small embeddings (dimension 10 max?) to keep the feature space reasonable.
Once we have embeddings we will not be able to explicitly express the learned rules, no?
Otherwise it is very easy to train: they are graph embeddings of each node. Not for Q3 though, too little time.
It's true that it's not interpretable.
I think that we should have a pretrained model for each language we analyze, so there is no additional time cost for one more training. So it is something like: take PGA --> get the role co-occurrence matrix --> run Swivel.
To get the co-occurrence matrix we can do the same thing we do for identifiers: add +1 co-occurrence for Roles in one scope. It is easy to code, but it takes a lot of time to get the model. So I think for now it is more of an enhancement direction.
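The "+1 per scope" counting described above can be sketched in a few lines (the scopes and role ids here are invented for illustration; the real input would come from UAST scopes):

```python
from itertools import combinations

import numpy as np

N_ROLES = 5
# Hypothetical role ids observed in three scopes.
scopes = [[0, 1, 3], [1, 3], [2, 4, 1]]

cooc = np.zeros((N_ROLES, N_ROLES), dtype=np.int64)
for scope in scopes:
    # +1 for every unordered pair of distinct roles in the same scope.
    for a, b in combinations(set(scope), 2):
        cooc[a, b] += 1
        cooc[b, a] += 1  # keep the matrix symmetric

print(cooc[1, 3])  # roles 1 and 3 co-occur in two scopes -> 2
```

A matrix like this is exactly the kind of input Swivel expects for training embeddings.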
Actually, it is interpretable. In each leaf of the tree we will have a subset of Roles. The same thing we have now, I believe.
It is not interpretable in the sense that you can't modify the rules, since they are operating in embedded space.
We can design an optimal bucketing scheme by clustering the roles embedding space.
I made experiments with one-hot encoding preprocessing, and here are the first results. I set `n_iter=10` and used the `freecodecamp` data from tests to test the performance: `FormatAnalyzer.train(ptr, config, datastub)`. So, we can see some quality improvement, but at the cost of a large increase in time and memory consumption.
Now I am going to implement some kind of Role bucketing.
P.S.: Related PR https://github.com/src-d/style-analyzer/pull/116
@zurk BayesSearchCV is just a wrapper, if the underlying model supports sparse input then it should work. What is the error you get?
@zurk it's weird, their documentation does mention: `X : array-like or sparse matrix, shape = [n_samples, n_features]` (the training input samples).
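For what it's worth, sparse input is easy to sanity-check with a toy run. `BayesSearchCV` (scikit-optimize) follows the same `fit(X, y)` contract as sklearn's own search wrappers, so this sketch uses `RandomizedSearchCV` to stay dependency-free; all data here is random toy data:

```python
import numpy as np
from scipy.sparse import random as sparse_random
from sklearn.model_selection import RandomizedSearchCV
from sklearn.tree import DecisionTreeClassifier

rng = np.random.RandomState(0)
# A random CSR matrix standing in for sparse one-hot role features.
X = sparse_random(200, 50, density=0.05, format="csr", random_state=rng)
y = rng.randint(0, 2, size=200)

search = RandomizedSearchCV(
    DecisionTreeClassifier(random_state=0),
    {"max_depth": [3, 5, 10], "min_samples_leaf": [1, 5]},
    n_iter=4, cv=3, random_state=0,
)
search.fit(X, y)  # works: DecisionTreeClassifier accepts CSR input
print(search.best_params_)
```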
I also think that it is strange. OK, let me look a little bit deeper.
Yep, I was able to fix it.
Are the running times any better?
OK, here are my results on the `freecodecamp` data for two dimensions, 10 and 40. One-hot encoding takes so much time that I dropped it without waiting. Here are the results for 5 runs: `identity` is no Roles preprocessing, `hashing 10` is hashing features into a 10-dimensional space, `buckets 10` is bucketing role features into a 10-dimensional space, and the same for `hashing 40` and `buckets 40`.
```
INFO:time:Identity:
INFO:FeaturesExtractor:Features number: 34
DEBUG:FormatAnalyzer:score of the best estimator found: 0.989
36 rules, avg.len. 4.8
INFO:time:Training time: 19.6189
INFO:time:hashing 10:
INFO:FeaturesExtractor:Features number: 142
DEBUG:FormatAnalyzer:score of the best estimator found: 0.990
28 rules, avg.len. 4.4
INFO:time:Training time: 34.8752
INFO:time:buckets 10:
INFO:FeaturesExtractor:Features number: 142
DEBUG:FormatAnalyzer:score of the best estimator found: 0.991
35 rules, avg.len. 4.9
INFO:time:Training time: 33.4389
INFO:time:hashing 40:
INFO:FeaturesExtractor:Features number: 502
DEBUG:FormatAnalyzer:score of the best estimator found: 0.991
26 rules, avg.len. 4.7
INFO:time:Training time: 61.5031
INFO:time:buckets 40:
INFO:FeaturesExtractor:Features number: 502
DEBUG:FormatAnalyzer:score of the best estimator found: 0.991
28 rules, avg.len. 4.6
INFO:time:Training time: 49.0307
```
The results seem OK, but I set `n_iter=10` and get the same params for every preprocessing:
```
DEBUG:FormatAnalyzer:params of the best estimator found:
{'base_model_name': 'sklearn.tree.DecisionTreeClassifier',
 'max_depth': 10, 'max_features': None, 'min_samples_leaf': 5, 'min_samples_split': 16}
```
It is weird. Any idea why? Maybe it is some bug in the code. As for the results themselves, it is hard to generalize looking at only one repo, but it looks like a 0.1% performance gain is not worth a +100% increase in training time.
My PR #116 is ready for review, but I am not sure we should include this code if we have no plans to use it.
Thanks @zurk for the investigation + thorough reporting :+1:
I think we're going in the wrong direction here. "Direct bucketing" and "direct hashing" cannot help in our case. Bucketing as done in sklearn only reproduces the problem we have a certain number of times at smaller scales. It won't prevent the theoretical problem that the algorithm still can't find an entropy-minimizing split if we have random features in each bucket. Hashing is even worse in a sense, since murmur (the lib used by sklearn) is known to produce really good random hashes (see this excellent SO answer), which means that two related operators (`==` and `===`) are as likely to have the same hash as `==` and `ClassDeclaration`.
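The randomness of the hash buckets is easy to see with sklearn's `FeatureHasher` (which uses MurmurHash3 under the hood): the buckets below bear no relation to token similarity.

```python
from sklearn.feature_extraction import FeatureHasher

hasher = FeatureHasher(n_features=10, input_type="string")
tokens = ["==", "===", "ClassDeclaration"]
X = hasher.transform([[t] for t in tokens]).toarray()

# The bucket index of each token (the single nonzero column per row);
# the assignments look random, not similarity-based.
buckets = {t: int(abs(row).argmax()) for t, row in zip(tokens, X)}
print(buckets)
```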
I'm not surprised at all by the fact that the hyperparameters stay the same across runs: the roles are still not used, and `max_features` is set to `None`, which allows the model to choose non-role features at every step. If we study the rules produced, I would actually be very surprised to find a significant proportion of splits done on roles.
I think we have to go back to the clustering alternatives we discussed. We can for example use embeddings of the operators to then produce a hierarchical cluster and decide experimentally at which level of the hierarchy we should cluster. Those clusters could be visualized by the user which would still fit our interpretability requirement.
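The hierarchical-clustering idea can be sketched with scipy; the role embeddings below are random stand-ins for whatever embeddings (Swivel or otherwise) we would actually train, and the number of buckets is the level of the hierarchy we would pick experimentally:

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

rng = np.random.RandomState(42)
role_embeddings = rng.rand(130, 10)  # ~130 roles, 10-dim embeddings (made up)

# Build the full hierarchy, then cut it into a fixed number of buckets.
Z = linkage(role_embeddings, method="ward")
buckets = fcluster(Z, t=20, criterion="maxclust")

print(len(set(buckets)))  # at most 20 buckets; each role maps to one bucket
```

The dendrogram itself could also be shown to the user, which is what keeps this variant interpretable.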
Another option would be to decide clusters with our domain knowledge, but obviously it will be painful this time and for every new language so I'd be opposed to it.
@m09 Here are my thoughts on this. Our problem is that we have too few splits on the role feature, correct? A maximum of two, actually. If we hash the roles into n dimensions, then we will be able to split 2n times, which is way better. Can this simple hack work?
I am afraid that any other co-occurrence heuristic will not help us, because each role is unique and the co-occurrence does not mean anything for the rules. E.g. the fact that `==` and `===` are in the same cluster/bucket/embedding does not have any direct consequences for the spacing rules, etc. (this particular example is bad though). So we need to group by the similarity of the resulting rules or neighbor predicted class patterns, not by co-occurrence.
Let's remember that Babelfish gives us awesome role annotations. Each internal type is actually a set of universal roles, and their factual number is much lower. Let's conduct a quick experiment and use them instead of the precise internal types and see what happens.
Another idea: we have never plotted the frequency histogram of internal type usage. I suspect that it is very unbalanced. Thus we can one-hot the core N types and hash/bucket/whatever-complex-our-crazy-genius-suggests the rest.
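A hedged sketch of that hybrid scheme (type names, counts, and sizes are all invented; Python's built-in `hash` stands in for a real hashing-trick function):

```python
from collections import Counter

types = ["Identifier", "Identifier", "Block", "Identifier", "Block",
         "RareTypeA", "RareTypeB", "Identifier"]
N, H = 2, 4  # one-hot the top-2 types, hash the long tail into 4 buckets

top = [t for t, _ in Counter(types).most_common(N)]

def encode(t):
    vec = [0] * (N + H)
    if t in top:
        vec[top.index(t)] = 1          # dedicated one-hot slot for core types
    else:
        vec[N + hash(t) % H] = 1       # shared hashed slot for rare types
    return vec

print(encode("Identifier"))  # [1, 0, 0, 0, 0, 0]
print(encode("RareTypeA"))   # a 1 in one of the 4 hashed tail slots
```

This keeps the core types fully separable while capping the feature count for the tail.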
Oh, I read too fast about the hash method. I thought we were just hashing to `n_features` values, which should not make any difference compared to what we have now, so that explains the difference in performance.
Still, bucketing gives us a guaranteed `n_buckets * 2` separable values, while hashing gives us `n_features * 2` separable values that can collide (if the extrema in several dimensions are the same role). Neither is really what we want, but bucketing seems preferable to me if we have to choose.
> I am afraid that any other co-occurrence heuristic will not help us because each role is unique and the co-occurrence does not mean anything for the rules.

Co-occurrences do encode syntax though, and syntax is extremely relevant to formatting, so to me it looks like a good way to anchor roles in our problem, even if not perfect, I agree.
> Each internal type is actually a set of universal roles, and their factual number is much lower.

This seems like a great direction for future experiments. A lattice of universal roles is much simpler for a decision tree to split than the internal types we currently have :+1:
So the first thing I want to try is
@zurk Regarding the 119 universal roles: not everything is used for every language. We can further coerce the rare roles into one.
Yep, you are right. I calculated frequencies for the JS dataset that Warren collected (images are clickable):

[frequency histogram] Cut at `2 * mean frequency`: 25 buckets, the last bucket contains 118 roles.

[frequency histogram] Cut at `2 * mean frequency`: 29 buckets, the last bucket contains 41 roles.
So, the last two options look promising. I will implement both.
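The frequency-based bucketing can be sketched as follows; the per-role frequencies are invented, and `2 * mean frequency` is the cut used in the plots above (but see the tuning point below):

```python
import numpy as np

# Hypothetical per-role frequencies:
freqs = {"Identifier": 500, "Block": 400, "Operator": 250,
         "RareA": 3, "RareB": 2, "RareC": 1}

threshold = 2 * np.mean(list(freqs.values()))  # ~385 for this toy data

n_frequent = sum(f >= threshold for f in freqs.values())
tail = n_frequent                     # id of the shared long-tail bucket
buckets, i = {}, 0
for role, f in freqs.items():
    if f >= threshold:
        buckets[role] = i             # frequent roles keep their own bucket
        i += 1
    else:
        buckets[role] = tail          # everything rare shares one bucket

print(buckets)
```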
There is a clear tradeoff between the granularity of roles and the model's accuracy. You are using `2 * mean`, but actually this is a hyperparameter which should be pre-tuned.
Of course.
Related PR: https://github.com/src-d/style-analyzer/pull/124. I added code to calculate internal and universal role frequencies for the bucketing implementation discussed above.
Done
We will file separate issues for specific problems.
So at this point roles are handled as ordinal data, which does not allow good splits on roles. We should find a way to one-hot encode them. Straight-up one-hot encoding may not be optimal (130+ classes) but can be an option. It should be compared with a smaller number of buckets (~10, ~50) if the performance is not great.
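The ordinal-vs-one-hot problem can be demonstrated on toy data (role ids and the target role are invented): with an ordinal encoding, a single split can only carve roles into `id < t` / `id >= t` ranges, while one-hot lets the tree isolate any single role in one split.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder
from sklearn.tree import DecisionTreeClassifier

rng = np.random.RandomState(0)
roles = rng.randint(0, 130, size=(2000, 1))  # ordinal role ids
y = (roles[:, 0] == 57).astype(int)          # label depends on one role only

ordinal = DecisionTreeClassifier(max_depth=1, random_state=0).fit(roles, y)
onehot_X = OneHotEncoder().fit_transform(roles)
onehot = DecisionTreeClassifier(max_depth=1, random_state=0).fit(onehot_X, y)

# One ordinal split cannot isolate role 57; one split on its one-hot
# column separates the classes perfectly.
print(ordinal.score(roles, y))
print(onehot.score(onehot_X, y))
```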