src-d / style-analyzer

Lookout Style Analyzer: fixing code formatting and typos during code reviews
GNU Affero General Public License v3.0
32 stars 21 forks source link

Always add token itself to the candidates #714

Closed irinakhismatullina closed 5 years ago

irinakhismatullina commented 5 years ago

Right now the token is always added to the candidates during generation, but still will not be returned as a candidate, if there are too much candidates with higher probabilities.

The details should be discussed, so WIP for now. UPD I guess good to merge, small almost nothing-affecting changes.

EgorBu commented 5 years ago

TBH I don't think that it's useful because: 1) logic of preparation of identifier candidates depends on probability of suggested tokens 2) for each token there could be 3 suggestions by default 3) number of options is n_suggestions ** n_tokens_with_suggestions - even for 2 tokens it will be 9. 4) we suggest 3 identifiers -> so 100% that this added token will not appear in suggestions Example:

####################
Token_Taken_Tuken
('token', 1.0)  ('tokens', 0.51058453)  ('then', 0.5639854) 0.287962225952878
('token', 1.0)  ('tokens', 0.51058453)  ('tokens', 0.5639854)   0.287962225952878
('token', 1.0)  ('token', 0.50261325)   ('then', 0.5639854) 0.2834665365347
('token', 1.0)  ('token', 0.50261325)   ('tokens', 0.5639854)   0.2834665365347
('token', 1.0)  ('tokens', 0.51058453)  ('token', 0.5495695)    0.2806016802807747
('token', 1.0)  ('then', 0.49494833)    ('then', 0.5639854) 0.27914363412682164
('token', 1.0)  ('then', 0.49494833)    ('tokens', 0.5639854)   0.27914363412682164
('token', 1.0)  ('token', 0.50261325)   ('token', 0.5495695)    0.2762209042932753
('token', 1.0)  ('then', 0.49494833)    ('token', 0.5495695)    0.27200849874137845
####################
someWrang_Variuble_NOME
('some', 1.0)   ('range', 0.6094888)    ('variable', 0.581842)  ('name', 0.5628399) 0.1995977709675805
('some', 1.0)   ('range', 0.6094888)    ('variable', 0.581842)  ('node', 0.5628399) 0.1995977709675805
('some', 1.0)   ('wrap', 0.6013746) ('variable', 0.581842)  ('name', 0.5628399) 0.19694051441056953
('some', 1.0)   ('wrap', 0.6013746) ('variable', 0.581842)  ('node', 0.5628399) 0.19694051441056953
('some', 1.0)   ('range', 0.6094888)    ('variables', 0.56985956)   ('name', 0.5628399) 0.19548725903740585
('some', 1.0)   ('range', 0.6094888)    ('variables', 0.56985956)   ('node', 0.5628399) 0.19548725903740585
('some', 1.0)   ('lang', 0.596493)  ('variable', 0.581842)  ('name', 0.5628399) 0.19534186226877293
('some', 1.0)   ('lang', 0.596493)  ('variable', 0.581842)  ('node', 0.5628399) 0.19534186226877293
('some', 1.0)   ('range', 0.6094888)    ('variable', 0.581842)  ('code', 0.54944277)    0.19484678832984798
('some', 1.0)   ('wrap', 0.6013746) ('variables', 0.56985956)   ('name', 0.5628399) 0.1928847259611541
('some', 1.0)   ('wrap', 0.6013746) ('variables', 0.56985956)   ('node', 0.5628399) 0.1928847259611541
('some', 1.0)   ('wrap', 0.6013746) ('variable', 0.581842)  ('code', 0.54944277)    0.19225278187680941
('some', 1.0)   ('lang', 0.596493)  ('variables', 0.56985956)   ('name', 0.5628399) 0.19131899642500183
('some', 1.0)   ('lang', 0.596493)  ('variables', 0.56985956)   ('node', 0.5628399) 0.19131899642500183
('some', 1.0)   ('range', 0.6094888)    ('variables', 0.56985956)   ('code', 0.54944277)    0.19083411802745195
('some', 1.0)   ('lang', 0.596493)  ('variable', 0.581842)  ('code', 0.54944277)    0.19069218210670327
('some', 1.0)   ('wrap', 0.6013746) ('variables', 0.56985956)   ('code', 0.54944277)    0.18829353248397804
('some', 1.0)   ('lang', 0.596493)  ('variables', 0.56985956)   ('code', 0.54944277)    0.18676507166984396
####################
some_wrang_variuble_nome
('some', 1.0)   ('range', 0.6094888)    ('variable', 0.581842)  ('name', 0.5628399) 0.1995977709675805
('some', 1.0)   ('range', 0.6094888)    ('variable', 0.581842)  ('node', 0.5628399) 0.1995977709675805
('some', 1.0)   ('wrap', 0.6013746) ('variable', 0.581842)  ('name', 0.5628399) 0.19694051441056953
('some', 1.0)   ('wrap', 0.6013746) ('variable', 0.581842)  ('node', 0.5628399) 0.19694051441056953
('some', 1.0)   ('range', 0.6094888)    ('variables', 0.56985956)   ('name', 0.5628399) 0.19548725903740585
('some', 1.0)   ('range', 0.6094888)    ('variables', 0.56985956)   ('node', 0.5628399) 0.19548725903740585
('some', 1.0)   ('lang', 0.596493)  ('variable', 0.581842)  ('name', 0.5628399) 0.19534186226877293
('some', 1.0)   ('lang', 0.596493)  ('variable', 0.581842)  ('node', 0.5628399) 0.19534186226877293
('some', 1.0)   ('range', 0.6094888)    ('variable', 0.581842)  ('code', 0.54944277)    0.19484678832984798
('some', 1.0)   ('wrap', 0.6013746) ('variables', 0.56985956)   ('name', 0.5628399) 0.1928847259611541
('some', 1.0)   ('wrap', 0.6013746) ('variables', 0.56985956)   ('node', 0.5628399) 0.1928847259611541
('some', 1.0)   ('wrap', 0.6013746) ('variable', 0.581842)  ('code', 0.54944277)    0.19225278187680941
('some', 1.0)   ('lang', 0.596493)  ('variables', 0.56985956)   ('name', 0.5628399) 0.19131899642500183
('some', 1.0)   ('lang', 0.596493)  ('variables', 0.56985956)   ('node', 0.5628399) 0.19131899642500183
('some', 1.0)   ('range', 0.6094888)    ('variables', 0.56985956)   ('code', 0.54944277)    0.19083411802745195
('some', 1.0)   ('lang', 0.596493)  ('variable', 0.581842)  ('code', 0.54944277)    0.19069218210670327
('some', 1.0)   ('wrap', 0.6013746) ('variables', 0.56985956)   ('code', 0.54944277)    0.18829353248397804
('some', 1.0)   ('lang', 0.596493)  ('variables', 0.56985956)   ('code', 0.54944277)    0.18676507166984396
####################
some_defenerely_wrang_variuble_nome
('some', 1.0)   ('defenerely', 1.0) ('range', 0.59475017)   ('variable', 0.56873894)    ('name', 0.5628399) 0.1903848695819964
('some', 1.0)   ('defenerely', 1.0) ('range', 0.59475017)   ('variable', 0.56873894)    ('node', 0.5628399) 0.1903848695819964
('some', 1.0)   ('defenerely', 1.0) ('wrong', 0.58907753)   ('variable', 0.56873894)    ('name', 0.5628399) 0.18856900860524536
('some', 1.0)   ('defenerely', 1.0) ('wrong', 0.58907753)   ('variable', 0.56873894)    ('node', 0.5628399) 0.18856900860524536
('some', 1.0)   ('defenerely', 1.0) ('wrap', 0.5875922) ('variable', 0.56873894)    ('name', 0.5628399) 0.18809353545123167
('some', 1.0)   ('defenerely', 1.0) ('wrap', 0.5875922) ('variable', 0.56873894)    ('node', 0.5628399) 0.18809353545123167
('some', 1.0)   ('defenerely', 1.0) ('range', 0.59475017)   ('variables', 0.5562698)    ('name', 0.5628399) 0.18621084478410124
('some', 1.0)   ('defenerely', 1.0) ('range', 0.59475017)   ('variables', 0.5562698)    ('node', 0.5628399) 0.18621084478410124
('some', 1.0)   ('defenerely', 1.0) ('range', 0.59475017)   ('variable', 0.56873894)    ('code', 0.54944277)    0.18585317964635104
('some', 1.0)   ('defenerely', 1.0) ('wrong', 0.58907753)   ('variables', 0.5562698)    ('name', 0.5628399) 0.1844347950001363
('some', 1.0)   ('defenerely', 1.0) ('wrong', 0.58907753)   ('variables', 0.5562698)    ('node', 0.5628399) 0.1844347950001363
('some', 1.0)   ('defenerely', 1.0) ('wrong', 0.58907753)   ('variable', 0.56873894)    ('code', 0.54944277)    0.18408054121628104
('some', 1.0)   ('defenerely', 1.0) ('wrap', 0.5875922) ('variables', 0.5562698)    ('name', 0.5628399) 0.18396974618677503
('some', 1.0)   ('defenerely', 1.0) ('wrap', 0.5875922) ('variables', 0.5562698)    ('node', 0.5628399) 0.18396974618677503
('some', 1.0)   ('defenerely', 1.0) ('wrap', 0.5875922) ('variable', 0.56873894)    ('code', 0.54944277)    0.18361638564707042
('some', 1.0)   ('defenerely', 1.0) ('range', 0.59475017)   ('variables', 0.5562698)    ('code', 0.54944277)    0.18177850825930877
('some', 1.0)   ('defenerely', 1.0) ('wrong', 0.58907753)   ('variables', 0.5562698)    ('code', 0.54944277)    0.1800447334048005
('some', 1.0)   ('defenerely', 1.0) ('wrap', 0.5875922) ('variables', 0.5562698)    ('code', 0.54944277)    0.17959075404791294
irinakhismatullina commented 5 years ago

What exactly are you writing about? What is not useful?

In this PR i introduce very small change: always consider the token itself as a candidate (before it was done only if the token itself is inside the vocabulary, or there are no other possible candidates from the vocabulary). Here I just always add it.

Before i tried to do this, but experienced significant drop of quality. Now i also added new feature, indicaring that the token is out-of-vocabulary, that in theory should soften this drop (because it helps to distinguish between candidates with edit distance=0: correct in-vocabulary tokens that should be left as they are and wrong out-of-vocabulary tokens that should be changed to smth in-vocabulary).

As far as i understood, there was a request to always return the probability of the token itself. Here it is not done fully, we consider the token itself as a candidate of the same level, as others (before i excluded it from the suggestions in the analyzer).

zurk commented 5 years ago

So what did you change in this PR, @irinakhismatullina ?

irinakhismatullina commented 5 years ago

In this PR i introduce very small change: always consider the token itself as a candidate (before it was done only if the token itself is inside the vocabulary, or there are no other possible candidates from the vocabulary). Here I just always add it.

Before i tried to do this, but experienced significant drop of quality. Now i also added new feature, indicaring that the token is out-of-vocabulary, that in theory should soften this drop (because it helps to distinguish between candidates with edit distance=0: correct in-vocabulary tokens that should be left as they are and wrong out-of-vocabulary tokens that should be changed to smth in-vocabulary).

@zurk

vmarkovtsev commented 5 years ago

Merging now, but once we have the evaluation dataset we should reconsider this.