src-d / style-analyzer

Lookout Style Analyzer: fixing code formatting and typos during code reviews
GNU Affero General Public License v3.0

Change default identifiers dataset Google Drive ID #771

Closed irinakhismatullina closed 5 years ago

irinakhismatullina commented 5 years ago

Loaded a new dataset (1 million identifiers, as before). Changes:

  1. Updated splits from the TokenParser.
  2. Merged rows with equal splits.
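
A minimal sketch of what merging rows with equal splits could look like. The row layout `(identifier, split, frequency)`, the function name `merge_equal_splits`, and the rule of keeping the first identifier while summing frequencies are all assumptions for illustration; the actual dataset schema and merge rule are not stated in this thread.

```python
def merge_equal_splits(rows):
    """Merge rows that share the same split.

    `rows` is an iterable of (identifier, split, frequency) tuples.
    This layout is hypothetical, not the real dataset schema.
    """
    merged = {}  # split -> (first identifier seen, summed frequency)
    for identifier, split, frequency in rows:
        if split in merged:
            first_identifier, total = merged[split]
            merged[split] = (first_identifier, total + frequency)
        else:
            merged[split] = (identifier, frequency)
    # Dicts preserve insertion order, so rows come back in first-seen order.
    return [(ident, split, freq) for split, (ident, freq) in merged.items()]


# Made-up example rows: two spellings that yield the same split.
rows = [
    ("fooBar", "foo bar", 3),
    ("foo_bar", "foo bar", 2),
    ("getValue", "get value", 5),
]
result = merge_equal_splits(rows)
```

Here `fooBar` and `foo_bar` collapse into a single row because both have the split `foo bar`.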

vmarkovtsev commented 5 years ago

@irinakhismatullina Did you use the freshly merged neural splitter?

irinakhismatullina commented 5 years ago

No, is it ready? I didn't know that.

irinakhismatullina commented 5 years ago

Did anybody check the quality? I would love to play with it, but after that I will probably have to restart all experiments from scratch to get the best quality with the fresh data...

vmarkovtsev commented 5 years ago

The quality is the same as in the paper: https://arxiv.org/abs/1805.11651

vmarkovtsev commented 5 years ago

Caution: it requires a GPU to work; otherwise you will finish splitting one million identifiers by Christmas. So use the ML cluster.

irinakhismatullina commented 5 years ago

I always use the cluster :) So, what's best to do: look at the new splitter right away, or first add everything I've done with the old one?

vmarkovtsev commented 5 years ago

Add the old stuff first; that's safer.

You can contact @glimow about using the model; he is the one who worked on it.

irinakhismatullina commented 5 years ago

Then this PR can be merged.