src-d / ml-backlog

Issues belonging to source{d}'s Machine Learning team which cannot be related to a specific repository.

BPE the identifier dataset and publish the new model #87

Open vmarkovtsev opened 5 years ago

vmarkovtsev commented 5 years ago

As discussed at the reading club, we should run https://github.com/google/sentencepiece on our identifiers dataset to learn a BPE vocabulary, and produce a nice compact ASDF model in Modelforge that can embed any identifier. That model can later be integrated somewhere near our ID splitter.
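
For anyone who missed the reading club, here is a rough sketch of what BPE does to identifiers. It is a toy pure-Python illustration of the merge-learning idea, not the sentencepiece API and not the planned implementation; the function names (`learn_bpe`, `merge_pair`, `encode`) and the sample identifiers are made up for the example.

```python
from collections import Counter


def merge_pair(symbols, pair):
    """Replace every adjacent occurrence of `pair` in `symbols` with one merged symbol."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return out


def learn_bpe(identifiers, num_merges):
    """Learn up to `num_merges` merge rules from an iterable of identifier strings."""
    # Every identifier starts as a sequence of single characters, weighted by frequency.
    vocab = Counter(tuple(ident) for ident in identifiers)
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs across the whole corpus.
        pairs = Counter()
        for symbols, freq in vocab.items():
            for pair in zip(symbols, symbols[1:]):
                pairs[pair] += freq
        if not pairs:
            break  # every identifier is already a single symbol
        # Most frequent pair wins; ties are broken lexicographically for determinism.
        best = min(pairs, key=lambda p: (-pairs[p], p))
        merges.append(best)
        # Apply the new merge rule to every identifier in the corpus.
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            new_vocab[tuple(merge_pair(list(symbols), best))] += freq
        vocab = new_vocab
    return merges


def encode(identifier, merges):
    """Split an identifier into subword units by replaying the merges in learned order."""
    symbols = list(identifier)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return symbols
```

Calling `learn_bpe(["get_name", "get_value", "set_name", "set_value"], 10)` and then `encode("get_size", merges)` splits an unseen identifier into the learned subword units. The real run would rely on sentencepiece for training and Modelforge for the ASDF export; the sketch only shows why frequent identifier fragments end up as single tokens.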

vmarkovtsev commented 5 years ago

Assigning myself - please let me do something interesting along with the endless meetings, hehe.