src-d / ml-backlog

Issues belonging to source{d}'s Machine Learning team which cannot be related to a specific repository.

UAST node role rec: Swivel and other embeddings #24

Closed marnovo closed 5 years ago

marnovo commented 7 years ago

EPIC: https://github.com/src-d/backlog/issues/858

Story: "As a data scientist or developer, I want the best tradeoff of trained models that can suggest node roles for a given UAST."

marnovo commented 7 years ago

Needs some bug fixing on the ML side. Ongoing.

marnovo commented 7 years ago

@fineguy, in case you want to train any additional models, please update the issue here.

marnovo commented 7 years ago

@fineguy update the issue with the developments since the last sprint started, please.

fineguy commented 7 years ago

I've trained embeddings using Swivel (/storage/timofei/role2vec/swivel) and GloVe (/storage/timofei/role2vec/glove). Node2Vec training is a lot slower - it scales poorly, so I never saw it finish training.

fineguy commented 7 years ago

Code is available here: https://github.com/fineguy/role2vec/tree/master/embeddings

EgorBu commented 7 years ago

@fineguy, can you share a gist with the code to reproduce your experiment with embeddings?

fineguy commented 7 years ago

Path to project folder: /storage/timofei/embeddings. All UASTs were randomly split into train, test and valid sets in a 60%:20%:20% ratio.
1) uasts_train.txt, uasts_test.txt, uasts_valid.txt -- text files with paths to UASTs for the train, test and valid sets.
2) prox_train.txt, prox_test.txt, prox_valid.txt -- directories with proximity matrices extracted for the train, test and valid sets respectively. Proximity matrices have depth 1 (i.e. regular co-occurrence matrices).
Node2Vec is hard to scale; work in progress.
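
For readers without access to that folder, here is a minimal sketch of the two steps described above: a random 60%:20%:20% split of UAST paths, and a depth-1 proximity (co-occurrence) count over directly connected nodes. It is not the role2vec code. The function names, the fixed seed, the `(role, children)` tuple representation of a tree, and the assumption of one role per node are all hypothetical simplifications made for illustration.

```python
import random
from collections import Counter


def split_paths(paths, ratios=(0.6, 0.2, 0.2), seed=0):
    """Randomly split paths into train, test and valid lists.

    The seed is an assumption added here so the split is reproducible.
    """
    shuffled = list(paths)
    random.Random(seed).shuffle(shuffled)
    n_train = int(len(shuffled) * ratios[0])
    n_test = int(len(shuffled) * ratios[1])
    train = shuffled[:n_train]
    test = shuffled[n_train:n_train + n_test]
    valid = shuffled[n_train + n_test:]  # whatever remains
    return train, test, valid


def cooccurrences(tree, counts=None):
    """Count role pairs for directly connected nodes (depth 1).

    `tree` is a hypothetical (role, children) tuple, with a single role
    per node. Counts are symmetric: each parent-child edge adds one to
    both (parent, child) and (child, parent).
    """
    if counts is None:
        counts = Counter()
    role, children = tree
    for child in children:
        child_role = child[0]
        counts[(role, child_role)] += 1
        counts[(child_role, role)] += 1
        cooccurrences(child, counts)
    return counts


# Illustrative usage with made-up paths and a toy tree.
paths = ["uast_%d.pb" % i for i in range(10)]
train, test, valid = split_paths(paths)

toy_tree = ("File", [("Function", [("Identifier", [])]), ("Identifier", [])])
toy_counts = cooccurrences(toy_tree)
```

A larger depth would also count pairs of nodes that are two or more edges apart; depth 1 keeps only direct parent-child pairs, which is why the result is an ordinary co-occurrence matrix.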

vmarkovtsev commented 7 years ago

Blocked because there is no dataset: https://github.com/src-d/backlog/issues/1040

vmarkovtsev commented 5 years ago

This is outdated. @fineguy is no longer an intern at source{d}. We used the knowledge from this project in subsequent experiments.