src-d / ml-backlog

Issues belonging to source{d}'s Machine Learning team which cannot be related to a specific repository.

Library Embedding using co-occurrence in Docker Images #84

Closed glimow closed 5 years ago

glimow commented 5 years ago

Embeddings have already been applied to software libraries and source code (Alon et al. 2019, Theeten et al. 2019). However, we need to introduce two novel types of embedding: Docker image embeddings and Docker image library embeddings. We used Swivel (Shazeer et al. 2016) as our embedding algorithm because of its performance on GPUs and the quality of the output vector space.
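Swivel is trained on a co-occurrence matrix, so the first step is counting which libraries appear together in the same image. A minimal sketch of that counting step, assuming each image is available as a list of library names; the toy data and the `cooccurrence_counts` helper are hypothetical and not taken from the actual pipeline:

```python
from collections import Counter
from itertools import combinations


def cooccurrence_counts(images):
    """Count how often each unordered pair of libraries shares an image."""
    counts = Counter()
    for libraries in images:
        # Deduplicate and sort so every pair has one canonical key (a, b) with a < b.
        for a, b in combinations(sorted(set(libraries)), 2):
            counts[(a, b)] += 1
    return counts


# Hypothetical toy data: one list of installed libraries per Docker image.
images = [
    ["numpy", "requests", "openssl"],
    ["express", "openssl"],
    ["numpy", "openssl"],
]
counts = cooccurrence_counts(images)
```

Here `counts[("numpy", "openssl")]` is 2 because the pair appears in two images. The resulting pair counts would still have to be converted into whatever input format the Swivel implementation expects.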

For now I have only embedded the library space and not the image space, though I will probably do the latter before my internship ends to compare the two methods. To visualize the library space, which has 300 dimensions, I used t-SNE for dimensionality reduction.
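For reference, a reduction of this kind can be sketched with scikit-learn's `TSNE`. This is only an illustration under assumptions: the random matrix stands in for the real 300-dimensional library embeddings, and the perplexity and seed are arbitrary choices, not the settings used for the plot.

```python
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
# Placeholder for the real embeddings: 50 libraries x 300 dimensions.
embeddings = rng.normal(size=(50, 300))

# Project to 2D; perplexity must be smaller than the number of samples.
tsne = TSNE(n_components=2, perplexity=10, init="pca", random_state=0)
coords = tsne.fit_transform(embeddings)  # one (x, y) row per library
```

Each row of `coords` gives the 2D position of one library, ready to be drawn as a scatter plot.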

In the t-SNE plot, the green dots are Node packages, the red dots are native packages, and the blue dots are Python packages. The radius of each dot is proportional to the logarithm of the disk space the package uses.
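The styling rule described above can be written down as a small helper. This is a sketch under assumptions: the natural logarithm, the `scale` factor, the byte unit, and the `marker_style` name are illustrative choices, since the original plot's exact settings are not given here.

```python
import math

# Color per package ecosystem, matching the legend described above.
COLORS = {"node": "green", "native": "red", "python": "blue"}


def marker_style(ecosystem, disk_bytes, scale=1.0):
    """Return (color, radius) with radius proportional to log(disk usage)."""
    # Clamp to 1 byte so empty packages get radius 0 instead of a math error.
    radius = scale * math.log(max(disk_bytes, 1))
    return COLORS[ecosystem], radius


color, radius = marker_style("python", 1024)
```

The logarithm keeps a handful of very large packages from dwarfing every other dot in the plot.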

vmarkovtsev commented 5 years ago

Tristan's internship has ended. The artifact is https://github.com/src-d/docker-image-analysis, which is going to become public.