Closed glimow closed 5 years ago
Thanks @glimow for the great issues! Did you ask infra for a repo to put your code in? I see you have a personal repo for now, but it'd be great to have it in src-d's org :) Let me know if you need help on that.
@m09 Yup I hosted this on a personal repo after asking @vmarkovtsev, because he wanted to double check the code before hosting it under sourced's banner.
Also, UPDATE: we have 50,000 images since the Python lib update. I suspect most of the limitations come from the amount of data, which is limiting the embedding's quality. I will embed again when I have ~100k images.
@glimow Normally you run a basic frequency analysis and cut the rare libraries, that should help with embeddings.
@vmarkovtsev actually the prepscript already does that for me (by default it skips libs with <5 occurrences)
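For readers following along, here is a minimal sketch of the kind of frequency cut discussed above. It assumes each image is represented as a list of library names; `filter_rare_libraries` and `min_count` are hypothetical names for illustration, not the actual prep script from the repo.

```python
from collections import Counter


def filter_rare_libraries(images, min_count=5):
    """Drop libraries that occur in fewer than `min_count` images.

    `images` is assumed to be a list of lists of library names,
    one inner list per image (an illustrative layout only).
    """
    # Count each library once per image so duplicates inside a
    # single image do not inflate its frequency.
    counts = Counter(lib for libs in images for lib in set(libs))
    return [[lib for lib in libs if counts[lib] >= min_count] for libs in images]


# Toy data: "numpy" appears in 6 images, "requests" in 5, "rarelib" in 1.
images = [["numpy", "requests"]] * 5 + [["numpy", "rarelib"]]
filtered = filter_rare_libraries(images)
```

With the default threshold of 5, `rarelib` is dropped from the last image while `numpy` and `requests` are kept everywhere.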
Tristan's internship has ended. The artifact is https://github.com/src-d/docker-image-analysis which is going to become public.
DockerHub stores millions of Docker images, allowing anybody to download and run them. Among those images are databases, operating systems, web services, development environments, applications, build tools and so on. For this reason, DockerHub is an awesome dataset of millions of applications inside their own running environments.
Thus, analysing this huge dataset could allow building tools to automatically generate server configuration given a code repository, or infer the cost of the cloud infrastructure needed to run a specific service.
My main goal is to explore the feasibility of such tools. It involves a lot of different subtasks: