src-d / ml-backlog

Issues belonging to source{d}'s Machine Learning team which cannot be related to a specific repository.

Retrieving the docker image dataset #81

Closed: glimow closed this issue 5 years ago

glimow commented 5 years ago

The list of Docker images available on Docker Hub is not public. Schermann et al. (2018) crawled GitHub to find Docker image repositories, but that approach is not suitable for our case. I used the Docker Hub website API instead, which has a bug and does not support fetching more than 10,000 results.

To work around that limit, I wrote a Python script that simulates a web browser and crawls the Docker image space by recursively issuing search queries to the Docker API, iterating over possible query strings to reduce the number of results returned for each request. With it, I retrieved a list of 1.7 million Docker images.
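
The recursive splitting idea can be sketched roughly as follows. This is a minimal illustration, not the actual script: `crawl`, `make_fake_search`, `ALPHABET`, and the `(total, results)` return shape are all hypothetical names and assumptions, and the search backend here is an in-memory stand-in that matches by prefix rather than the real Docker Hub API. Whenever a query reports more matches than the cap allows, the query string is extended by one character and each narrower query is crawled in turn.

```python
import string
from typing import Callable, Iterable, List, Set, Tuple

# Hypothetical interface: a search function takes a query string and returns
# (total number of matches, the matches actually returned, truncated at the cap).
SearchFn = Callable[[str], Tuple[int, List[str]]]

# Assumed character set for image names; adjust to whatever the real API accepts.
ALPHABET = string.ascii_lowercase + string.digits + "-_."


def crawl(search: SearchFn, prefix: str, cap: int, max_len: int = 8) -> Set[str]:
    """Collect names matching `prefix`, splitting the query when it is capped."""
    total, results = search(prefix)
    found = set(results)
    # Stop when the query fits under the cap, or when the prefix is already
    # max_len characters long (a guard against unbounded recursion; results
    # beyond the cap are lost in that case).
    if total <= cap or len(prefix) >= max_len:
        return found
    # Otherwise narrow the query by one character and recurse on each branch.
    for ch in ALPHABET:
        found |= crawl(search, prefix + ch, cap, max_len)
    return found


def make_fake_search(names: Iterable[str], cap: int) -> SearchFn:
    """Build an in-memory stand-in for a capped, prefix-matching search API."""
    ordered = sorted(names)

    def search(query: str) -> Tuple[int, List[str]]:
        matches = [n for n in ordered if n.startswith(query)]
        return len(matches), matches[:cap]

    return search


if __name__ == "__main__":
    names = ["nginx", "node", "nodejs", "redis", "redmine",
             "python", "postgres", "php"]
    search = make_fake_search(names, cap=2)
    print(sorted(crawl(search, "", cap=2)))
```

With a cap of 2, the empty query is truncated, so the crawl descends into one-character queries and splits again only where a branch still exceeds the cap (for example `n` and `p` here). A real crawler would also need rate limiting and retries, which this sketch leaves out.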

vmarkovtsev commented 5 years ago

Tristan's internship has ended. The artifact is https://github.com/src-d/docker-image-analysis which is going to become public.