As downloading all those images would take too long, I tried to eliminate the ones we did not want to analyze: those not based on a supported distribution and those containing neither Python nor Node packages. To do that, we forked skopeo, an open source Go tool, to fetch additional metadata (including command history) from the Docker API. We released the 15 GB of resulting data as the Dockerhub Metadata Dataset. Using this dataset, I was able to select images that use Python or Node.js in their command history. I selected 600,000 images using Node and/or Python along with one of the distributions we want to analyze.
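The selection step can be sketched roughly as follows. This is a minimal illustration, not the actual pipeline: the field names (`name`, `history`) and the marker strings are assumptions about the metadata schema, chosen only to show the filtering idea.

```python
import json

# Hypothetical schema: each metadata record has a "name" and a "history"
# list of Dockerfile commands. These field names are assumptions.
SUPPORTED_DISTROS = ("debian", "ubuntu", "centos", "alpine")
LANG_MARKERS = ("pip install", "npm install", "python", "node")

def is_candidate(image):
    """Keep an image whose command history mentions both a supported
    distribution and at least one Python/Node marker."""
    history = " ".join(cmd.lower() for cmd in image.get("history", []))
    return (any(d in history for d in SUPPORTED_DISTROS)
            and any(m in history for m in LANG_MARKERS))

def select_images(lines):
    """Filter a metadata dump stored as one JSON object per line."""
    for line in lines:
        image = json.loads(line)
        if is_candidate(image):
            yield image["name"]
```

Run over the full dump, this kind of filter is what cuts the candidate set down to the 600,000 images mentioned above.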
But 600,000 images would still take around 15 days to process on 16 machines.
Fortunately, Docker images have an inheritance system: images are based on other images, each one representing a filesystem layer. When Docker downloads an image, it checks whether the parent images are already downloaded, to save time and bandwidth. The inheritance tree was available in the metadata we had; by rebuilding it and smartly partitioning the list of images across machines, we were able to reduce the amount of data to download from $S$ to roughly $\ln(S)$.
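The partitioning idea above can be sketched as follows. This is a simplified assumption-laden sketch (a real Docker image has a chain of layers, not a single parent pointer, and the greedy balancing here is the crudest possible heuristic): images that share a common ancestor are placed on the same machine, so shared layers are downloaded only once per machine.

```python
from collections import defaultdict

def root_of(image, parent):
    """Walk up the inheritance chain to the base image.
    `parent` maps an image name to its parent image name."""
    while image in parent:
        image = parent[image]
    return image

def partition(images, parent, n_machines):
    """Group images that share a base image onto the same machine,
    so each shared parent layer is downloaded only once per machine."""
    families = defaultdict(list)
    for img in images:
        families[root_of(img, parent)].append(img)
    machines = [[] for _ in range(n_machines)]
    # Greedy balancing: largest families first, onto the lightest machine.
    for family in sorted(families.values(), key=len, reverse=True):
        min(machines, key=len).extend(family)
    return machines
```

For example, two images both built on `python` would land on the same machine, and their common base layers would be fetched a single time there.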
Related to https://github.com/src-d/ml-backlog/issues/80