ros-infrastructure / buildfarm_deployment

Apache License 2.0
30 stars 39 forks source link

Error in cleanup_docker_images.py causing nodes to fill up with Docker images. #155

Closed nuclearsandwich closed 6 years ago

nuclearsandwich commented 7 years ago

cleanup_docker_images.py which is present and unmodified in the xenial branch. is failing with the short trace

2017-08-15 15:14:10,744 Uncaught exception in cleanup_docker_image.py: Traceback (most recent call last):
  File "cleanup_docker_images.py", line 150, in <module>
    main()
  File "cleanup_docker_images.py", line 139, in main
    run_image_cleanup(args, minimum_age, dclient)
  File "cleanup_docker_images.py", line 64, in run_image_cleanup
    repo_tags = i['RepoTags'][0] if '<none>:<none>' not in i['RepoTags'] else dockerid
TypeError: argument of type 'NoneType' is not iterable

The python error itself isn't particularly interesting, it should probably be handled and logged. I haven't yet performed any analysis on the underlying cause which is the main reason for this issue. For now I'm overzealously cleaning up the testfarm nodes with a docker images | awk ' {print $3}' |xargs docker rmi to get them back online.

Resolving this is a blocker to complete the xenial migration as without resolving this someone will need to babysit the storage available on each node.

mikaelarguedas commented 7 years ago

@nuclearsandwich do you know the version of python-docker is used on the new farm vs the old farm ? I can give a hand looking into it if it helps unlocking the xenial buildfarm migration

nuclearsandwich commented 7 years ago

The version installed is docker-py 1.10.6. It looks like the docs now refer to the pypi package docker with a version 2.4.2. Updating it may make it Just Work :tm:. It isn't a consistent failure yet, I'll see if I can find a second node exhibiting the issue so I can try swapping the docker library and see if that resolves it.

nuclearsandwich commented 7 years ago

https://pypi.python.org/pypi/docker-py vs https://pypi.python.org/pypi/docker/

mikaelarguedas commented 7 years ago

I don't know how much of the "new features" we want to leverage on the farm. Looks like ubuntu xenial provides 1.9.0 (I would assume of the docker-py package). But yeah looks like they bumped versions a lot recently. Playing with 2.4.2 and update our code to match their new API may be worth it. Given that they bumped the major I'd expect some changes to be necessary on our side. It looks like this happens only when you get empty tag lists so maybe you'll need a specific set of images / tags to exhibit it. 2.4.2 seems to handle none:none tags for us https://github.com/docker/docker-py/blob/2.4.2/docker/models/images.py#L41-L44

nuclearsandwich commented 7 years ago

Yeah, I think it's reasonable to try and update the script to use an up-to-date version, we're using upstream rather than xenial Docker on the hosts, although I haven't come up with a plan to either pin the Docker version or keep configured hosts up-to-date with upstream so it's still an ambiguous target.

@mikaelarguedas do you want to peek at the upgrade path, otherwise I'll make time for it later this week.

mikaelarguedas commented 7 years ago

Looked a bit more into this. Looks like the bug has been fixed upstream and has been released in 2.0.2. Quite some api changes happened since they bumped the major. I opened #156 for the updated API. @nuclearsandwich If you can give it a shot and provide feedback that'd be awesome. Also I didnt see the python docker module referenced anywhere in this repo, do you know any other place where this module is being used?

nuclearsandwich commented 7 years ago

Also I didnt see the python docker module referenced anywhere in this repo, do you know any other place where this module is being used?

I'm not sure if you mean any other python scripts or where the python package is installed during puppet runs, for the xenial branch the latter is:

https://github.com/nuclearsandwich/buildfarm_deployment/blob/414af3fd0c0ad41397fb057b0354d36b9cff7f48/modules/profile/manifests/jenkins/agent.pp#L102-L106

I expect when we merge your python script changes there (during the refactor I made it so there's only one copy of the script in the sources) that I'll also update the puppet config to install docker rather than docker-py and I'll probably pin the version to the current one rather than always installing the latest.

mikaelarguedas commented 7 years ago

I was referring to any other python scripts using the docker module.

I expect when we merge your python script changes there (during the refactor I made it so there's only one copy of the script in the sources)

:+1: great, I was surprised to see 3 identical files. I can modify the PR to target the xenial branch so that you can integrate this change in your testing if it's desired

nuclearsandwich commented 7 years ago

I can modify the PR to target the xenial branch so that you can integrate this change in your testing if it's desired

That would be stunningly good of you, because of the organizational changes between the xenial branch and master it might be most straightforward to recreate the commits here and create a new PR rather than reparenting or rebasing the changes here.

Unless we also update the puppet here to use the newer docker python library we should merge these changes against master.

nuclearsandwich commented 6 years ago

Resolved now that Xenial is the default.