On the off chance there was a bug in Paramiko, I tried upgrading to 2.x with no effect.
If I roll back to 0.3.10 and docker-map 0.6.6, I get the following error in the same place:
docker.errors.APIError: 500 Server Error: Internal Server Error ("{"message":"Cannot link to a non running container: /graylog_map.elasticsearch AS /graylog_map.graylog/elasticsearch"}")
Does the docker-fabric start logic recursively check dependencies? Obviously, it's following the requirements from nginx to graylog. However, graylog depends on additional containers that aren't getting started in the right order.
OK. I'll open a separate issue for the running-container error. I hardcoded the start order and it worked fine on 0.3.10. I upgraded to 0.4.0 and 0.7.0, and the connection issue persists.
I restored some debug logging and here's what I'm seeing. This connection works:
-------------============= API ==============------------------
with get_connection_args:
{u'local_port': 22025, u'base_url': None, u'remote_port': None}
This connection fails:
-------------============= API ==============------------------
with get_connection_args:
{u'local_port': 22025, u'base_url': 'http://127.0.0.1:22026', u'remote_port': None}
... and the failing configuration is being built in the dependency path:
DEBUG:dockermap.map.state.base:Following dependency path for graylog_map.nginx.
DEBUG:dockermap.map.state.base:Dependency path at graylog_map.graylog, instances [None].
-------------============= API ==============------------------
with get_connection_args:
{u'local_port': 22025, u'base_url': 'http://127.0.0.1:22026', u'remote_port': None}
DEBUG:docker.auth.auth:Trying paths: ['/home/<domain>/<name>/.docker/config.json', '/home/<domain>/<name>/.dockercfg']
DEBUG:docker.auth.auth:Found file at path: /home/<domain>/<name>/.docker/config.json
This is actually the same client constructor code that I chased down two days ago, so troubleshooting can resume with my last work on the topic:
OK. It looks like the offender is in dockermap.map.state.base, line 116. This client_config is generated as part of the clients in line 105 or 107; those are generated by the _policy.
I see a policy created in dockermap.map.client on line 102, but I am still trying to chase it through the logic to confirm. Perhaps you know definitively either way (and can shortcut my search).
So the offending client is being created in the else branch (line 107) with policy.__dict__:
{
'_f_resolver': <dockermap.map.policy.dep.ContainerDependencyResolver object at 0x7f65a7178910>,
'_clients': {u'__default__': {'use_host_config': True}}, '_container_names': {}, '_images': {},
'_maps': {'graylog_map': <dockermap.map.container.ContainerMap object at 0x7f65a16dced0>},
'_r_resolver': <dockermap.map.policy.dep.ContainerDependencyResolver object at 0x7f65a2d75690>}
The only thing here that looks like it could even affect a client is use_host_config: True (do you see something else?). I'm still tracking down where that might cause the host to be populated.
Fabric still uses Paramiko 1.x, so upgrading might cause additional issues. (The most recent release of Fabric supports 2.x.)
I have been looking at several parts of the connection creation process, but I am still not sure where the failing configuration comes in. None of Docker-Fabric, Docker-Map, or docker-py appears to inject a base_url starting with http:. On the contrary, all occurrences are replaced with tcp: in docker-py.
Maybe I am misunderstanding some parts of your setup. To reproduce the issue, could you please clarify what Fabric env variables you are setting (e.g. host_string, docker_local_tunnel_local_port, etc.) and how the Docker service is available (e.g. bound to address 127.0.0.1, port 2375, on the SSH host)?
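For illustration, roughly this kind of setup is what I have in mind; the host name and addresses below are placeholders, not values taken from your report:

# fabfile.py (illustrative placeholders only)
from fabric.api import env

env.host_string = 'deploy@build-host'         # SSH host that runs the Docker service
env.docker_local_tunnel_local_port = 22025    # local end of the tunnel, as in your log
# ...with the Docker daemon on that host bound to e.g. 127.0.0.1:2375 or a Unix socket.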
I have now managed to reproduce the issue; it is in fact a side-effect. If the base_url is None, docker-py sets it to a value. When you pass the existing client to a container_fabric call (as you described in #13), this derives a DockerClientConfiguration object from it, which picks up the changed URL.
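To make the mechanism concrete, here is a simplified, hypothetical sketch; the class names are made up and this is not the actual docker-py or Docker-Map code:

# Simplified illustration of the side-effect, not the actual docker-py / Docker-Map code.
class FakeClient(object):
    def __init__(self, base_url=None):
        # docker-py substitutes a concrete address when base_url is None; here the
        # tunneled endpoint from the log above stands in for whatever it picks.
        self.base_url = base_url or 'http://127.0.0.1:22026'

class FakeClientConfiguration(object):
    @classmethod
    def from_client(cls, client):
        cfg = cls()
        cfg.base_url = client.base_url   # copies the already-substituted URL
        return cfg

client = FakeClient()                                 # first client, created with defaults
derived = FakeClientConfiguration.from_client(client)
print(derived.base_url)                               # 'http://127.0.0.1:22026', not None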
As this is intended to be possible, I will try to fix the behavior. In the meantime, you can try two ways to circumvent the issue:
One possibility is to simply skip passing a docker_client to container_fabric; see the sketch below.
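A minimal sketch of that variant (the import path is assumed, and maps again stands for your loaded container map):

# Sketch only: omit docker_client so that container_fabric creates the client
# configuration itself. Import path assumed for docker-fabric.
from dockerfabric.apiclient import container_fabric

container_fabric_inst = container_fabric(container_maps=maps)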
Another way is to create client configurations ahead of time, e.g.:

# Import paths are assumed for docker-fabric / Fabric 1.x; adjust to your installation.
from fabric.api import env
from dockerfabric.apiclient import DockerClientConfiguration, container_fabric

env.docker_clients = {
    '__default__': DockerClientConfiguration(fabric_host='127.0.0.1'),
}
# ...
cc = env.docker_clients['__default__']
docker_client = cc.get_client()   # client instance derived from the configuration
container_fabric_inst = container_fabric(docker_client=cc, container_maps=maps)  # maps: your ContainerMap(s)
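The advantage of the second variant is that the DockerClientConfiguration exists before any client instance does, so a client's tunnel-modified base_url is never copied back into the configuration.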
Got it. So here's my question... how does the first client successfully connect? If I understand you correctly, the second client gets its configuration from the first. When the second client tries to use that configuration, it fails. But somehow the first one connects with a bad configuration?
I appreciate that this could sound counterproductive, but it seems to me that the first connection should fail. That way, the code would be obliged to configure the client in a durably correct way.
The first connection uses default arguments, makes its best guess at the connection type (i.e. socket or TCP), and sets up the appropriate tunnel along with a modified base_url pointing to 127.0.0.1. The second connection picks up the base_url from the existing client instance, but draws the wrong conclusions about the connection type. This occurs on socket as well as TCP connections.
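Roughly speaking, the misclassification works like this (a hypothetical sketch of the guessing logic, not the actual Docker-Fabric code):

# Hypothetical sketch of the connection-type guess; not the real Docker-Fabric logic.
def guess_connection_type(base_url):
    if base_url is None or base_url.startswith('unix:'):
        return 'socket'   # tunnel the local Unix socket (e.g. via socat)
    return 'tcp'          # tunnel a TCP port on the remote host

# First connection: base_url still reflects the user's setting, so the guess
# matches how the Docker service is actually reachable on the remote host.
print(guess_connection_type(None))                      # 'socket'

# Second connection: base_url was already rewritten to the local tunnel endpoint,
# so the guess describes the tunnel rather than the remote service.
print(guess_connection_type('http://127.0.0.1:22026'))  # 'tcp', the wrong premise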
The base_url modification is somewhat inevitable, since it is the purpose of Docker-Fabric that the user does not have to deal with setting up the URL pointing to the correct tunnel. However, ClientConfiguration.from_client should not use the modified values. I will make some changes there.
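What that could look like, as a sketch only (the attribute name for the unmodified URL is hypothetical; this is not the actual change):

# Hypothetical sketch of the intended from_client behavior; not the actual Docker-Map code.
class ClientConfigurationSketch(object):
    base_url = None

    @classmethod
    def from_client(cls, client):
        cfg = cls()
        # Prefer the originally configured URL over the tunnel-rewritten one, so that a
        # derived configuration can set up its own tunnel later. '_original_base_url' is
        # an assumed attribute name for illustration only.
        cfg.base_url = getattr(client, '_original_base_url', None)
        return cfg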
With Docker-Fabric 0.4.1 I changed the process so that the ClientConfiguration is created first, also for the docker_fabric() function. This avoids side-effects of the internal base_url modification. The client instance or client configuration object can now also be re-used in container_fabric().
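For reference, the re-use pattern this enables looks roughly like the following; the import path and the container_maps argument are assumptions carried over from the earlier snippet, with maps again standing for the loaded container map and the Fabric env assumed to be set up already:

# Sketch of re-using one client with Docker-Fabric 0.4.1; import path assumed.
from dockerfabric.apiclient import docker_fabric, container_fabric

client = docker_fabric()                      # the ClientConfiguration is created first now
cf = container_fabric(docker_client=client,   # the same client (or its configuration)
                      container_maps=maps)    # can be re-used without the base_url side-effect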
(To ensure I'm not running into bugs in the CLI code, I switched back to the API. I also rebuilt the virtualenv from scratch so I didn't have lingering code modifications.)
Now I'm seeing the following (sanitized):
As part of my testing, I opened another SSH session on the build machine (while the first was still stuck on the failure) and ran a successful call using the socat port:
I'm not even sure what to check next.