irapha opened this issue 7 years ago
Sahit seemed to have a similar issue; it seems that things hang when you connect to an already-publishing node. No idea why, unfortunately.
Is this still the case with test_controller? I can't test it.
@chsahit ^
@chsahit bumperoni. I don't believe it is having issues on CI, but I'd still like to know.
@jgkamat
Ok, so this seems to fail almost every time on CI, but works 100% of the time locally. Because we're on docker, sshing into the CI environment is practically useless, since everything is done inside a docker container created within the CI environment. I'm trying to run locally to see if I can repro, but I run into the following:
As part of our install script, I delete a symlink (https://github.com/gtagency/buzzmobile/blob/master/install#L44) that the virtualenv creation process creates. This works during install-at-startup and when run on Circle, but when I run locally, runtests.sh fails because this symlink doesn't exist. If I skip that step, it fails moments later on something else. Any ideas?
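For reference, a guarded delete would sidestep the "symlink doesn't exist" failure entirely. This is just a sketch of the pattern, not the actual install script; the paths below are throwaway stand-ins, not the real virtualenv link:

```python
import os
import tempfile

def remove_symlink_if_present(path):
    """Delete `path` only if it is actually a symlink, so a fresh
    checkout where virtualenv never created it doesn't error out."""
    if os.path.islink(path):
        os.remove(path)
        return True
    return False

# Demo with a throwaway symlink in a temp dir (stand-in names).
tmp = tempfile.mkdtemp()
target = os.path.join(tmp, "real_file")
link = os.path.join(tmp, "local")
open(target, "w").close()
os.symlink(target, link)
print(remove_symlink_if_present(link))  # True: link existed, was removed
print(remove_symlink_if_present(link))  # False: already gone, no error
```

The shell equivalent in the install script would be the usual `[ -L "$p" ] && rm "$p"` guard.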
Hmm, I can try to take a look at this on Tuesday at the earliest.
You can definitely still ssh via the CircleCI ssh feature; it's just a bit more complicated. Ssh in, then run `docker ps` to find the running container that's hanging, then run `docker exec -ti <container sha> bash` and you should get a shell up.
I'm a little bit confused as to why the symlink would exist on the remote containers but not the local ones. Maybe there's something slightly different about the environment variables being passed in or something?
I definitely noticed this problem earlier, but I didn't try to debug it. I have a feeling it has to do with a clean install of buzzmobile (if you get past one run, it suddenly becomes fine). I'll get back to you soon and let you know what I find out.
I can't; it fails with `Error response from daemon: Unsupported: Exec is not supported by the lxc driver`, which led down a fun rabbit hole that was ultimately unsuccessful :/

Apparently Circle doesn't use normal docker.
Alright, now to bamboozle everyone even more: I can't repro this on a local docker container.

I ran, on my host machine, `sudo docker create -i -t arbitrary_value bash`. `arbitrary_value` here is the name of the docker image from buildbaseimage.sh, I believe. Basically, in our case it's an empty docker container with the empty folder `~/catkin_ws/src/buzzmobile` defined. Then:

`sudo docker start -a -i <THE_HASH_PRINTED>`

Then, in the docker image:

- `cd ..` (to `~/catkin_ws/src`)
- `git clone https://github.com/gtagency/buzzmobile.git`
- `cd buzzmobile`
- `./install` (this succeeded, unlike trying to `docker run` last week) (!!?????)
- `rosrun` (failed, as expected)
- `pytest` (failed, as expected)
- `source bin/activate`
- `ci_scripts/style`
- `ci_scripts/unittest` (passed, did not hang)
- `deactivate`
- `ci_scripts/unittest` (passed, did not hang)
- `ci_scripts/simulation` (failed with `ImportError: No module named googlemapskey`, as tracked in #112)

I'm fully bamboozled here, because it is quite clearly hanging on CI, but it's not here. As an attempted fix, I guess we could try to have the install step run `git clone` instead of anything else, but that doesn't seem any better.
Yup, I can't seem to reproduce it locally on a single machine (after one run).
Since it seems flaky (it doesn't happen every time), and it happens on CI but not locally, this smells like a deadlock/race condition, which is somehow made worse with less parallelization (CI only has 2 threads).
It's possible some subtle change to CircleCI caused this to start triggering now. I have no idea how ros works though, so it might not be related.
I think this is a semi-recent change though, since d5aef13 builds fine (at least for one try; you could try rebuilding it to see if it happens consistently). It seems to fail once I merge master in, though.
I actually noticed this back a long time ago but I thought it was something up with your testing suite, so I didn't comment.
I would try disabling one of the two tests you're running (one at a time) to help narrow down the problem. I would start with test_controller, since the test I ran just now seems to be hanging on that (but I don't know how pytest output works, so I might be wrong).
I'm still leaning towards it being an issue in pyrostest, if only because https://circleci.com/gh/gtagency/buzzmobile/596 passes.
My guess would be that there's something that can spin if the context managers aren't being used correctly, but I'm not sure why exactly that would be the case.
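I haven't looked at pyrostest's internals here, so this is purely a guess at the failure mode, with made-up names (`ManagedProc` is hypothetical, not pyrostest API): a context manager that waits on a child process without a timeout during cleanup will spin forever if the child ignores termination.

```python
import subprocess
import sys

class ManagedProc:
    """Hypothetical sketch: a context manager that owns a child process
    and must not wait on it forever during cleanup."""

    def __init__(self, args):
        self.args = args
        self.proc = None

    def __enter__(self):
        self.proc = subprocess.Popen(self.args)
        return self.proc

    def __exit__(self, exc_type, exc, tb):
        self.proc.terminate()
        try:
            # Without the timeout, a child that ignores SIGTERM would make
            # this wait() block forever -- the kind of hang we see on CI.
            self.proc.wait(timeout=5)
        except subprocess.TimeoutExpired:
            self.proc.kill()  # escalate instead of hanging
            self.proc.wait()
        return False

# Demo: a child that would outlive the block if cleanup didn't reap it.
with ManagedProc([sys.executable, "-c", "import time; time.sleep(60)"]) as p:
    pass
print(p.returncode is not None)  # True: process was reaped on exit
```

If the real context managers exit without this kind of bounded wait, a stuck node process would look exactly like what CI shows: the test run just hangs.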
Ok, never mind. I added an additional test in https://github.com/gtagency/pyrostest/pull/26, and that works just fine. So yeah, this appears to be a weird CI issue.
Added some new things to pyrostest that should mitigate this. In testing it appeared to make the tests fail early instead of hanging, so that's nice.
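The actual pyrostest change isn't shown here, but one generic way to turn a hang into an early failure is a deadline wrapper around each blocking test step (all names below are illustrative, not pyrostest's):

```python
import threading

def run_with_deadline(fn, seconds):
    """Run fn in a worker thread; raise TimeoutError instead of
    hanging the whole test run if it doesn't finish in time."""
    result = {}

    def target():
        result["value"] = fn()

    worker = threading.Thread(target=target, daemon=True)
    worker.start()
    worker.join(seconds)
    if worker.is_alive():
        # Daemon thread is abandoned; the test fails fast instead.
        raise TimeoutError("step exceeded %ss deadline" % seconds)
    return result["value"]

print(run_with_deadline(lambda: "ok", 2.0))
try:
    # Event().wait() with no timeout blocks forever: a simulated hang.
    run_with_deadline(lambda: threading.Event().wait(), 0.2)
except TimeoutError as e:
    print("failed early:", e)
```

A failing-early test at least leaves a traceback pointing at the stuck step, which is far more debuggable than a silent CI timeout.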
In a fresh install, it will hang on the first try of testing; then it will pass on all subsequent runs.
I'm marking this low priority because Josh now has Google to worry about, and because it does work most of the time. But I'd still like to at least know why this happens.