gtagency / buzzmobile

An autonomous parade float/vehicle
MIT License

test_controller still hangs sometimes #192

Open irapha opened 7 years ago

irapha commented 7 years ago

On a fresh install, it hangs on the first test run, then passes on all subsequent runs.

I'm marking this low priority because Josh now has Google to worry about and because it does work most of the time. But I'd like to at least know why this happens.

joshuamorton commented 7 years ago

Sahit seemed to have a similar issue; it seems that things hang when you connect to an already-publishing node. No idea why, unfortunately.

joshuamorton commented 7 years ago

Is this still the case with test_controller? I can't test it.

irapha commented 7 years ago

@chsahit ^

joshuamorton commented 7 years ago

@chsahit bumperoni. I don't believe it is having issues on CI, but I'd still like to know.

joshuamorton commented 7 years ago

@jgkamat

Ok, so this seems to fail almost every time on CI, but works 100% of the time locally. Because we're on docker, sshing into the CI environment is practically useless, since everything is done inside of a docker container created inside of the CI environment. I'm trying to run locally to see if I can repro, but I run into the following:

As part of our install script, I delete a symlink (https://github.com/gtagency/buzzmobile/blob/master/install#L44) created as part of the virtualenv creation process. This works as part of the install-at-startup, and works when run on Circle, but when I run locally, runtests.sh fails because this symlink doesn't exist. If I skip the step, it fails moments later on something else. Any ideas?
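The failure mode above (the delete step aborting when the symlink was never created) suggests guarding the removal so it is a no-op when the link is absent. A minimal sketch of that guard, written in Python for illustration (the real install script is shell, and the actual path at install#L44 is not reproduced here; the path below is a placeholder):

```python
# Illustrative guarded symlink removal; `some_link` is a hypothetical path,
# not the real one deleted by the install script.
import os

def remove_symlink_if_present(path):
    """Delete `path` only when it is actually a symlink, so a fresh local
    checkout (where virtualenv never created it) doesn't fail the script."""
    if os.path.islink(path):
        os.unlink(path)
        return True
    return False

# Safe to call whether or not the link exists:
remove_symlink_if_present(os.path.expanduser("~/catkin_ws/src/buzzmobile/some_link"))
```

The shell equivalent would be an `if [ -L "$link" ]` test around the `rm`, which would make the step behave the same on a fresh local run as it does on Circle.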

jgkamat commented 7 years ago

Hmm, I can try to take a look at this on Tuesday at the earliest.

You can definitely still SSH in via the CircleCI SSH feature; it's just a bit more complicated. SSH in, then run `docker ps` to find the running container that's hanging, and then run `docker exec -ti <container sha> bash` and you should get a shell up.

I'm a little confused as to why the symlink would exist in the remote containers but not the local ones. Maybe there's something slightly different about the environment variables being passed in, or something?

I definitely noticed this problem earlier, but I didn't try to debug it. I have a feeling it has to do with a clean install of buzzmobile (if you get past one run, it suddenly becomes fine). I'll get back to you soon and let you know what I find out.

joshuamorton commented 7 years ago

I can't; it fails with `Error response from daemon: Unsupported: Exec is not supported by the lxc driver`, which led down a fun rabbit hole that was ultimately unsuccessful :/

Apparently Circle doesn't use normal docker.

joshuamorton commented 7 years ago

Alright, now to bamboozle everyone even more: I can't repro this in a local docker container.

I ran, on my host machine,

  1. sudo docker create -i -t arbitrary_value bash (arbitrary_value here is the name of the docker image from buildbaseimage.sh, I believe; basically, in our case it's an empty docker container with the empty folder ~/catkin_ws/src/buzzmobile defined).
  2. sudo docker start -a -i <THE_HASH_PRINTED>

Then, in the docker image

  1. cd .. (to ~/catkin_ws/src)
  2. git clone https://github.com/gtagency/buzzmobile.git
  3. cd buzzmobile
  4. ./install (this succeeded, unlike trying to docker run last week) (!!?????)
  5. rosrun (failed, as expected)
  6. pytest (failed, as expected)
  7. source bin/activate
  8. ci_scripts/style
  9. ci_scripts/unittest (passed, did not hang)
  10. deactivate
  11. ci_scripts/unittest (passed, did not hang)
  12. ci_scripts/simulation (failed with ImportError: No module named googlemapskey as tracked in #112)

I'm fully bamboozled here, because it is quite clearly hanging on CI, but it's not hanging here. As an attempted fix, I guess we could try to have the install step run git clone instead of anything else, but that doesn't seem any better.

jgkamat commented 7 years ago

Yup, I can't seem to reproduce it locally on a single machine (after one run).

Since it's flaky (not happening every time) and it happens on CI but not locally, this smells like a deadlock/race condition, which is somehow made worse when you have less parallelization (CI only has 2 threads).

It's possible some subtle change to CircleCI caused this to start triggering now. I have no idea how ros works though, so it might not be related.

I think this is a semi-recent change though, since d5aef13 builds fine (at least on one try; you could try rebuilding it to see if it happens consistently). It seems to fail once I merge master in, though.

I actually noticed this a long time ago, but I thought it was something up with your testing suite, so I didn't comment.

I would try disabling one of the two tests you're running (one at a time) and see if that helps narrow down the problem. I would start with test_controller, since the test I ran just now seems to be hanging on that (but I don't know how pytest output works, so I might be wrong).

joshuamorton commented 7 years ago

I'm still leaning towards it being an issue in pyrostest, if only because https://circleci.com/gh/gtagency/buzzmobile/596 passes.

My guess would be that there's something that can spin if the context managers aren't being used correctly, but I'm not sure why exactly that would be the case.
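The "context managers aren't being used correctly" guess above refers to a general pattern worth spelling out: a test harness context manager guarantees teardown even when the test body raises, whereas manual start/stop calls can leave a node spinning. A hypothetical sketch of that pattern (this is not pyrostest's actual code; `mock_node` is an invented stand-in):

```python
# Hypothetical sketch of the kind of context manager a test harness like
# pyrostest might provide: teardown runs even if the test body raises, so a
# half-started node can't be left spinning and block the next test.
from contextlib import contextmanager

@contextmanager
def mock_node(name):
    node = {"name": name, "running": True}   # stand-in for launching a ROS node
    try:
        yield node
    finally:
        node["running"] = False              # guaranteed shutdown, even on error

# Correct usage: shutdown happens automatically when the block exits.
with mock_node("controller") as node:
    assert node["running"]
assert not node["running"]
```

If a caller instead started the node and never reached the stop call (say, an assertion failed first), the leaked node could keep publishing and explain hangs on the next connection attempt.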

joshuamorton commented 7 years ago

Ok, never mind. I added an additional test in https://github.com/gtagency/pyrostest/pull/26, and that works just fine. So yeah, this appears to be a weird CI issue.

joshuamorton commented 7 years ago

Added some new things to pyrostest that should mitigate this. In testing it appeared to make the tests fail early instead of hanging, so that's nice.
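"Fail early instead of hanging" typically means bounding the blocking wait with a timeout. A generic sketch of that idea (an assumption about the shape of the fix, not pyrostest's actual implementation; `wait_for_message` is an invented helper):

```python
# Generic fail-early pattern: wait for the publisher with a timeout instead
# of blocking forever, so a wedged node becomes a quick test failure.
import threading

def wait_for_message(event, timeout=2.0):
    """Raise instead of hanging if no message arrives within `timeout` seconds."""
    if not event.wait(timeout):
        raise TimeoutError("no message received; node may be wedged")
    return True

msg_received = threading.Event()

# Simulate a publisher that delivers shortly after the subscriber starts waiting.
threading.Timer(0.1, msg_received.set).start()
wait_for_message(msg_received)           # returns promptly

# A publisher that never delivers now fails fast rather than deadlocking the suite.
try:
    wait_for_message(threading.Event(), timeout=0.2)
except TimeoutError:
    print("failed early instead of hanging")
```

On a 2-thread CI box this is exactly the difference between a red build in seconds and a job that sits until the CI-level timeout kills it.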