robotpy / pyfrc

python3 library designed to make developing RobotPy-based code easier!
MIT License

Inconsistent 15-second hang when initializing ssh #42

Closed Arhowk closed 8 years ago

Arhowk commented 8 years ago

In my tests with a roboRIO, there are usually two scenarios for deploy:

1) The robot code deploys pretty much fluidly and takes about 2 seconds.
2) The robot code hangs for about 15 seconds between

Deploying to robot at: [hostname]

and

WPILib version on the robot is 2016.0.0

I've narrowed it down to the line (in cli_deploy.py)

controller.ssh(sshcmd)

Looking at that code, for Windows all it does is call plink.exe with the desired args (unless I am mistaken).

Firstly, is there any degree of reproducibility to this issue? It may very well be a Windows-specific issue. I do not believe it to be a computer-specific issue because both my personal Windows 10 laptop and the team's Windows 7 laptops share the same issue.

Secondly, there is this comment before the ssh function

    # This sucks. We should be using paramiko here... 

I'm not well versed in ssh communications... Is this something that can be changed quickly by another developer, or should I find the time to change it over? (Tests are going quite well, so I have a decent bit of down time.)

virtuald commented 8 years ago

If the SSH command is hanging, I'd wager that you're giving it a hostname? (roborio-XXX-FRC.local). That's probably DNS resolution being slow, and there's probably a way to fix it... you can test by running 'ping roborio-XXX-FRC.local' to see if it's slow. Or 'nslookup hostname'. If those are slow, that's your problem.
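The same check can also be timed from Python rather than judged by eye. The following is only a sketch: `time_resolution` is a hypothetical helper name, not part of pyfrc, and port 22 is assumed because the deploy goes over SSH.

```python
import socket
import time


def time_resolution(hostname, port=22):
    """Time a single name lookup.

    Returns (elapsed_seconds, addresses). The address list is empty
    when the lookup fails.
    """
    start = time.perf_counter()
    try:
        infos = socket.getaddrinfo(hostname, port, proto=socket.IPPROTO_TCP)
        # Each entry is (family, type, proto, canonname, sockaddr);
        # sockaddr[0] is the address string.
        addresses = [info[4][0] for info in infos]
    except socket.gaierror:
        addresses = []
    elapsed = time.perf_counter() - start
    return elapsed, addresses
```

Calling `time_resolution("roborio-XXX-FRC.local")` a few times in a row, with the real hostname substituted, would show both how long each lookup takes and which addresses come back.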

Paramiko is a python SSH client. I'd love to use it, but it requires compiled libraries, and I don't want to force users to have a compiler on their system to use RobotPy. So, not worth spending time on.

If you have down time, the big thing that needs to be done for RobotPy is validation/updating of CANTalon and CANJaguar objects for 2016. I'm hoping to get to it this weekend, but sooner is better. :)

Arhowk commented 8 years ago

I know we're using CANTalon (I doubt CANJaguar) but I don't think our SRXs are in yet...

In every scenario, pinging took less than 1 second. In no way did it ever come CLOSE to the ping times shown by the compiler.

Could it be an optional dependency, like how pygame support was added? Those compile times are enough to make our team switch back over to our 8-second deploy times in Java.

virtuald commented 8 years ago

There currently is no compilation happening.

What happens if you execute plink by hand to ssh into the robot?

virtuald commented 8 years ago

Also:

Validation just means making sure that the WPILib Java source and the RobotPy source look reasonably similar, no hardware validation. Given RobotPy's limited personnel resources, most hardware validation happens during the season when someone uses it. :)

Another possibility is that one of the commands being executed by pyfrc is taking a while to execute. I think if you pass -v, then it will print out what it's executing. Perhaps one of those is taking some time that it shouldn't.

virtuald commented 8 years ago

Another thought -- we could probably rewrite it so that it sftp's the files first, then executes all of the commands in one go. We could also combine the python commands into a single command line, as it takes about a second to launch the python interpreter on the RoboRIO.
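A rough sketch of the "one go" idea: joining the remote commands with `&&` means a single SSH session runs them in order and stops at the first failure. `combine_commands` is a hypothetical helper, not pyfrc's code, and it assumes each entry is a simple shell command with no `;` or `||` of its own.

```python
def combine_commands(commands):
    """Join shell commands into one line that runs them in order.

    '&&' stops the chain at the first command that fails, so a later
    step never runs after an earlier one has gone wrong.
    """
    cleaned = [cmd.strip() for cmd in commands if cmd.strip()]
    if not cleaned:
        raise ValueError("no commands to combine")
    return " && ".join(cleaned)
```

With placeholder commands, `combine_commands(["echo first", "echo second"])` yields one string that can be sent over a single connection, instead of paying the connection (and any name lookup) cost once per command.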

If you can't get to it, I'll look at it tomorrow. Shouldn't be that bad.

Arhowk commented 8 years ago

Yes, but, rule of thumb is that code will never work if you don't test it. I'm with my team atm with access to the board, I'll see if I can move around some stuff and whatnot.

Arhowk commented 8 years ago

Okay, I'm really skeptical about the mDNS resolver. I don't know if it's on the router side or the computer side, but two different computers are both giving me the same result:


C:\Users\illid_000>ping roborio-1684-frc.local

Pinging roborio-1684-frc.local [fe80::280:2fff:fe17:6109%13] with 32 bytes of data:
Destination host unreachable.
Destination host unreachable.
Destination host unreachable.
Destination host unreachable.

Ping statistics for fe80::280:2fff:fe17:6109%13:
    Packets: Sent = 4, Received = 0, Lost = 4 (100% loss),

C:\Users\illid_000>ping roborio-1684-frc.local

Pinging roborio-1684-frc.local [10.16.84.59] with 32 bytes of data:
Reply from 10.16.84.59: bytes=32 time=49ms TTL=63
Reply from 10.16.84.59: bytes=32 time=1ms TTL=63
Reply from 10.16.84.59: bytes=32 time=2ms TTL=63
Reply from 10.16.84.59: bytes=32 time=1ms TTL=63

Ping statistics for 10.16.84.59:
    Packets: Sent = 4, Received = 4, Lost = 0 (0% loss),

So, the first time I ping a freshly restarted robot, it gives me some IPv6 address. This address is totally invalid; it doesn't point to anything. If I do the SAME ping again, it gives me an IPv4 address, the same one we had last year, and communication over it is perfectly fine.

I suspect the issue with plink is that it gets hung up on this invalid IPv6 address for a while, thinking that it exists, until it finally gives up after 15 seconds and re-resolves the address of the robot.
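For what it's worth, the first address in the ping output falls in the `fe80::/10` range, which is IPv6 link-local, and the `%13` suffix is a zone (interface) index. Such an address is only reachable on the matching interface, so it may be unreachable rather than nonexistent. A small sketch for telling the two kinds of result apart (`describe_address` is a hypothetical helper name):

```python
import ipaddress


def describe_address(text):
    """Classify an address string as printed by ping, e.g. 'fe80::1%13'."""
    # Drop the zone/interface suffix ("%13") before parsing, since older
    # Python versions reject it.
    address = ipaddress.ip_address(text.split("%", 1)[0])
    if address.version == 6 and address.is_link_local:
        return "IPv6 link-local"
    return f"IPv{address.version}"
```

Feeding it the two addresses from the output above distinguishes the link-local IPv6 result from the routable IPv4 one.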

Is this reproducible in any way, besides on my end? Note that, while I'm using two different computers, I only have one development board.

On a side note, at this point it has nothing to do with pyfrc other than the fact that I'm deploying with pyfrc... I will continue further communications on Chief Delphi to get others' opinions.

Edit: installing Bonjour on my personal laptop (while disabling nimdns) produced the same issues. I believe this to be an issue with the router.

I'm still curious if anyone else has the same symptoms.

virtuald commented 8 years ago

I'm on OSX/Linux most of the time, so I haven't done much (any) testing in Windows this year. Last year I did some testing on Windows, and didn't have any issues. But, they've changed around the mDNS stuff this year, so who knows.

I think I'll still rewrite the pyfrc code to do everything in one go though, it'll shave off a second or three from deploy, so I'll be happier long term.

Last year we had lots of problems with mDNS from OSX (and all of our students had school-issued macs), so we just set the RoboRIO to 10.xx.xx.2, and they deployed by IP and it worked fine.
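The 10.xx.xx.2 pattern is derived from the team number, as the 10.16.84.59 address for team 1684 earlier in the thread suggests. A minimal sketch, assuming team numbers of at most four digits (`team_static_ip` is a hypothetical helper name):

```python
def team_static_ip(team, host=2):
    """Build the 10.TE.AM.x address for a team number.

    The upper digits of the team number form the second octet and the
    last two digits form the third. 'host' defaults to 2, the value
    used for the RoboRIO in the comment above.
    """
    if not 1 <= team <= 9999:  # assumption: 1 to 4 digit team numbers
        raise ValueError(f"unsupported team number: {team}")
    if not 1 <= host <= 254:
        raise ValueError(f"unsupported host octet: {host}")
    return f"10.{team // 100}.{team % 100}.{host}"
```

Deploying against that address skips mDNS entirely, at the cost of configuring a static IP on the robot.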

virtuald commented 8 years ago

FWIW, I haven't had any mDNS issues with the 2016 image, but I'm on OSX/Linux as I said...

virtuald commented 8 years ago

Did you get this resolved?

Arhowk commented 8 years ago

Yes, sorry, reflashing the router's firmware (not the team number) resolved the issue.

virtuald commented 8 years ago

I wonder if it would make sense to try and resolve the address first, and then pass the remaining ssh/sftp calls an IP address instead? Could reduce time for those who are having mDNS issues.
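A sketch of what resolving up front might look like. `resolve_once` is a hypothetical helper, not existing pyfrc code, and preferring IPv4 is an assumption motivated by the ping output earlier in the thread, where the IPv6 result was the unreachable one.

```python
import socket


def resolve_once(hostname, port=22, prefer_ipv4=True):
    """Resolve a hostname a single time and return one address string.

    Raises socket.gaierror if the name cannot be resolved.
    """
    infos = socket.getaddrinfo(hostname, port, proto=socket.IPPROTO_TCP)
    if prefer_ipv4:
        # Stable sort: IPv4 entries move to the front, and the
        # resolver's original order is kept otherwise.
        infos = sorted(infos, key=lambda info: info[0] != socket.AF_INET)
    return infos[0][4][0]
```

The returned string would then replace the hostname in every later ssh/sftp invocation, so only the first step pays for a slow or flaky lookup.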

Arhowk commented 8 years ago

There's only one other SSH step per my testing (or at least one other step that calls plink).

In addition, that step never lagged for me since the mDNS resolver used the last known good address after failing to find that IPv6 address

virtuald commented 8 years ago

I'm having this same issue with the real router, and flashing the firmware didn't solve it. However, it's definitely not a python issue, as it can be demonstrated by executing the ssh command manually.

On OSX, I find that it doesn't cache the IP address, so there is a delay between each set of steps.

Arhowk commented 8 years ago

Are you using Bonjour? (I assume you are...) When I tested it with Bonjour on Windows, I noticed that there was a substantial amount of lag even from the time that it received the bad address. Are you just lagging, or is the router showing the same ping symptoms (returning an IPv6 address that doesn't exist)?