threefoldtecharchive / jumpscale9_core

'filedescriptor out of range in select()' error in ssh client #181

Open chrisvdg opened 5 years ago

chrisvdg commented 5 years ago

Trying to install a zboot_ipmi_host service on bancadati returns the following error: (screenshot)

This started happening after I ran a script that sequentially added host services to the zboot robot, without concurrency, about midway through the hosts. Since then it happens for all the hosts.

zaibon commented 5 years ago

@chrisvdg can you tell me how many services are installed on this robot? Or how many there were before it started giving this error?

Also, do you get this error every time you install the service, or is it random?

chrisvdg commented 5 years ago

After it happened once, it keeps happening. I'm guessing it's the reply of the zboot router.

I lost the output of the script so it's hard to tell, somewhere around 150-ish I'd guess.

zaibon commented 5 years ago

ok thanks ;-)

zaibon commented 5 years ago

I did some tests on the robot itself, and I can create more than 200 services without any problem. Now, the service you install is zeroboot_ipmi_host. This service creates an ssh connection during the install and I guess it keeps it open for the lifetime of the service. It could be that we hit the file descriptor limit of the system. What is the ulimit of the machine you run the 0-robot on?
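
To check this from inside the process rather than from a shell, something along these lines could work. It is a minimal sketch, assuming a Linux host where `/proc/self/fd` is available; `fd_usage` is a hypothetical helper name, not something from the 0-robot or jumpscale code:

```python
import os
import resource


def fd_usage():
    """Return (open_fds, soft_limit, hard_limit) for the current process.

    Hypothetical helper for illustration only.
    """
    # RLIMIT_NOFILE is the per-process limit that `ulimit -n` reports.
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    try:
        # Linux-specific: one entry per open descriptor. The count includes
        # the descriptor opened temporarily to list the directory itself.
        open_fds = len(os.listdir("/proc/self/fd"))
    except FileNotFoundError:
        open_fds = None  # no /proc on this platform
    return open_fds, soft, hard


if __name__ == "__main__":
    open_fds, soft, hard = fd_usage()
    print(f"open fds: {open_fds}, soft limit: {soft}, hard limit: {hard}")
```

Logging this before and after each service install would show whether the descriptor count grows with the number of services.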

chrisvdg commented 5 years ago

(screenshot)

zaibon commented 5 years ago

@chrisvdg maybe try to raise it and see if that improves it

chrisvdg commented 5 years ago

@zaibon, suggestions for the new setting?

chrisvdg commented 5 years ago

It doesn't seem to want to take a custom value. (screenshot)

chrisvdg commented 5 years ago

Wiped my zos VM but now I get this from the get-go... (screenshot)

Other services installed just fine... (screenshot)

chrisvdg commented 5 years ago

I'll fully reset my VM, use v1.5.0 zos image and try again...

chrisvdg commented 5 years ago

On retrying https://gist.github.com/chrisvdg/0c821eb283b29ad0a9e80eb4f088d6a6

I got 49 services reporting to alerta (because of the network issue), so 49 ipmi_host services were successfully installed.

I don't think it's the robot that's reaching its file descriptor limit? (screenshots)

zaibon commented 5 years ago

After inspecting the logs of the robot server, the error comes from the ssh library used by the zeroboot_ipmi_service.

Moving this issue to https://github.com/threefoldtech/jumpscale_core since the error comes from the ssh client.

zaibon commented 5 years ago

@rkhamis can you find someone to have a look at this one please?

chrisvdg commented 5 years ago

Found that the kubernetes pod now running the zeroboot robot in place of the VM has a much higher ulimit -n ( open files (-n) 1048576 ), but it would still fail at around the same number of services. I'm assuming it's the file limit of the zeroboot router, which is 1024. Let me see if I can increase it.

delandtj commented 5 years ago

do you pool your uci calls in 1 ssh session, or do you try to ssh 150 times at the same time?
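
By pooling I mean something like the sketch below: one cached connection per host instead of a new one per call. This is only an illustration under assumptions; `ConnectionPool` and the `connect` factory are hypothetical names, and the real ssh client would have to be plugged in as `connect`:

```python
import threading


class ConnectionPool:
    """Reuse one connection per host (hypothetical sketch)."""

    def __init__(self, connect):
        # `connect` is an assumed factory: host -> object with a close() method.
        self._connect = connect
        self._lock = threading.Lock()
        self._connections = {}

    def get(self, host):
        """Return the cached connection for host, creating it on first use."""
        with self._lock:
            conn = self._connections.get(host)
            if conn is None:
                conn = self._connect(host)
                self._connections[host] = conn
            return conn

    def close_all(self):
        """Close every cached connection and forget it, releasing its fds."""
        with self._lock:
            for conn in self._connections.values():
                conn.close()
            self._connections.clear()
```

With this shape, 150 services talking to the same router would share a single session, so the number of open sockets follows the number of distinct hosts rather than the number of services.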

chrisvdg commented 5 years ago

I don't think it's pooled

delandtj commented 5 years ago

either way, uci calls are locked, so it's certainly not in the wrt... Also, this error says it's a LOCAL RuntimeError with not enough fds
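
For context on the error text itself: in CPython, `select.select()` raises `ValueError: filedescriptor out of range in select()` when it is handed a descriptor number at or above `FD_SETSIZE` (1024 on typical Linux builds), independently of `ulimit -n`. That would fit the failures persisting with a limit of 1048576, although nothing in this thread confirms that this is the code path in the ssh library. A minimal sketch of the behaviour and of a `selectors`-based alternative; `select_error_for` and `wait_readable` are hypothetical helper names:

```python
import select
import selectors
import socket


def select_error_for(fd):
    """Return the ValueError message select() gives for fd, or None.

    Only meant for descriptor numbers >= FD_SETSIZE: CPython checks the
    range before calling select(), so the fd does not need to be open.
    """
    try:
        select.select([fd], [], [], 0)
    except ValueError as exc:
        return str(exc)
    return None


def wait_readable(sock, timeout):
    """Return True if sock becomes readable within timeout seconds.

    DefaultSelector picks epoll/kqueue/poll where available, none of which
    has the FD_SETSIZE cap. If it falls back to select(), the cap still applies.
    """
    with selectors.DefaultSelector() as sel:
        sel.register(sock, selectors.EVENT_READ)
        return bool(sel.select(timeout))


if __name__ == "__main__":
    print(select_error_for(5000))
    a, b = socket.socketpair()
    b.send(b"ping")
    print(wait_readable(a, 1.0))
    a.close()
    b.close()
```

If this is what is happening, raising the ulimit cannot help: once the process holds more than about 1024 descriptors, any new socket gets a number that `select()` refuses.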

delandtj commented 5 years ago

ahhnoh.. it's an eco