jupyterhub / mybinder.org-deploy

Deployment config files for mybinder.org
https://mybinder-sre.readthedocs.io/en/latest/index.html
BSD 3-Clause "New" or "Revised" License

OVH / mlpack.org firewall issue #2378

Closed rcurtin closed 1 year ago

rcurtin commented 1 year ago

Hi there everyone,

I am trying to track down what appears to be a strange firewall issue that appears only on OVH binder notebook instances. I run the mlpack open-source machine learning library, and many of the examples in our examples repository first fetch data from datasets.mlpack.org. But I am finding specifically that when on an OVH instance (e.g. a notebook running on 51.178.95.56), connections to datasets.mlpack.org (209.195.13.98) simply time out. I've checked the firewall configuration on datasets.mlpack.org and found no issues there; notebooks running on other non-OVH servers seem to be able to connect fine. It seems likely to me that there is some OVH firewall rule blocking access to datasets.mlpack.org.

For an easy reproduction, just start a binder instance that has a shell on OVH, then do something like wget datasets.mlpack.org/ and it will simply time out.
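For anyone without a shell in the notebook, the same check can be sketched in Python. This is only a minimal illustration: the helper name `probe_tcp` and the default timeout are assumptions of this sketch, not part of any Binder or mlpack tooling. It tells a timeout (the symptom described above) apart from a refused connection or a DNS failure.

```python
import socket


def probe_tcp(host, port=80, timeout=5.0):
    """Try to open a TCP connection and report what happened.

    Returns "open" if the connection succeeds, "timeout" if no answer
    arrives within `timeout` seconds, or "error: ..." for anything else
    (connection refused, DNS failure, and so on).
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "open"
    except socket.timeout:
        # No reply at all: consistent with packets being dropped on the way.
        return "timeout"
    except OSError as exc:
        # An active refusal or a name-resolution failure ends up here.
        return f"error: {exc}"
```

Calling `probe_tcp("datasets.mlpack.org", 80)` and `probe_tcp("datasets.mlpack.org", 443)` from a notebook cell should report "timeout" if the instance is affected in the way described here, and "open" otherwise.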

Could someone here help with that---or point me to the right place to get the issue resolved? Thanks so much!

(I originally posted this in the Gitter chat, but @consideRatio suggested I open an issue here instead.)

welcome[bot] commented 1 year ago

Thank you for opening your first issue in this project! Engagement like this is essential for open source projects! :hugs:
If you haven't done so already, check out Jupyter's Code of Conduct. Also, please try to follow the issue template as it helps other community members to contribute more effectively. You can meet the other Jovyans by joining our Discourse forum. There is also an intro thread there where you can stop by and say Hi! :wave:
Welcome to the Jupyter community! :tada:

minrk commented 1 year ago

Thanks for reporting! This appears to be a problem with node-1 in the OVH cluster. I can reproduce the issue with non-user pods on node-1, and can connect to that host from other nodes from within user pods.

Unfortunately, we can't easily cordon node-1 since it's where the ingress controller has to be (for now).

https://github.com/jupyterhub/mybinder.org-deploy/pull/2379 should keep user pods away from node-1 while we work it out. @consideRatio can you have a look if that makes sense?
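One generic way to keep pods off a single node in Kubernetes is a required node anti-match on the hostname label. The sketch below builds such an affinity stanza as a plain Python dict. It is not taken from the linked pull request, which may use a different mechanism. The helper name `avoid_node_affinity` is made up for this example, and it is an assumption that the node's `kubernetes.io/hostname` label is literally `node-1`.

```python
import json


def avoid_node_affinity(hostname):
    """Build a pod affinity stanza that excludes one node by hostname label."""
    return {
        "nodeAffinity": {
            # "required...": the scheduler must not place the pod on a match.
            "requiredDuringSchedulingIgnoredDuringExecution": {
                "nodeSelectorTerms": [
                    {
                        "matchExpressions": [
                            {
                                "key": "kubernetes.io/hostname",
                                "operator": "NotIn",
                                "values": [hostname],
                            }
                        ]
                    }
                ]
            }
        }
    }


# Assumed label value; check the real one with `kubectl get nodes --show-labels`.
print(json.dumps(avoid_node_affinity("node-1"), indent=2))
```

Unlike cordoning, a rule like this only affects pods that carry it, so the ingress controller could stay on node-1 while user pods are scheduled elsewhere.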

@mael-le-gal can you reboot node-1 to see if that fixes the issue?

mael-le-gal commented 1 year ago

@minrk I just rebooted node-1

minrk commented 1 year ago

@mael-le-gal thanks! The issue still appears, so there's something special about node-1 that's preventing egress to 209.195.13.98. Weirdly, most other sites still work, and the same egress destination can be reached from other nodes.

betatim commented 1 year ago

Could it be that there is/was a lot of traffic from node-1 and it got ratelimited on the mlpack side (via a generic rate limiting rule)?

minrk commented 1 year ago

Hm, could be. But I think that's unlikely. I would expect all the nodes to have the same egress IP (not sure about that). I'm not really sure how to debug further.

rcurtin commented 1 year ago

On the mlpack side we don't have any ratelimiting support set up. The system is just some 1U thing I threw in a rack somewhere and manually administrate; no nice proxy or "advanced setup" of any sort in front of it. :) When I was playing with this issue, I disabled all iptables rules on mlpack.org temporarily just to double-check, but there was no change. I also went through all the iptables rules and didn't find any that would block node-1 on either port 80 or 443.

rcurtin commented 1 year ago

I tried this again today, with an outbound IP (from binder) of 51.68.77.249, and the request succeeded. Are there other OVH nodes I can check with? I tried a few times and always found myself with that outbound IP. I'd like to check again with 51.178.95.56 just to be sure the issue is resolved.
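For reference, the outbound IP of a notebook can be looked up with a few lines of Python. This is a rough sketch: the function name `outbound_ip` is made up here, and the use of https://api.ipify.org (a public service that replies with the caller's address as plain text) is an assumption of this example, not something the Binder deployment provides. The `fetch` parameter only exists so the parsing can be exercised without network access.

```python
import ipaddress
import urllib.request


def outbound_ip(url="https://api.ipify.org", timeout=5.0, fetch=None):
    """Return the public IP address that an external echo service sees.

    `fetch` is an optional callable taking a URL and returning the
    response body as text; by default the URL is fetched with urllib.
    """
    if fetch is None:
        def fetch(u):
            with urllib.request.urlopen(u, timeout=timeout) as resp:
                return resp.read().decode("ascii")

    text = fetch(url).strip()
    # Raises ValueError if the service returned something that is not an IP.
    return str(ipaddress.ip_address(text))
```

Running `outbound_ip()` in a fresh session shows which egress address that session got, which makes it easier to tell whether a successful download came from 51.68.77.249 or from 51.178.95.56.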

minrk commented 1 year ago

@rcurtin can you actually try with https://ovh2.mybinder.org ? We are in the process of deploying a whole new cluster for the OVH federation member, so if there are any issues specific to the current cluster, they should go away next week.

rcurtin commented 1 year ago

It seems like everything works from ovh2.mybinder.org. So, I guess, if the old cluster goes away next week, then we can resolve this then. :) Thanks for the help!

minrk commented 1 year ago

Thanks for testing, @rcurtin!