spcl / fmi

Function Message Interface (FMI): library for message-passing and collective communication for serverless functions.
https://mcopik.github.io/projects/fmi/

Evaluate TCPunch on Google Cloud #2

Open mcopik opened 1 year ago

mcopik commented 1 year ago

We have verified that TCPunch works on the AWS cloud. However, it has yet to be established whether the implemented NAT hole punching will work on Google Cloud. This step is necessary to run TCP communication between two different functions.

We should first run this on two VMs to verify that the connection is established, and then try to establish a TCP connection between a VM and a function.
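For the two-VM step, a plain connectivity probe can help separate firewall problems from hole punching problems. The following is a minimal sketch and not part of TCPunch: `run_echo_server` and `probe` are hypothetical helper names, and the payload is arbitrary. One VM runs the echo server, and the other calls `probe` with that VM's address and port.

```python
import socket
import threading


def run_echo_server(listener: socket.socket) -> None:
    """Accept a single connection and echo back everything it sends."""
    conn, _addr = listener.accept()
    with conn:
        while True:
            data = conn.recv(4096)
            if not data:  # peer closed the connection
                break
            conn.sendall(data)


def probe(host: str, port: int, timeout_s: float = 5.0) -> bool:
    """Return True if a TCP connection opens and a payload round-trips."""
    payload = b"fmi-probe"  # arbitrary test payload (assumption)
    try:
        with socket.create_connection((host, port), timeout=timeout_s) as sock:
            sock.sendall(payload)
            received = b""
            while len(received) < len(payload):
                chunk = sock.recv(4096)
                if not chunk:  # connection closed before the full echo
                    return False
                received += chunk
            return received == payload
    except OSError:
        # Covers refused connections, unreachable hosts, and timeouts.
        return False
```

If `probe` fails between two VMs even without NAT traversal, the likely cause is a firewall or security policy rather than the hole punching itself.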

PranayB003 commented 6 months ago

@mcopik I have completed the first part of this issue (establishing a connection between 2 VMs), and my work can be viewed in this repo. I'm working on the second part (establishing a connection between a VM and a function). Please do let me know if you require any changes or have suggestions!

PranayB003 commented 6 months ago

@mcopik I've tested communication between a serverless function and a VM. It seems that the serverless services on GCP (Cloud Functions and Cloud Run) only allow incoming traffic over HTTP, and only on a single port. As a result, the hole punching server's response never reaches the function, and it times out. The outgoing request from the function does reach the hole punching server, though. I think we can get around this issue by letting the client specify which port and protocol it expects a response on when calling pair(). What do you think?

mcopik commented 5 months ago

@PranayB003 Thanks for the update! How did you arrive at the conclusion above? Is it the case that the function opens a connection to the hole punching server, sends a request, but never receives a reply from the server?

That would be a strange setting as it would effectively prevent making any HTTP requests from the function, e.g., to the database.

PranayB003 commented 5 months ago

@mcopik You're right. I came to that conclusion because the server received the client's request and responded, but the client never got the response. I confirmed this through the logs; please refer to these images of the Cloud Run logs and the hole punching server's logs.

[Images: Cloud Run log; hole punching server log (VM)]

Your comment about the function not being able to make HTTP requests has got me thinking too. Logically speaking, there should be a way to get a response back. I'm currently looking this up. Could you please tell me whether you faced any related issues when you first tried TCPunch on AWS? Any other advice is also greatly appreciated!

PranayB003 commented 5 months ago

@mcopik The cloud run instance does receive the reply from the hole punching server, I checked by enabling the debugging statements in TCPunch. Please find below the logs that show this:

[Screenshots: 2024-04-11, 1:48:34 AM and 1:48:06 AM]

It seems the problem is that Cloud Run instances can make outgoing TCP connections (and subsequently send/receive messages on this connection) but cannot accept new incoming TCP connections (on arbitrary ports apart from the one that's open to HTTP requests), which is why the call to pair() keeps waiting to accept a connection from the peer VM and eventually times out.
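The behavior described here, where outgoing connections work but the wait for an incoming peer connection times out, can be reproduced outside TCPunch with a listener that gives up after a deadline. This is a minimal sketch under assumptions: `open_listener` and `wait_for_peer` are hypothetical names, not TCPunch functions, and the real `pair()` implementation may differ.

```python
import socket
from typing import Optional


def open_listener(port: int, host: str = "0.0.0.0") -> socket.socket:
    """Open a listening TCP socket on the given port (0 picks a free port)."""
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    # Allow quick rebinding of the port between test runs.
    listener.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    listener.bind((host, port))
    listener.listen(1)
    return listener


def wait_for_peer(listener: socket.socket, timeout_s: float) -> Optional[socket.socket]:
    """Wait for one incoming connection; return None if none arrives in time."""
    listener.settimeout(timeout_s)
    try:
        conn, _peer = listener.accept()
    except socket.timeout:
        # No peer could reach this port before the deadline.
        return None
    return conn
```

Running `wait_for_peer` inside the Cloud Run instance while the VM attempts to connect shows whether inbound TCP reaches the instance at all: a `None` result on every port tried would match the timeout observed above.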

mcopik commented 5 months ago

@PranayB003 In general, functions cannot accept incoming connections - that's why we need the hole punching :)

On AWS Lambda, we sometimes had issues with the robustness of the TCP connection, but we never had problems with creating the connection. The only important factor was that if you try the VM-TCP connection, the VM needs to have its security policies updated such that it allows all incoming connections on ports, since our hole punching implementation was not restricted to any specific port selection.

You said that Cloud Run instances "cannot accept new incoming TCP connections (on arbitrary ports apart from the one that's open to HTTP requests)". Does it mean that you verified it works if you restrict port selection to the one already open? That might also fail if there's an HTTP server actively polling for new invocations (it might read the incoming TCP data), but it will work if the server does not poll while the function is executing.

PranayB003 commented 5 months ago

@mcopik

> The only important factor was that if you try the VM-TCP connection, the VM needs to have its security policies updated such that it allows all incoming connections on ports

Yep, I've done this in my evaluation too. However, GCP seems to be different from AWS in that the Cloud Run instance is unable to accept incoming connections even after the hole is punched (since the request/response to/from the hole-punching server was successful).

> does it mean that you verified it works if you restrict port selection to the one already open?

Not yet, I was reading the docs and other sources to find out whether subsequent requests to the same Cloud Run instance IP (on the open HTTP port) would be:

I did not find any concrete mention of these mechanisms in the docs or elsewhere, so I'll have to just try it out practically. I have end-semester exams presently, so I haven't been able to devote time to this for the past week. I'll get back to working on this after 2nd May.