paullouisageneau / libjuice

JUICE is a UDP Interactive Connectivity Establishment library
Mozilla Public License 2.0

What does it mean when a connection can only be established if one side sends a message first? #259

Closed: LaterBird closed this issue 4 months ago

LaterBird commented 4 months ago

I think I understand the principle of P2P UDP hole punching. My testing environment consists of two hosts, Host A behind NAT_A and Host B behind NAT_B.

My understanding of UDP hole punching is as follows:

  1. Host A sends a packet to the public IP and port of Host B. This creates a mapping on NAT_A, so when packets from Host B reach NAT_A, they will be forwarded to Host A. Otherwise, packets from Host B to Host A will be discarded.
  2. Host B sends a packet to the public IP and port of Host A, creating a mapping on NAT_B. This allows packets from Host A to be forwarded to Host B when they reach NAT_B. Otherwise, packets from Host A to Host B will be discarded.
  3. Once both Host A and Host B send packets to each other's public IP and port, mappings are established on both NAT_A and NAT_B. Subsequently, packets sent between Host A and Host B can reach their destinations.
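The three steps above can be illustrated with a toy model of the filtering side of a port-restricted cone NAT. This is only a sketch for reasoning about the order of packets, not how any real NAT or libjuice is implemented; the names `toy_nat_t`, `toy_nat_outbound` and `toy_nat_inbound` are made up for this example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_PERMITS 16

/* A remote public endpoint (address and port). */
typedef struct {
    uint32_t ip;
    uint16_t port;
} endpoint_t;

/* Toy port-restricted cone NAT (hypothetical model): it remembers every
 * remote endpoint the inside host has sent to, and forwards inbound
 * packets only when they come from one of those endpoints. */
typedef struct {
    endpoint_t permits[MAX_PERMITS];
    size_t count;
} toy_nat_t;

/* Inbound packet: forwarded only if the inside host already sent to
 * exactly this remote address and port, otherwise dropped. */
static bool toy_nat_inbound(const toy_nat_t *nat, endpoint_t remote) {
    assert(nat != NULL);
    for (size_t i = 0; i < nat->count; ++i) {
        if (nat->permits[i].ip == remote.ip && nat->permits[i].port == remote.port)
            return true;
    }
    return false;
}

/* Outbound packet: records the destination so that replies from it pass.
 * Returns false only when the toy table is full. */
static bool toy_nat_outbound(toy_nat_t *nat, endpoint_t remote) {
    assert(nat != NULL);
    if (toy_nat_inbound(nat, remote))
        return true; /* hole already open */
    if (nat->count == MAX_PERMITS)
        return false;
    nat->permits[nat->count++] = remote;
    return true;
}
```

In this model the order of the two outbound packets does not matter: a packet that arrives before the hole exists is simply dropped, and a later retransmission passes once both sides have sent. The behavior reported below differs from this model, which is what makes it surprising.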

I use the following test code and perform the following operations:

  1. Run the program on both Host A and Host B. If I first manually enter Host B's candidate address on Host A, and then enter Host A's candidate address on Host B, P2P communication can be established.
  2. Run the program on both Host A and Host B. If I first manually enter Host A's candidate address on Host B, and then enter Host B's candidate address on Host A, P2P communication cannot be established.

NAT_A is in city A, NAT_B is in city B, and both hosts detect their NATs as port-restricted cone NATs. Here is the test code:

#include "juice/juice.h"
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

#include <unistd.h> // for sleep

#define BUFFER_SIZE 4096

static juice_agent_t *agent;

static void on_state_changed(juice_agent_t *agent, juice_state_t state, void *user_ptr);
static void on_candidate(juice_agent_t *agent, const char *sdp, void *user_ptr);
static void on_gathering_done(juice_agent_t *agent, void *user_ptr);
static void on_recv(juice_agent_t *agent, const char *data, size_t size, void *user_ptr);

int main() {
    juice_set_log_level(JUICE_LOG_LEVEL_DEBUG);

    // Create agent
    juice_config_t config;
    memset(&config, 0, sizeof(config));

    // STUN server example
    config.stun_server_host = "stun.l.google.com";
    config.stun_server_port = 19302;

    config.cb_state_changed = on_state_changed;
    config.cb_candidate = on_candidate;
    config.cb_gathering_done = on_gathering_done;
    config.cb_recv = on_recv;
    config.user_ptr = NULL;

    agent = juice_create(&config);

    // Generate local description
    char sdp[JUICE_MAX_SDP_STRING_LEN];
    juice_get_local_description(agent, sdp, JUICE_MAX_SDP_STRING_LEN);
    printf("Local description:\n%s\n", sdp);

    // Wait for remote SDP input
    printf("Enter remote SDP (type 'end' on a new line to finish):\n");
    char remote_sdp[JUICE_MAX_SDP_STRING_LEN] = "";
    char line[JUICE_MAX_SDP_STRING_LEN];
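    while (1) {
        // Stop on EOF or read error as well as on the 'end' marker
        if (fgets(line, sizeof(line), stdin) == NULL || strcmp(line, "end\n") == 0) {
            break;
        }
        // Append without overflowing remote_sdp
        strncat(remote_sdp, line, sizeof(remote_sdp) - strlen(remote_sdp) - 1);
    }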

    juice_set_remote_description(agent, remote_sdp);

    // Gather candidates
    juice_gather_candidates(agent);

    // Wait for remote candidate input (*************A add first, P2P is successful; B add first, P2P failed!!!!)
    printf("Enter remote candidate (type 'end' on a new line to finish):\n");
    while (1) {
        // Stop on EOF or read error as well as on the 'end' marker
        if (fgets(line, sizeof(line), stdin) == NULL || strcmp(line, "end\n") == 0) {
            break;
        }
        juice_add_remote_candidate(agent, line);
    }

    // Simultaneously send an initial packet to the peer to trigger NAT mapping
    const char *init_message = "Initial packet";
    for (int i = 0; i < 5; ++i) {
        juice_send(agent, init_message, strlen(init_message));
        sleep(1);
    }

    // Main loop to keep the program running
    while (1) {
        sleep(1);
    }

    juice_destroy(agent);

    return 0;
}

static void on_state_changed(juice_agent_t *agent, juice_state_t state, void *user_ptr) {
    printf("State: %s\n", juice_state_to_string(state));

    if (state == JUICE_STATE_CONNECTED) {
        const char *message = "Hello from client";
        juice_send(agent, message, strlen(message));
    }
}

static void on_candidate(juice_agent_t *agent, const char *sdp, void *user_ptr) {
    printf("Candidate: %s\n", sdp);
}

static void on_gathering_done(juice_agent_t *agent, void *user_ptr) {
    printf("Gathering done\n");
}

static void on_recv(juice_agent_t *agent, const char *data, size_t size, void *user_ptr) {
    char buffer[BUFFER_SIZE];
    if (size > BUFFER_SIZE - 1)
        size = BUFFER_SIZE - 1;
    memcpy(buffer, data, size);
    buffer[size] = '\0';
    printf("Received: %s\n", buffer);
}

What could be the reason for this? Can you help me identify where the possible issues might be?

LaterBird commented 4 months ago

I came across a blog post describing that if a NAT device adopts a policy of discarding packets from unknown addresses, then there shouldn't be an issue with establishing a P2P connection. Here, "unknown addresses" refer to addresses where your own network has not initiated outbound communication. If the policy is not discarding but rather involves a blacklist mechanism, where the NAT device adds unknown addresses to a deny list upon receiving packets, then during P2P hole punching, packets sent to these unknown addresses will be remapped to a new port because they are already listed in the deny list. This situation essentially transforms the NAT into a symmetric NAT.

To handle this scenario, each side first sends packets with a low TTL (Time To Live), so that they create the mapping on its own NAT device but expire before they reach the peer's NAT. It then sends packets with a normal TTL to complete the hole punch.

Following this theory, I wrote my own test code and did manage to establish connections. libjuice does not seem to handle this scenario, which would explain what I observed during testing: the connection can only be established when A sends packets first. According to the theory above, if NAT_A employs a blacklist mechanism and B's packets arrive at NAT_A before A's packets leave, NAT_A adds B's public address to the blacklist. Then, when A subsequently sends packets to B's public address, they are mapped to a new port, preventing successful connection establishment.

Conclusion: Using TTL to first establish mappings within each respective NAT can avoid the situation I encountered. In my tests, I used a TTL value of 3 to establish mappings. Here is the link to the blog post where I found this information, specifically in item 11. https://rebootcat.com/2021/03/28/p2p_nat_traversal/
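As a rough illustration of the low-TTL step, the sketch below uses plain POSIX UDP sockets, outside of libjuice. The helper names `get_ttl` and `send_with_ttl` are hypothetical, and the right TTL depends on how many hops sit between the host and its own NAT (3 is only the value used in the tests above).

```c
#include <assert.h>
#include <netinet/in.h>
#include <stddef.h>
#include <sys/socket.h>
#include <sys/types.h>

/* Read the socket's current IPv4 TTL, or return -1 on error. */
static int get_ttl(int fd) {
    int ttl = 0;
    socklen_t len = sizeof(ttl);
    if (getsockopt(fd, IPPROTO_IP, IP_TTL, &ttl, &len) < 0)
        return -1;
    return ttl;
}

/* Send one datagram with a temporary TTL, then restore the previous TTL.
 * With a small TTL the packet should cross the local NAT (creating the
 * mapping) and expire before reaching the peer's NAT.
 * Returns the number of bytes sent, or -1 on error. */
static ssize_t send_with_ttl(int fd, const struct sockaddr_in *dst,
                             const void *buf, size_t len, int ttl) {
    assert(dst != NULL && buf != NULL);
    int saved = get_ttl(fd);
    if (saved < 0)
        return -1;
    if (setsockopt(fd, IPPROTO_IP, IP_TTL, &ttl, sizeof(ttl)) < 0)
        return -1;
    ssize_t sent = sendto(fd, buf, len, 0,
                          (const struct sockaddr *)dst, sizeof(*dst));
    /* Restore the original TTL so later packets can reach the peer. */
    if (setsockopt(fd, IPPROTO_IP, IP_TTL, &saved, sizeof(saved)) < 0)
        return -1;
    return sent;
}
```

A hand-written puncher would call `send_with_ttl(fd, &peer, msg, len, 3)` once on each side and then continue with ordinary `sendto()` calls.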

paullouisageneau commented 3 months ago

> If the policy is not discarding but rather involves a blacklist mechanism, where the NAT device adds unknown addresses to a deny list upon receiving packets, then during P2P hole punching, packets sent to these unknown addresses will be remapped to a new port because they are already listed in the deny list. This situation essentially transforms the NAT into a symmetric NAT.

The core issue here seems to be that under some circumstances NAT A is endpoint-dependent (symmetric NAT), which means it is hard to hole-punch. This doesn't look like a blacklist to me, more like a DMZ setup interfering with the mapping, for instance.

> Then, when A subsequently sends packets to B's public address, they are mapped to a new port, preventing successful connection establishment.

This scenario is typical with endpoint-dependent NATs. It is handled by ICE with a peer reflexive candidate provided NAT B does endpoint-independent mapping and filtering (full cone or restricted cone NAT), so it can still connect in scenarios where NAT B is cooperative enough.
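To make the difference between the two mapping behaviors concrete, here is a toy model (not libjuice code, all names hypothetical). With endpoint-independent mapping, the external port learned from the STUN server is the same one the peer sees. With endpoint-dependent mapping, the peer sees a different port, which ICE can only learn as a peer reflexive candidate when a connectivity check actually arrives.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_BINDINGS 8

typedef enum { MAP_ENDPOINT_INDEPENDENT, MAP_ENDPOINT_DEPENDENT } map_policy_t;

/* One NAT binding: the destination it was created for and the external
 * port allocated for it. */
typedef struct {
    uint32_t dst_ip;
    uint16_t dst_port;
    uint16_t ext_port;
} binding_t;

/* Toy mapper for a single inside socket (hypothetical model). */
typedef struct {
    map_policy_t policy;
    uint16_t next_port; /* next external port to hand out */
    binding_t bindings[MAX_BINDINGS];
    size_t count;
} toy_mapper_t;

/* Returns the external port used when the inside socket sends to the
 * given destination, or 0 if the toy table is full. */
static uint16_t toy_map(toy_mapper_t *m, uint32_t dst_ip, uint16_t dst_port) {
    assert(m != NULL);
    for (size_t i = 0; i < m->count; ++i) {
        bool same_dst = m->bindings[i].dst_ip == dst_ip &&
                        m->bindings[i].dst_port == dst_port;
        /* Endpoint-independent: reuse the binding for every destination.
         * Endpoint-dependent: reuse it only for the same destination. */
        if (m->policy == MAP_ENDPOINT_INDEPENDENT || same_dst)
            return m->bindings[i].ext_port;
    }
    if (m->count == MAX_BINDINGS)
        return 0;
    binding_t nb = {dst_ip, dst_port, m->next_port++};
    m->bindings[m->count++] = nb;
    return nb.ext_port;
}
```

In the endpoint-dependent case, the server reflexive candidate A advertises (the port seen by the STUN server) is not the port B's NAT will see, so B's NAT has to accept packets from an unexpected port for the pair to succeed.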

>     // Simultaneously send an initial packet to the peer to trigger NAT mapping
>     const char *init_message = "Initial packet";
>     for (int i = 0; i < 5; ++i) {
>         juice_send(agent, init_message, strlen(init_message));
>         sleep(1);
>     }

I think you misunderstand how ICE works. Application messages are not used for NAT or firewall traversal. The library does everything for you, and you must wait for the connected state before sending messages. juice_send() will always fail if the agent is not in the connected or completed state (note that you don't check the return value here).
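The pattern can be sketched independently of the library as follows. Everything here is a hypothetical stand-in: `peer_state_t` mimics the relevant agent states, and `peer_send_fn` stands for a thin wrapper around juice_send(), assumed to return 0 on success and a negative value on error.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical stand-ins for the agent states that matter here. */
typedef enum { PEER_CONNECTING, PEER_CONNECTED, PEER_COMPLETED, PEER_FAILED } peer_state_t;

/* Hypothetical transport hook: 0 on success, negative on error. */
typedef int (*peer_send_fn)(void *ctx, const char *data, size_t size);

/* Send msg only once connectivity checks have succeeded, and check the
 * result. Returns 0 if sent, 1 if skipped because the peer is not
 * connected yet, -1 if the transport reported an error. */
static int send_when_connected(peer_state_t state, peer_send_fn send_fn,
                               void *ctx, const char *msg) {
    assert(send_fn != NULL && msg != NULL);
    if (state != PEER_CONNECTED && state != PEER_COMPLETED)
        return 1; /* too early: application data cannot punch the hole */
    int ret = send_fn(ctx, msg, strlen(msg));
    if (ret < 0) {
        fprintf(stderr, "send failed: %d\n", ret);
        return -1;
    }
    return 0;
}

/* Demo transport that counts calls through ctx and always succeeds. */
static int counting_send(void *ctx, const char *data, size_t size) {
    (void)data;
    (void)size;
    ++*(int *)ctx;
    return 0;
}

/* Demo transport that always fails. */
static int failing_send(void *ctx, const char *data, size_t size) {
    (void)ctx;
    (void)data;
    (void)size;
    return -1;
}
```

In the test program above, the equivalent change would be to drop the "Initial packet" loop and keep only the send in on_state_changed(), with its return value checked.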

LaterBird commented 3 months ago

> This doesn't look like a blacklist to me, more like a DMZ setup interfering with the mapping, for instance.

My router's DMZ is turned off.

> This scenario is typical with endpoint-dependent NATs. It is handled by ICE with a peer reflexive candidate provided NAT B does endpoint-independent mapping and filtering (full cone or restricted cone NAT), so it can still connect in scenarios where NAT B is cooperative enough.

The information detected by running STUN on the two hosts used for testing is as follows:

A host:
stun stun.l.google.com:19302
STUN client version 0.97
Primary: Independent Mapping, Independent Filter, random port, will hairpin
Return value is 0x000002

B host:
stun stun.l.google.com:19302
STUN client version 0.96
Primary: Independent Mapping, Independent Filter, random port, will hairpin
Return value is 0x000002

Based on information I found online, this output indicates that the NAT type for both hosts is Port Restricted Cone.

> I think you misunderstand how ICE works. Application messages are not used for NAT or firewall traversal. The library does everything for you, and you must wait for connected state to send messages.

Thank you for pointing that out, I indeed misunderstood this part.

Here is the test log: test.log

Thank you for paying attention to my issue. @paullouisageneau

paullouisageneau commented 3 months ago

Thank you for the log. It matches what you observe but doesn't explain the behavior of NAT A.

> The information detected by running STUN on the two hosts used for testing is as follows:
>
> A host: stun stun.l.google.com:19302 STUN client version 0.97 Primary: Independent Mapping, Independent Filter, random port, will hairpin Return value is 0x000002
>
> B host: stun stun.l.google.com:19302 STUN client version 0.96 Primary: Independent Mapping, Independent Filter, random port, will hairpin Return value is 0x000002
>
> Based on the information I found online, it indicates that the NAT type for both hosts is Port Restricted Cone.

I assume you use this test client. If so, it looks unmaintained, since its last update was nearly 10 years ago. The NAT test seems to assume that the STUN server supports NAT behavior discovery attributes like CHANGE-REQUEST and CHANGED-ADDRESS. These attributes were in the deprecated RFC 3489, but they were removed in RFC 5389 and are now part of an extension (RFC 5780). In practice they are rarely supported because they require the server to have multiple IPv4 addresses. In particular, Google's STUN servers do not support them. I guess the client silently fails to run the proper test here, so the result is probably meaningless.
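For reference, this is roughly what such a request looks like on the wire: a Binding request using the RFC 5389 header layout, carrying a CHANGE-REQUEST attribute (type 0x0003) that asks the server to reply from another address and/or port. The helper name `stun_build_change_request` is made up for this sketch, and this is not code from libjuice or from the test client.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define STUN_BINDING_REQUEST     0x0001u
#define STUN_MAGIC_COOKIE        0x2112A442u
#define STUN_ATTR_CHANGE_REQUEST 0x0003u
#define STUN_CHANGE_IP           0x04u /* reply from the other IP address */
#define STUN_CHANGE_PORT         0x02u /* reply from the other port */

/* Write a 16-bit value in network byte order. */
static void put_u16(uint8_t *p, uint16_t v) {
    p[0] = (uint8_t)(v >> 8);
    p[1] = (uint8_t)(v & 0xff);
}

/* Write a 32-bit value in network byte order. */
static void put_u32(uint8_t *p, uint32_t v) {
    put_u16(p, (uint16_t)(v >> 16));
    put_u16(p + 2, (uint16_t)(v & 0xffff));
}

/* Build a Binding request with a CHANGE-REQUEST attribute into out.
 * Returns the number of bytes written (28), or 0 if cap is too small. */
static size_t stun_build_change_request(uint8_t *out, size_t cap,
                                        const uint8_t txid[12], uint32_t flags) {
    assert(out != NULL && txid != NULL);
    const size_t header_len = 20, attr_len = 8;
    if (cap < header_len + attr_len)
        return 0;
    put_u16(out, (uint16_t)STUN_BINDING_REQUEST);
    put_u16(out + 2, (uint16_t)attr_len);  /* length excludes the header */
    put_u32(out + 4, STUN_MAGIC_COOKIE);
    memcpy(out + 8, txid, 12);             /* 96-bit transaction ID */
    put_u16(out + 20, (uint16_t)STUN_ATTR_CHANGE_REQUEST);
    put_u16(out + 22, 4);                  /* attribute value length */
    put_u32(out + 24, flags);              /* change IP / change port bits */
    return header_len + attr_len;
}
```

A server can only honor the change-IP flag if it has a second address to reply from, which is why servers with a single public address generally do not implement it.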