opensearch-project / opensearch-java

Java Client for OpenSearch
Apache License 2.0
115 stars, 181 forks

[BUG] ApacheHttpClient5Transport hangs when content is too large #1092

Open hendryluk opened 1 month ago

hendryluk commented 1 month ago

What is the bug?

When using the Java client with ApacheHttpClient5Transport, a request that exceeds the AWS network limit (e.g. 10MB on m6g.large.search) hangs and permanently blocks the calling thread. In reality, AWS responds with 400 (or with 413 if the request carries a Content-Length header, which is missing in this case, but that's another issue).

This bug only affects ApacheHttpClient5Transport. The legacy RestClientTransport behaves correctly (i.e. throws an exception), and so does curl (i.e. returns a 400 or 413 response).

How can one reproduce the bug?

Steps to reproduce the behavior.

  1. Create an AWS OpenSearch Service domain, with data nodes of an instance-type with 10MB network quota, e.g. m6g.large.search
  2. Make a bulk request >10MB, with the following JUnit test:

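    import java.io.IOException;
    import java.net.URISyntaxException;
    import java.util.Map;
    import java.util.concurrent.TimeUnit;

    import org.apache.hc.client5.http.config.ConnectionConfig;
    import org.apache.hc.client5.http.impl.nio.PoolingAsyncClientConnectionManager;
    import org.apache.hc.client5.http.impl.nio.PoolingAsyncClientConnectionManagerBuilder;
    import org.apache.hc.core5.http.HttpHost;
    import org.junit.Before;
    import org.junit.Test;
    import org.opensearch.client.opensearch.OpenSearchClient;
    import org.opensearch.client.opensearch.core.BulkRequest;
    import org.opensearch.client.transport.httpclient5.ApacheHttpClient5Transport;
    import org.opensearch.client.transport.httpclient5.ApacheHttpClient5TransportBuilder;

    public class BulkLimitTest {
        private static final String OPENSEARCH_URL = "https://my-domain.ap-southeast-2.es.amazonaws.com";
        private static final String INDEX = "sheep-index";
        private OpenSearchClient client;

        @Before
        public void setup() throws IOException, URISyntaxException {
            // 5-second connect and socket timeouts; neither fires when the bug occurs
            ConnectionConfig connectionConfig = ConnectionConfig.custom()
                    .setConnectTimeout(5L, TimeUnit.SECONDS)
                    .setSocketTimeout(5, TimeUnit.SECONDS)
                    .build();
            PoolingAsyncClientConnectionManager connectionManager = PoolingAsyncClientConnectionManagerBuilder
                    .create()
                    .setDefaultConnectionConfig(connectionConfig)
                    .build();
            ApacheHttpClient5Transport transport = ApacheHttpClient5TransportBuilder.builder(HttpHost.create(OPENSEARCH_URL))
                    .setHttpClientConfigCallback(builder -> builder.setConnectionManager(connectionManager))
                    .build();

            client = new OpenSearchClient(transport);
            // Start from a clean, empty index
            client.indices().delete(d -> d.index(INDEX).ignoreUnavailable(true));
            client.indices().create(r -> r.index(INDEX));
        }

        @Test
        public void testBulkLimit() throws IOException {
            // final int SIZE_MB = 8;  // ---> this works
            final int SIZE_MB = 10;    // ---> this hangs
            BulkRequest request = generateRequest(SIZE_MB * 1024 * 1024);
            client.bulk(request);
        }

        // Builds a bulk request whose documents add up to roughly `size` characters of content
        private BulkRequest generateRequest(int size) {
            BulkRequest.Builder builder = new BulkRequest.Builder();
            final String SAMPLE = """
                Lorem ipsum dolor sit amet, consectetur adipiscing elit. Sed efficitur in metus quis lacinia.
                Nullam vel blandit lacus. Nam ornare purus nibh, et varius nunc finibus non.
            """;

            int unitSize = SAMPLE.length();
            for (int i = 0, s = 0; s < size; i++, s += unitSize) {
                String id = "sheep-" + i;
                builder.operations(o -> o.index(d -> d
                        .index(INDEX)
                        .id(id)
                        .document(Map.of("content", SAMPLE))));
            }
            return builder.build();
        }
    }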

What is the expected behavior?

The client throws an OpenSearchException or ResponseException carrying the 400 or 413 error.

What is the actual behavior?

The thread hangs forever. No exceptions, no timeouts.
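Until the hang is fixed, a caller can at least bound how long it waits. The following is a minimal sketch using only the JDK; `TimeoutGuard` and `callWithTimeout` are hypothetical names, not part of opensearch-java. It only unblocks the caller: the worker thread running the stuck request may stay blocked in the background.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

/** Hypothetical helper (not part of opensearch-java): bounds the wait on a blocking call. */
public final class TimeoutGuard {
    private TimeoutGuard() {}

    public static <T> T callWithTimeout(Callable<T> call, long timeout, TimeUnit unit)
            throws Exception {
        // Run the call on a daemon thread so a stuck call cannot keep the JVM alive.
        ExecutorService executor = Executors.newSingleThreadExecutor(r -> {
            Thread t = new Thread(r, "timeout-guard");
            t.setDaemon(true);
            return t;
        });
        Future<T> future = executor.submit(call);
        try {
            return future.get(timeout, unit);
        } catch (TimeoutException e) {
            // Best effort only: interrupts the worker, which may or may not react.
            future.cancel(true);
            throw e;
        } catch (ExecutionException e) {
            // Surface the original exception thrown by the call where possible.
            Throwable cause = e.getCause();
            if (cause instanceof Exception) {
                throw (Exception) cause;
            }
            throw e;
        } finally {
            executor.shutdownNow();
        }
    }
}
```

With the test above, the call would look like `TimeoutGuard.callWithTimeout(() -> client.bulk(request), 30, TimeUnit.SECONDS)`, which would throw a `TimeoutException` instead of blocking forever. The 30-second value is an arbitrary assumption.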

What is your host/environment?

Do you have any screenshots?

No. See the JUnit code above.

Do you have any additional context?

Only affects ApacheHttpClient5Transport. RestClientTransport, curl, Postman, etc. all behave correctly, i.e. they produce 400 or 413 errors.
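Independently of the transport bug, one possible client-side mitigation is to keep each bulk request below the quota in the first place. The sketch below greedily groups already-serialized documents by their UTF-8 size; `BulkBatcher` and `split` are hypothetical names, and the byte limit is an assumption that should leave headroom, since a real bulk body also contains an action line and newlines per document.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper (not part of opensearch-java): groups documents under a byte budget. */
public final class BulkBatcher {
    private BulkBatcher() {}

    /**
     * Greedily groups serialized documents so that each group's total UTF-8 size
     * stays at or below maxBytes. A single document larger than maxBytes ends up
     * in a group of its own.
     */
    public static List<List<String>> split(List<String> docs, long maxBytes) {
        List<List<String>> batches = new ArrayList<>();
        List<String> current = new ArrayList<>();
        long currentBytes = 0;
        for (String doc : docs) {
            long size = doc.getBytes(StandardCharsets.UTF_8).length;
            // Close the current group if adding this document would exceed the budget.
            if (!current.isEmpty() && currentBytes + size > maxBytes) {
                batches.add(current);
                current = new ArrayList<>();
                currentBytes = 0;
            }
            current.add(doc);
            currentBytes += size;
        }
        if (!current.isEmpty()) {
            batches.add(current);
        }
        return batches;
    }
}
```

Each resulting group would then be sent as its own bulk request. This avoids hitting the quota but does not address the hang itself.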

dblock commented 1 month ago

Looks like a bug :( Do you think you can try to turn your repro into a (failing) unit test with a server mock response?

reta commented 1 month ago

The issue really does seem to be related to the quota setting (at least, it is not reproducible locally):

> Create an AWS OpenSearch Service domain, with data nodes of an instance-type with 10MB network quota, e.g. m6g.large.search

@dblock I may need your help here: could we have an AWS OpenSearch Service domain for a few hours to troubleshoot the issue?

dblock commented 1 month ago

@reta I can help offline, yes.

hendryluk commented 1 month ago

> Do you think you can try to turn your repro into a (failing) unit test with a server mock response?

@dblock Unfortunately it's not easily reproducible with a mock server that returns a specific error response. The bug only occurs with the specific low-level network behavior AWS exhibits when the network quota is reached: the sequence of events differs from the normal one, e.g. AWS closes the request stream (preventing the client from sending more data) before writing the status to the response.

So I think the best way to test this bug is against actual AWS.

dblock commented 1 month ago

Force-closing the request stream in an intercepted call is usually a solution, but I am saying this without having tried it.

reta commented 1 month ago

@hendryluk @dblock interesting, the issue is related to the HTTP protocol version that gets negotiated:

Setting HttpVersionPolicy for the client fixes the issue, but I am not sure where exactly the problem is: OS 2.x does not support HTTP/2.0, so why is the protocol negotiated this way? Is there a gateway or LB in front?

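    // Workaround: force HTTP/1.1 so that HTTP/2 is never negotiated
    ApacheHttpClient5Transport transport = ApacheHttpClient5TransportBuilder.builder(HttpHost.create(OPENSEARCH_URL))
        .setHttpClientConfigCallback(builder -> builder.setVersionPolicy(HttpVersionPolicy.FORCE_HTTP_1))
        .build();

reta commented 1 month ago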

A bit more on that: the behaviour varies depending on whether the payload is chunked (buffer size) or not (sent at once); 400 vs 413 is returned.
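To check which protocol version the endpoint in front of a domain actually negotiates, a small probe with the JDK's own `java.net.http.HttpClient` may help. This is a sketch under assumptions: `ProtocolProbe` and its methods are hypothetical names, the timeouts are arbitrary, and the JDK client is a different implementation from Apache HttpClient 5, so it only shows what the endpoint offers, not how the transport in this issue behaves.

```java
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.time.Duration;

/** Hypothetical diagnostic helper: reports the HTTP version an endpoint negotiates. */
public final class ProtocolProbe {
    private ProtocolProbe() {}

    /** Builds a JDK HTTP client that prefers the given protocol version. */
    public static HttpClient newClient(HttpClient.Version preferred) {
        return HttpClient.newBuilder()
                .version(preferred)
                .connectTimeout(Duration.ofSeconds(5))
                .build();
    }

    /** Sends a GET and returns the protocol version the response was received over. */
    public static HttpClient.Version negotiatedVersion(HttpClient client, URI endpoint)
            throws IOException, InterruptedException {
        HttpRequest request = HttpRequest.newBuilder(endpoint)
                .timeout(Duration.ofSeconds(10))
                .GET()
                .build();
        // The body is irrelevant here, so it is discarded.
        HttpResponse<Void> response = client.send(request, HttpResponse.BodyHandlers.discarding());
        return response.version();
    }
}
```

Calling `negotiatedVersion(newClient(HttpClient.Version.HTTP_2), URI.create(OPENSEARCH_URL))` against the domain should report `HTTP_2` if something in front of it negotiates HTTP/2, and `HTTP_1_1` otherwise. An authentication error status does not matter for this purpose, since the version is reported either way.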

hendryluk commented 1 month ago

Thanks @reta for looking into the issue and for the workaround - we'll give that a try.