neo4j / neo4j-java-driver

Neo4j Bolt driver for Java
Apache License 2.0
329 stars 155 forks source link

Message PULL .. cannot be handled by a session in the READY state #1346

Closed ikwattro closed 1 year ago

ikwattro commented 1 year ago

We have the following piece of driver usage :

@Test
    public void test_transaction_timeout() {
        try(Driver driver = GraphDatabase.driver("neo4j://neo4j-replica-001.xx.xxx", AuthTokens.basic("user", "pass"))) {
            try (Session session = driver.session()) {
                TransactionConfig config = TransactionConfig.builder()
                        .withTimeout(Duration.of(15000, ChronoUnit.MILLIS))
                        .build();
                try (Transaction tx = session.beginTransaction(config)) {
                    tx.run("CALL apoc.util.sleep(30000)");
                    tx.commit();
                }
            }
        }
    }

One time on two, we receive different error messages

One message which makes sense :

org.neo4j.driver.exceptions.ClientException: The transaction has been terminated. Retry your operation in a new transaction, and you should see a successful result. The transaction has not completed within the specified timeout (dbms.transaction.timeout). You may want to retry with a longer timeout. 

The second is a bit more cryptic

org.neo4j.driver.exceptions.ClientException: Message 'PULL Map{qid -> Long(0), n -> Long(1000)}' cannot be handled by a session in the READY state.

The query is executed against a replica or a core ( result is same ) of a Neo4j 4.4.14 causal cluster ( 3 cores + 1 replica with SSR enabled ).

The following neo4j-java-driver versions produce the same result :

Is it something expected or is our piece of logic wrong ?

grzegorz-aniol commented 1 year ago

:+1:

grzegorz-aniol commented 1 year ago

I can confirm the issue. The error returned, after the timeout, seems to be random. I created unit test, with several different ways how the query can be run, with implicit and explicit transaction. Very often the query doesn't fail with expected Neo.ClientError.Transaction.TransactionTimedOut, but with different error e.g. Neo.ClientError.Request.Invalid with message e.g. Message 'PULL Map{qid -> Long(1), n -> Long(700)}' cannot be handled by a session in the READY state.

@Slf4j
public class TimeoutInTxTest {

    private static final String URL = "neo4j://........";
    private static final String DATABASE = ".......";
    private static final AuthToken AUTH = AuthTokens.basic(System.getenv("NEO4J_USER"), System.getenv("NEO4J_PASSWORD"));
    private static final int TIMEOUT = 3_000;
    private static final String TIMEOUT_ERROR_CODE = "Neo.ClientError.Transaction.TransactionTimedOut";

    // Test fails almost every time with error code: Neo.ClientError.Request.Invalid
    @RepeatedTest(value = 7)
    public void timeout_for_explicit_tx_1() {
        try (Driver driver = GraphDatabase.driver(URL, AUTH)) {
            try (Session session = driver.session(SessionConfig.builder().withDatabase(DATABASE).build())) {
                TransactionConfig config = TransactionConfig.builder()
                        .withTimeout(Duration.of(TIMEOUT, ChronoUnit.MILLIS))
                        .build();
                //session.run("create (p:Test) return p");
                try (Transaction tx = session.beginTransaction(config)) {
                    tx.run(format("CALL apoc.util.sleep(%d)", 2 * TIMEOUT));
                    tx.commit();
                }
                Assertions.fail("Should fail with timeout");
            }
        } catch (ClientException ex) {
            log.error("{}, {}", ex.code(), ex.getMessage());
            assertThat(ex.code()).isEqualTo(TIMEOUT_ERROR_CODE);
        }
    }

    // This test fails every 2nd time (with Neo.ClientError.Request.Invalid)
    @RepeatedTest(value = 7)
    public void timeout_for_explicit_tx_2() {
        try (Driver driver = GraphDatabase.driver(URL, AUTH)) {
            try (Session session = driver.session(SessionConfig.builder().withDatabase(DATABASE).build())) {
                TransactionConfig config = TransactionConfig.builder()
                        .withTimeout(Duration.of(TIMEOUT, ChronoUnit.MILLIS))
                        .build();
                session.run("create (p:Test) return p");
                try (Transaction tx = session.beginTransaction(config)) {
                    tx.run(format("CALL apoc.util.sleep(%d)", 2 * TIMEOUT));
                    tx.commit();
                }
                Assertions.fail("Should fail with timeout");
            }
        } catch (ClientException ex) {
            log.error("{}, {}", ex.code(), ex.getMessage());
            assertThat(ex.code()).isEqualTo(TIMEOUT_ERROR_CODE);
        }
    }

    // This test fails every 2nd time (with Neo.ClientError.Request.Invalid)
    @RepeatedTest(value = 7)
    public void timeout_for_explicit_tx_3() {
        try (Driver driver = GraphDatabase.driver(URL, AUTH)) {
            try (Session session = driver.session(SessionConfig.builder().withDatabase(DATABASE).build())) {
                TransactionConfig config = TransactionConfig.builder()
                        .withTimeout(Duration.of(TIMEOUT, ChronoUnit.MILLIS))
                        .build();
                session.run("create (p:Test) return p");
                try (Transaction tx = session.beginTransaction(config)) {
                    tx.run("UNWIND range(0,1000000) AS i RETURN i").stream().forEach(r -> {});
                    tx.commit();
                }
                Assertions.fail("Should fail with timeout");
            }
        } catch (ClientException ex) {
            log.error("{}, {}", ex.code(), ex.getMessage());
            assertThat(ex.code()).isEqualTo(TIMEOUT_ERROR_CODE);
        }
    }

    // This test fails every 2nd time (with Neo.ClientError.Request.Invalid)
    @RepeatedTest(value = 7)
    public void timeout_for_explicit_tx_4() {
        try (Driver driver = GraphDatabase.driver(URL, AUTH)) {
            try (Session session = driver.session(SessionConfig.builder().withDatabase(DATABASE).build())) {
                TransactionConfig config = TransactionConfig.builder()
                        .withTimeout(Duration.of(TIMEOUT, ChronoUnit.MILLIS))
                        .build();
                try (Transaction tx = session.beginTransaction(config)) {
                    tx.run("create (p:Test) return p");
                    tx.run("UNWIND range(0,1000000) AS i RETURN i").stream().forEach(r -> {});
                    tx.commit();
                }
                Assertions.fail("Should fail with timeout");
            }
        } catch (ClientException ex) {
            log.error("{}, {}", ex.code(), ex.getMessage());
            assertThat(ex.code()).isEqualTo(TIMEOUT_ERROR_CODE);
        }
    }

    // This test always fail with Neo.ClientError.Request.Invalid
    @RepeatedTest(value = 7)
    public void timeout_for_implicit_tx_1() {
        try (Driver driver = GraphDatabase.driver(URL, AUTH)) {
            try (Session session = driver.session(SessionConfig.builder().withDatabase(DATABASE).build())) {
                TransactionConfig config = TransactionConfig.builder()
                        .withTimeout(Duration.of(TIMEOUT, ChronoUnit.MILLIS))
                        .build();
//                session.run("create (p:Test) return p");
                session.executeWriteWithoutResult(ctx -> {
                    ctx.run(format("CALL apoc.util.sleep(%d)", 2 * TIMEOUT));
                }, config);
                Assertions.fail("Should fail with timeout");
            }
        } catch (ClientException ex) {
            log.error("{}, {}", ex.code(), ex.getMessage());
            assertThat(ex.code()).isEqualTo(TIMEOUT_ERROR_CODE);
        }
    }

    // This test fails every 2nd time (with Neo.ClientError.Request.Invalid)
    @RepeatedTest(value = 7)
    public void timeout_for_implicit_tx_2() {
        try (Driver driver = GraphDatabase.driver(URL, AUTH)) {
            try (Session session = driver.session(SessionConfig.builder().withDatabase(DATABASE).build())) {
                TransactionConfig config = TransactionConfig.builder()
                        .withTimeout(Duration.of(TIMEOUT, ChronoUnit.MILLIS))
                        .build();
                session.run("create (p:Test) return p");
                session.executeWriteWithoutResult(ctx -> {
                    ctx.run(format("CALL apoc.util.sleep(%d)", 2 * TIMEOUT));
                }, config);
                Assertions.fail("Should fail with timeout");
            }
        } catch (ClientException ex) {
            log.error("{}, {}", ex.code(), ex.getMessage());
            assertThat(ex.code()).isEqualTo(TIMEOUT_ERROR_CODE);
        }
    }

}

Do you think it may be an issue with bolt protocol?

injectives commented 1 year ago

@ikwattro, thanks for reporting the issue. Unfortunately, I am struggling to reproduce the following error:

org.neo4j.driver.exceptions.ClientException: Message 'PULL Map{qid -> Long(0), n -> Long(1000)}' cannot be handled by a session in the READY state.

So far, I have setup a 3 node cluster using the neo4j:4.4.14-enterprise Docker image with SSR enabled. I tried hitting both the writer and the readers. It feels like I might be missing some crucial detail to reproduce.

injectives commented 1 year ago

@grzegorz-aniol, thanks for the tests. 👍

For some reason, I am getting different results:

It is probably worth noting I had console logging enabled as I was hoping to catch some network exchanges.

injectives commented 1 year ago

It would be good to establish how the Message 'PULL Map{qid -> Long(0), n -> Long(1000)}' cannot be handled error happens. If it is easy enough to reproduce, could you enable the driver debug logging (example: Config.builder().withLogging(Logging.console(Level.FINE)).build()) and share the log please?

grzegorz-aniol commented 1 year ago

I've been running these days exactly same unit tests, few times, and I cannot reproduce this issue anymore. What is strange, the cluster version is the same. I also tried with different driver versions (as described by @ikwattro), with same result.

injectives commented 1 year ago

Lets close this one. Please open a new issue or re-open this one if it becomes reproducible again.