neo4j-contrib / neo4j-spark-connector

Neo4j Connector for Apache Spark, which provides bi-directional read/write access to Neo4j from Spark, using the Spark DataSource APIs
https://neo4j.com/developer/spark/
Apache License 2.0

Investigate about possible useless Neo4j connections #224

Open utnaf opened 3 years ago

utnaf commented 3 years ago

While writing tests I came across this scenario:

  @Test
  def testComplexReturnStatementNoValues(): Unit = {
    val df = ss.read.format(classOf[DataSource].getName)
      .option("url", SparkConnectorScalaSuiteIT.server.getBoltUrl)
      .option("query",
        """MATCH (p:Person)-[b:BOUGHT]->(pr:Product)
          |RETURN id(p) AS personId, id(pr) AS productId, {quantity: b.quantity, when: b.when} AS map, "some string" as someString, {anotherField: "201", and: 1} as map2""".stripMargin)
      .option("schema.strategy", "string")
      .load()

    assertEquals(Seq("personId", "productId", "map", "someString", "map2"), df.columns.toSeq)
  }

Even though I'm 101% sure that the assertEquals is green, executing this test causes the following timeout error:

java.lang.AssertionError: Timeout hit (30 seconds) while waiting for condition to match: 
Expected: <true>
     but: was <false>
Expected :<true>
Actual   :<false>

Connection log is:

For test testComplexReturnStatementNoValues => connections before: 2, after: 3

Including an action in the test (like df.count()) makes the whole thing work: no more error, and the test is green.

Investigate whether we have a useless connection hanging around that causes the problem, or whether it's a test configuration issue.
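This would be consistent with Spark's lazy evaluation: load() only builds the logical plan (possibly opening a connection for schema inference), while the query itself runs only when an action like df.count() forces it. A minimal plain-Scala analogy of that behavior, with a hypothetical counter standing in for the driver's connection bookkeeping:

```scala
// Sketch only: illustrates deferred evaluation, not the connector's actual internals.
object LazyEvalSketch {
  var connectionsOpened = 0 // hypothetical stand-in for the pool's open-connection count

  // Pretend this hits the database; in the connector this would be the real query.
  def runQuery(): Seq[Int] = {
    connectionsOpened += 1
    Seq(1, 2, 3)
  }

  def demo(): Int = {
    connectionsOpened = 0

    // "load()": builds the pipeline but does not execute the query yet.
    lazy val df = runQuery()
    assert(connectionsOpened == 0) // nothing ran, like load() without an action

    // "df.count()": the action forces evaluation, which is when work happens.
    val n = df.size
    assert(connectionsOpened == 1)
    n
  }
}
```

If the Spark read behaves analogously, a test that never runs an action could leave whatever the schema-inference step opened sitting around, which would match the "connections before: 2, after: 3" log above.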

moxious commented 3 years ago

Within the neo4j driver object it's possible to configure the size of the connection pool that it opens when you initialize it. If you don't configure this, I think you get something like 3-5 connections, since the driver assumes you'll issue multiple queries and so on.

If it is the case that neo4j operations are always single-threaded within a worker node, it might make sense to explicitly configure max connections to be 1 for all driver instances.
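As a sketch of that suggestion, the Java driver's Config builder exposes a max pool size; whether 1 is actually safe depends on the single-threaded assumption above, and the helper object and names here are made up for illustration:

```scala
import org.neo4j.driver.{AuthTokens, Config, GraphDatabase}

// Hypothetical helper: builds a driver whose pool holds at most one connection,
// assuming each worker issues queries strictly sequentially.
object SingleConnectionDriver {
  val config: Config = Config.builder()
    .withMaxConnectionPoolSize(1) // cap the pool so no idle spare connections linger
    .build()

  def create(url: String, user: String, password: String) =
    GraphDatabase.driver(url, AuthTokens.basic(user, password), config)
}
```

If any code path in the connector runs concurrent queries on the same driver instance, a pool of 1 would instead cause acquisition timeouts, so this would need verification first.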

moxious commented 3 years ago

Related to this: I can get some very weird driver errors (not connector errors) when playing around with connection schemes.

For example, imagine any simple read query to the database, doesn't matter what.

The strange errors I'm seeing may be related to connection reuse in the worker node? I'm guessing. I'm not reporting this as a separate issue right now because I can't reliably reproduce it. But related to this ticket, some questions arise for me: