Why is LIMIT not allowed in spark query?

neo4j-contrib / neo4j-spark-connector

Neo4j Connector for Apache Spark, which provides bi-directional read/write access to Neo4j from Spark, using the Spark DataSource APIs

https://neo4j.com/developer/spark/

Apache License 2.0


Why is LIMIT not allowed in spark query? #639

Closed (pascalwhoop closed this issue 1 month ago)

pascalwhoop commented 1 month ago

https://github.com/neo4j-contrib/neo4j-spark-connector/blob/98c6b9d0687a0f5dbb2b5559270ed387ac744c71/common/src/main/scala/org/neo4j/spark/util/Validations.scala#L278

It would be helpful to be able to test a query with 5-10 records before running the full thing. Is there a reason why this is not permitted? I couldn't find anything while searching your docs.

fbiville commented 1 month ago

Hello, the problem is that the query you configure is interpolated into a larger query template before being run. SKIP / LIMIT could have unintended consequences on the whole query or make it invalid. (Note that the validation is far from ideal, as it can quite easily produce false positives, but that's another story.) If you want to test a sample first, I'd advise using the limit method on the DataFrame.
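To make the interpolation problem concrete, here is a minimal Python sketch. Both the wrapper and the check are hypothetical stand-ins written for illustration: they are not the connector's actual template or its `Validations.scala` logic. The sketch assumes the connector appends its own paging clause to the user query, which is only one way such a template could look.

```python
import re


# Hypothetical wrapper: NOT the connector's real template, only an illustration
# of a user query being embedded into a larger query with its own paging.
def wrap_for_partition(user_query: str) -> str:
    """Append per-partition paging to a user-supplied Cypher query."""
    return f"{user_query} SKIP $skip LIMIT $limit"


# Hypothetical naive check: flags any standalone SKIP or LIMIT token,
# regardless of where it appears in the query text.
SKIP_LIMIT = re.compile(r"\b(SKIP|LIMIT)\b", re.IGNORECASE)


def has_skip_or_limit(user_query: str) -> bool:
    """Return True if the query text contains the word SKIP or LIMIT."""
    return SKIP_LIMIT.search(user_query) is not None


ok = "MATCH (p:Person) RETURN p.name AS name"
bad = "MATCH (p:Person) RETURN p.name AS name LIMIT 10"
# The word "limit" only appears inside a string literal here.
tricky = "MATCH (p:Plan) WHERE p.kind = 'no limit' RETURN p.kind AS kind"

print(wrap_for_partition(ok))    # a single paging clause at the end
print(wrap_for_partition(bad))   # LIMIT ... SKIP ... LIMIT: not valid Cypher
print(has_skip_or_limit(tricky)) # flagged although no LIMIT clause exists
```

With a template like this, a user-supplied `LIMIT 10` ends up followed by a second `SKIP ... LIMIT ...`, which Cypher rejects. The last query shows the kind of false positive a text-based check can run into.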

pascalwhoop commented 1 month ago

That's a bit unfortunate, since we'd have to load the entire DB into memory before saying "give me only the first 10". Can you help me understand why there is no way to pass a LIMIT directly to the DB?

fbiville commented 1 month ago

@pascalwhoop you can pass it, but through the DataFrame limit method. The limit is then pushed down to the Cypher layer as an optimization.
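A minimal PySpark sketch of that suggestion follows. The helper name `read_sample` and its parameters are made up for this example. The `org.neo4j.spark.DataSource` format and the `url` and `query` options are the connector's documented ones. Authentication options are left out and would need to be added for a secured database.

```python
def read_sample(spark, url, cypher, n=10):
    """Read through the connector's `query` option and cap the rows with limit().

    `spark` is an existing SparkSession; `cypher` must not contain SKIP/LIMIT,
    since the connector rejects those in the `query` option.
    """
    return (
        spark.read.format("org.neo4j.spark.DataSource")
        .option("url", url)
        .option("query", cypher)
        .load()
        .limit(n)  # expressed on the DataFrame, not inside the Cypher string
    )
```

Usage could look like `read_sample(spark, "bolt://localhost:7687", "MATCH (p:Person) RETURN p.name AS name").show()`, where the URL is a placeholder. Whether the limit is actually pushed down may depend on the Spark and connector versions in use, so inspecting the plan with `.explain()` on the returned DataFrame is a reasonable sanity check.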

pascalwhoop commented 1 month ago

Ah, I see: you do not deconstruct the query string back into an internal representation first, but send it as-is to the Neo4j instance. Now I get it! This was a misconception on my side. Spark treats string queries the same as queries expressed through the PySpark API: it always creates an internal representation first and then optimizes the entire query plan before executing. So as a Spark developer, you assume "I can express my query in any way", but the different APIs behave differently here. Got ya!