Closed seankao-az closed 1 month ago
`LIMIT 10` is not the same as `rare`: to achieve the desired result, we should use a window function that limits the output to 10 records per group.
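As an aside, the difference between a global `LIMIT` and a per-group limit can be sketched in plain Python. The counts below are hypothetical, not taken from `http_logs`; the point is only that the two operations select different rows:

```python
from collections import defaultdict

# Hypothetical (request, clientip) -> count table.
counts = {
    ("GET /a", "1.1.1.1"): 5,
    ("GET /b", "1.1.1.1"): 1,
    ("GET /c", "1.1.1.1"): 3,
    ("GET /a", "2.2.2.2"): 2,
    ("GET /b", "2.2.2.2"): 7,
}

def rare_per_group(counts, n):
    """Keep the n rarest requests per clientip (per-group limit,
    the semantics of `rare request by clientip`)."""
    by_ip = defaultdict(list)
    for (request, ip), cnt in counts.items():
        by_ip[ip].append((cnt, request))
    return {ip: [req for _, req in sorted(rows)[:n]]
            for ip, rows in by_ip.items()}

def global_limit(counts, n):
    """Globally sort by count and take n rows
    (the semantics of ORDER BY cnt ASC LIMIT n)."""
    rows = sorted((cnt, req, ip) for (req, ip), cnt in counts.items())
    return rows[:n]
```

With `n = 2`, `rare_per_group` returns two requests for every clientip, while `global_limit` returns only the two globally smallest counts, possibly all from one clientip.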
```sql
SELECT
  request,
  clientip,
  cnt
FROM (
  SELECT
    request,
    clientip,
    COUNT(*) AS cnt,
    ROW_NUMBER() OVER (PARTITION BY request, clientip ORDER BY cnt ASC) AS rn
  FROM
    myglue_test.default.http_logs
  GROUP BY
    request, clientip
) AS RankedLogs
WHERE
  rn <= 10
ORDER BY
  cnt ASC;
```
This fails with:

```
Fail to analyze query. Cause: [UNSUPPORTED_FEATURE.LATERAL_COLUMN_ALIAS_IN_WINDOW] The feature is not supported: Referencing a lateral column alias `cnt` in window expression "row_number() OVER (PARTITION BY request, clientip ORDER BY lateralAliasReference(cnt) ASC NULLS FIRST ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)"
```
Modifying the query to:
```sql
SELECT
  request,
  clientip,
  cnt
FROM (
  SELECT
    request,
    clientip,
    COUNT(*) AS cnt,
    ROW_NUMBER() OVER (PARTITION BY request, clientip ORDER BY COUNT(*) ASC) AS rn
  FROM
    myglue_test.default.http_logs
  GROUP BY
    request, clientip
) AS RankedLogs
WHERE
  rn <= 10
ORDER BY
  cnt ASC;
```
results in:
```json
{
  "status": "FAILED",
  "error": "{\"Message\":\"Spark exception. Cause: Job aborted due to stage failure: Total size of serialized results of 12 tasks (1905.8 MiB) is bigger than spark.driver.maxResultSize (1024.0 MiB)\"}"
}
```
The physical plan being:
```
== Physical Plan ==
AdaptiveSparkPlan isFinalPlan=true
+- == Final Plan ==
   *(5) SerializeFromObject [staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, input[0, java.lang.String, true], true, false, true) AS value#35]
   +- MapPartitions org.apache.spark.sql.Dataset$$Lambda$3910/0x00007f4501250f10@10b46c72, obj#34: java.lang.String
      +- DeserializeToObject createexternalrow(request#15.toString, clientip#14.toString, staticinvoke(class java.lang.Long, ObjectType(class java.lang.Long), valueOf, cnt#8L, true, false, true), StructField(request,StringType,true), StructField(clientip,StringType,true), StructField(cnt,LongType,false)), obj#33: org.apache.spark.sql.Row
         +- *(4) Sort [cnt#8L ASC NULLS FIRST], true, 0
            +- AQEShuffleRead coalesced
               +- ShuffleQueryStage 1
                  +- Exchange rangepartitioning(cnt#8L ASC NULLS FIRST, 1000), ENSURE_REQUIREMENTS, [plan_id=205]
                     +- *(3) Project [request#15, clientip#14, cnt#8L]
                        +- *(3) Filter (rn#9 <= 10)
                           +- Window [row_number() windowspecdefinition(request#15, clientip#14, _w0#21L ASC NULLS FIRST, specifiedwindowframe(RowFrame, unboundedpreceding$(), currentrow$())) AS rn#9], [request#15, clientip#14], [_w0#21L ASC NULLS FIRST]
                              +- WindowGroupLimit [request#15, clientip#14], [_w0#21L ASC NULLS FIRST], row_number(), 10, Final
                                 +- *(2) Sort [request#15 ASC NULLS FIRST, clientip#14 ASC NULLS FIRST, _w0#21L ASC NULLS FIRST], false, 0
                                    +- *(2) HashAggregate(keys=[request#15, clientip#14], functions=[count(1)], output=[request#15, clientip#14, cnt#8L, _w0#21L], schema specialized)
                                       +- AQEShuffleRead coalesced
                                          +- ShuffleQueryStage 0
                                             +- Exchange hashpartitioning(request#15, clientip#14, 1000), ENSURE_REQUIREMENTS, [plan_id=97]
                                                +- *(1) HashAggregate(keys=[request#15, clientip#14], functions=[partial_count(1)], output=[request#15, clientip#14, count#37L], schema specialized)
                                                   +- *(1) Project [clientip#14, request#15]
                                                      +- FileScan json spark_catalog.default.http_logs[clientip#14,request#15,year#18,month#19,day#20] Batched: false, DataFilters: [], Format: JSON, Location: CatalogFileIndex(1 paths)[s3:..., PartitionFilters: [], PushedFilters: [], ReadSchema: struct<clientip:string,request:string>
```
Now the plan looks quite different with the window function. But regardless, the same failure can happen with plain SQL as well, so this is not an issue with the PPL `rare` command. It can be resolved by scaling up the driver node or by increasing `spark.driver.maxResultSize`, as the error message suggests.
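For reference, a minimal sketch of that configuration change (the `4g` value is an arbitrary example; `0` removes the limit entirely, at the risk of driver OOM):

```
# spark-defaults.conf (or pass via --conf on spark-submit)
spark.driver.maxResultSize  4g
```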
**What is the bug?**

Running

```
source=myglue_test.default.http_logs | rare request by clientip
```

gets an error. However, an equivalent SQL query:

```sql
select count(*) as cnt, request, clientip
from myglue_test.default.http_logs
group by request, clientip
order by cnt asc
limit 10
```

could get the result correctly:

```json
{
  "status": "SUCCESS",
  "schema": [
    { "name": "cnt", "type": "long" },
    { "name": "request", "type": "string" },
    { "name": "clientip", "type": "string" }
  ],
  "datarows": [
    [1, "GET /images/102321.gif HTTP/1.0", "120.125.16.0"],
    [1, "GET /images/102327.gif HTTP/1.0", "183.200.6.0"],
    [1, "GET /images/102321.gif HTTP/1.0", "89.57.14.0"],
    [1, "GET /images/102321.gif HTTP/1.0", "147.116.16.0"],
    [1, "GET /images/102321.gif HTTP/1.1", "224.58.14.0"],
    [1, "GET /images/102321.gif HTTP/1.0", "178.72.4.0"],
    [1, "GET /english/images/comp_bu_stage1n.gif HTTP/1.0", "195.128.16.0"],
    [1, "GET /english/images/comp_bu_stage1n.gif HTTP/1.0", "26.52.13.0"],
    [1, "GET /images/102321.gif HTTP/1.1", "77.76.14.0"],
    [1, "GET /english/images/team_bu_roster_on.gif HTTP/1.1", "142.58.14.0"]
  ],
  "total": 10,
  "size": 10
}
```

The `LIMIT 10` clause might cause a difference in the SQL query, but I added it because the `rare` command defaults to size 10 as well.

Physical plan for PPL query:
Physical plan for SQL query:
**What is your host/environment?**