lakehq / sail

LakeSail's computation framework with a mission to unify stream processing, batch processing, and compute-intensive (AI) workloads.
https://lakesail.com
Apache License 2.0

Implement distributed job stages #265

Closed linhr closed 1 week ago

linhr commented 2 weeks ago
  1. Implement the physical plans for shuffle read and shuffle write.
  2. Update the codec for physical plans.
  3. Implement pipelined shuffle. Define the data structure that can later be used to support blocking shuffle via local files or remote storage.
  4. Implement the logic to start and stop workers for each job. Right now each job owns a fixed number of workers, and the workers are stopped after the job finishes. Dynamic worker allocation and worker reuse will be future work.
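
To make items 3 and 4 concrete, here is a minimal in-memory sketch of a pipelined shuffle together with a fixed pool of per-job workers. It is written in Python for brevity and is not Sail's implementation: `PipelinedShuffle`, `run_job`, and the queue-based channels are all hypothetical names and choices made for illustration. The point it shows is that readers consume rows while writers are still producing them, and that the workers live only as long as the job.

```python
import queue
from concurrent.futures import ThreadPoolExecutor

_DONE = object()  # sentinel: one writer has finished sending to a channel


class PipelinedShuffle:
    """Hypothetical in-memory shuffle: one channel per output partition."""

    def __init__(self, num_writers, num_partitions):
        self.num_writers = num_writers
        self.channels = [queue.Queue() for _ in range(num_partitions)]

    def write(self, rows, key):
        # Shuffle write: route each row to a partition by the hash of its key.
        n = len(self.channels)
        try:
            for row in rows:
                self.channels[hash(key(row)) % n].put(row)
        finally:
            # Always signal completion so readers cannot wait forever.
            for channel in self.channels:
                channel.put(_DONE)

    def read(self, partition):
        # Shuffle read: yield rows as they arrive, until every writer is done.
        remaining = self.num_writers
        channel = self.channels[partition]
        while remaining:
            item = channel.get()
            if item is _DONE:
                remaining -= 1
            else:
                yield item


def run_job(inputs, num_partitions, key):
    """Run one job with a fixed number of workers that stop when it ends."""
    shuffle = PipelinedShuffle(len(inputs), num_partitions)

    def read_all(partition):
        return list(shuffle.read(partition))

    # One worker per task, so readers and writers all run concurrently.
    with ThreadPoolExecutor(max_workers=len(inputs) + num_partitions) as workers:
        readers = [workers.submit(read_all, p) for p in range(num_partitions)]
        writers = [workers.submit(shuffle.write, rows, key) for rows in inputs]
        for writer in writers:
            writer.result()  # surface any writer error
        return [reader.result() for reader in readers]
    # Leaving the `with` block shuts the workers down.
```

Calling `run_job([[1, 2, 3, 4], [5, 6, 7, 8]], num_partitions=2, key=lambda r: r)` returns two partitions, one holding the even values and one the odd values; the order within a partition depends on thread scheduling. A blocking shuffle could keep the same `write`/`read` interface and swap the queues for local files or remote storage, which is the kind of extension point item 3 alludes to.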

The shuffle physical plans and the corresponding planner are inspired by the DataFusion Ballista and DataFusion Ray projects.
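
As a rough illustration of what such a planner does, the sketch below cuts a plan tree into stages at shuffle boundaries: the subtree under each boundary becomes a stage ending in a shuffle write, and the parent sees a shuffle read in its place. This is only the general idea, assuming a simplified tree model; `Plan`, `Stage`, `split_stages`, and the `is_shuffle` flag are hypothetical and do not correspond to types in Sail, Ballista, or DataFusion Ray.

```python
from dataclasses import dataclass, field


@dataclass
class Plan:
    name: str
    children: list = field(default_factory=list)
    is_shuffle: bool = False  # hypothetical marker for a repartition boundary


@dataclass
class Stage:
    stage_id: int
    plan: Plan


def split_stages(root):
    """Return stages in dependency order; the last stage is the final one."""
    stages = []

    def rewrite(node):
        # Rewrite children first so upstream stages get lower ids.
        children = [rewrite(child) for child in node.children]
        if node.is_shuffle:
            # Everything below the boundary becomes its own stage ...
            stage_id = len(stages)
            stages.append(Stage(stage_id, Plan("ShuffleWrite", children)))
            # ... and the parent reads that stage's output instead.
            return Plan(f"ShuffleRead(stage={stage_id})")
        return Plan(node.name, children)

    final_plan = rewrite(root)
    stages.append(Stage(len(stages), final_plan))
    return stages
```

For a plan such as `Plan("Aggregate", [Plan("Repartition", [Plan("Scan")], is_shuffle=True)])`, this yields two stages: stage 0 wraps the scan in a shuffle write, and stage 1 runs the aggregate over a shuffle read of stage 0.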

github-actions[bot] commented 1 week ago

Spark Test Report

Commit Information

| Commit | Revision | Branch |
| --- | --- | --- |
| After | 34c7a2a | refs/pull/265/merge |
| Before | d8345f8 | refs/heads/main |

Test Summary

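| Suite | Commit | Failed | Passed | Skipped | Warnings | Time (s) |
| --- | --- | --- | --- | --- | --- | --- |
| doctest-column | After | 4 | 29 | | 3 | 6.33 |
| | Before | 4 | 29 | | 3 | 6.52 |
| doctest-dataframe | After | 39 | 67 | 1 | 4 | 8.29 |
| | Before | 39 | 67 | 1 | 4 | 8.41 |
| doctest-functions | After | 188 | 215 | 6 | 8 | 12.46 |
| | Before | 188 | 215 | 6 | 8 | 12.52 |
| test-connect | After | 402 | 636 | 126 | 250 | 112.73 |
| | Before | 402 | 636 | 126 | 250 | 113.93 |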

Test Details

Error Counts

```text
          633 Total
          307 Total Unique
-------- ---- ----------------------------------------------------------------------------------------------------------
           30 DocTestFailure
           27 UnsupportedOperationException: map partitions
           24 UnsupportedOperationException: co-group map
           23 UnsupportedOperationException: group map
           15 UnsupportedOperationException: streaming query manager command
           14 UnsupportedOperationException: lambda function
           10 UnsupportedOperationException: unsupported data source format: Some("text")
           10 handle add artifacts
            8 AssertionError: AnalysisException not raised
            8 PySparkAssertionError: [DIFFERENT_PANDAS_DATAFRAME] DataFrames are not almost equal:
            8 PythonException:
            8 PythonException: KeyError: 0
            8 UnsupportedOperationException: hint
            7 AssertionError: False is not true
            6 IllegalArgumentException: invalid argument: UDF function type must be Python UDF
            6 UnsupportedOperationException: function in table factor
            6 UnsupportedOperationException: write stream operation start
            5 AnalysisException: Cannot cast to Decimal128(14, 7). Overflowing on NaN
            5 UnsupportedOperationException: function: monotonically_increasing_id
            4 AnalysisException: Error during planning: view not found: v
            4 AnalysisException: Execution error: 'Utf8("INTERVAL '0 00:00:00.000123' DAY TO SECOND") = CAST(#1 AS...
            4 PySparkNotImplementedError: [NOT_IMPLEMENTED] rdd() is not implemented.
            4 UnsupportedOperationException: function: window
            4 UnsupportedOperationException: interval unit: day-time
            4 UnsupportedOperationException: replace
            4 UnsupportedOperationException: sample
            4 UnsupportedOperationException: sample by
            4 UnsupportedOperationException: unknown aggregate function: hll_sketch_agg
            4 UnsupportedOperationException: unpivot
            3 AnalysisException: Error during planning: Table function 'range' not found
            3 AnalysisException: Error during planning: The expression to get an indexed field is only valid for `...
            3 AnalysisException: Execution error: Date part 'D' not supported
            3 AssertionError: "[('a', [('b', 'c')])]" != "{'a': {'b': 'c'}}"
            3 AssertionError: "{'col1': 1, 'struct(John, 30, struct(valu[87 chars]0}}}" != "Row(col1=1, col2=Row(c...
            3 AssertionError: AnalysisException not raised by
            3 AssertionError: Lists differ: [Row(a=1, a=1, b='x')] != [Row(a=1, b='x')]
            3 AssertionError: PythonException not raised
            3 IllegalArgumentException: expected value at line 1 column 1
            3 SparkRuntimeException: Internal error: UDF returned a different number of rows than expected. Expect...
            3 UnsupportedOperationException: crosstab
            3 UnsupportedOperationException: function: input_file_name
            3 UnsupportedOperationException: function: pmod
            3 UnsupportedOperationException: function: ~
            3 UnsupportedOperationException: handle analyze input files
            3 UnsupportedOperationException: to schema
            3 ValueError: Converting to Python dictionary is not supported when duplicate field names are present
            2 AnalysisException: Cannot cast from struct to other types except struct
            2 AnalysisException: Cannot cast list to non-list data types
            2 AnalysisException: Error during planning: The expression to get an indexed field is only valid for `...
            2 AnalysisException: Error during planning: cannot resolve attribute: ObjectName([Identifier("a"), Ide...
            2 AnalysisException: Error during planning: two values expected: [Alias(Alias { expr: Column(Column { ...
            2 AnalysisException: Execution error: map requires all value types to be the same
            2 AnalysisException: Invalid or Unsupported Configuration: Could not find config namespace "spark"
            2 AssertionError
            2 AssertionError: "TABLE_OR_VIEW_NOT_FOUND" does not match "Error during planning: cannot resolve attr...
            2 AssertionError: Lists differ: [Row([22 chars](key=1, value='1'), Row(key=10, value='10'), R[2402 cha...
            2 AssertionError: Lists differ: [Row(a=1, a=1, b=2)] != [Row(a=1, b=2)]
            2 AssertionError: Lists differ: [Row(key='input', value=1.0)] != [Row(key='avg', value=1.0), Row(key='...
            2 AssertionError: True is not false
            2 IllegalArgumentException: invalid argument: empty data source paths
            2 IllegalArgumentException: invalid argument: sql parser error: Expected: ), found: id at Line: 1, Col...
            2 IllegalArgumentException: invalid argument: sql parser error: Expected: ), found: id at Line: 1, Col...
            2 IllegalArgumentException: invalid argument: sql parser error: Expected: ), found: id at Line: 3, Col...
            2 IllegalArgumentException: invalid argument: sql parser error: Expected: ), found: id at Line: 5, Col...
            2 IllegalArgumentException: invalid argument: sql parser error: Expected: ), found: t at Line: 3, Colu...
            2 IllegalArgumentException: invalid argument: sql parser error: Expected: ), found: v at Line: 1, Colu...
            2 IllegalArgumentException: must either specify a row count or at least one column
            2 SparkRuntimeException: Internal error: start_from index out of bounds.
            2 UnsupportedOperationException: Only literal expr are supported in Python UDTFs for now, got expr: ma...
            2 UnsupportedOperationException: Only literal expr are supported in Python UDTFs for now, got expr: na...
            2 UnsupportedOperationException: Only literal expr are supported in Python UDTFs for now, got expr: ra...
            2 UnsupportedOperationException: Only literal expr are supported in Python UDTFs for now, got expr: sp...
            2 UnsupportedOperationException: approx quantile
            2 UnsupportedOperationException: collect metrics
            2 UnsupportedOperationException: decimal literal with precision or scale
            2 UnsupportedOperationException: freq items
            2 UnsupportedOperationException: function: bitmap_bit_position
            2 UnsupportedOperationException: function: crc32
            2 UnsupportedOperationException: function: dayofweek
            2 UnsupportedOperationException: function: encode
            2 UnsupportedOperationException: function: format_number
            2 UnsupportedOperationException: function: from_csv
            2 UnsupportedOperationException: function: from_json
            2 UnsupportedOperationException: function: inline
            2 UnsupportedOperationException: function: map_from_arrays
            2 UnsupportedOperationException: function: sec
            2 UnsupportedOperationException: function: shiftrightunsigned
            2 UnsupportedOperationException: function: timestamp_seconds
            2 UnsupportedOperationException: handle analyze same semantics
            2 UnsupportedOperationException: pivot
            2 UnsupportedOperationException: position with 3 arguments is not supported yet
            2 UnsupportedOperationException: rebalance partitioning by expression
            2 UnsupportedOperationException: tail
            2 UnsupportedOperationException: unknown aggregate function: collect_set
            2 UnsupportedOperationException: unresolved regex
            2 UnsupportedOperationException: unsupported data source format: Some("orc")
            2 UnsupportedOperationException: user defined data type should only exist in a field
            2 handle artifact statuses
            2 received metadata size exceeds hard limit (19620 vs. 16384); :status:42B content-type:60B grpc-stat...
            1 AnalysisException: Cannot cast string 'abc' to value of Float64 type
            1 AnalysisException: Cannot cast value 'abc' to value of Boolean type
            1 AnalysisException: Error during planning: Execution error: User-defined coercion failed with Interna...
            1 AnalysisException: Error during planning: Inconsistent data type across values list at row 1 column ...
            1 AnalysisException: Error during planning: No function matches the given name and argument types 'NTH...
            1 AnalysisException: Error during planning: cannot resolve attribute: ObjectName([Identifier("a"), Ide...
            1 AnalysisException: Error during planning: three values expected: [Alias(Alias { expr: Column(Column ...
            1 AnalysisException: Error during planning: three values expected: [Literal(Int32(1)), Literal(Int32(3...
            1 AnalysisException: Error during planning: two values expected: [Alias(Alias { expr: Column(Column { ...
            1 AnalysisException: Error during planning: view not found: tab2
            1 AnalysisException: Execution error: 'Utf8("1970-01-01 00:00:00") = CAST(#1 AS Utf8) AS dt AS (1970-0...
            1 AnalysisException: Execution error: 'Utf8("2012-02-02 02:02:02") = CAST(?table?.#0 AS Utf8) AS a AS ...
            1 AnalysisException: Execution error: 'Utf8("INTERVAL '0 00:00:00.000123' DAY TO SECOND") = CAST(#1 AS...
            1 AnalysisException: Execution error: Error parsing timestamp from '2023-01-01' using format '%d-%m-%Y...
            1 AnalysisException: Execution error: Unable to find factory for TEXT
            1 AnalysisException: Execution error: array_reverse does not support type 'Utf8'.
    (+1)    1 AnalysisException: Invalid or Unsupported Configuration: could not find config namespace for key "ig...
            1 AnalysisException: Invalid or Unsupported Configuration: could not find config namespace for key "li...
            1 AnalysisException: No field named tbl."#2". Valid fields are tbl."#3".
            1 AnalysisException: Schema contains duplicate unqualified field name "#0"
            1 AssertionError: "2000000" does not match "Internal error: raise_error expects a single UTF-8 string ...
    (+1)    1 AssertionError: "Database 'memory:317462d9-3429-4d9b-b985-67e991e07b9b' dropped." does not match "in...
    (+1)    1 AssertionError: "Database 'memory:adda2b0e-5f54-4825-bbcd-97107969b779' dropped." does not match "in...
            1 AssertionError: "Exception thrown when converting pandas.Series" does not match "
            1 AssertionError: "Exception thrown when converting pandas.Series" does not match "expected value at l...
            1 AssertionError: "PickleException" does not match "
            1 AssertionError: "UDTF_ARROW_TYPE_CAST_ERROR" does not match "
            1 AssertionError: "[('a', 'b')]" != "{'a': 'b'}"
            1 AssertionError: "foobar" does not match "Internal error: raise_error expects a single UTF-8 string a...
            1 AssertionError: "timestamp values are not equal (timestamp='1969-01-01 09:01:01+08:00': data[0][1]='...
            1 AssertionError: '+---[17 chars]-----+\n| x|\n+--------[132 chars]-+\n' != '+-...
            1 AssertionError: 2 != 3
            1 AssertionError: BinaryType() != NullType()
            1 AssertionError: Exception not raised
            1 AssertionError: Lists differ: [Row([14 chars] _c1=25, _c2='I am Hyukjin\n\nI love Spark!'),[86 chars...
            1 AssertionError: Lists differ: [Row([22 chars]e(2018, 12, 31, 16, 0), aware=datetime.datetim[16 chars...
            1 AssertionError: Lists differ: [Row([259 chars]681098, ln(id)=1.0986122886681098, struct(id, [975 cha...
            1 AssertionError: Lists differ: [Row([49 chars] 1), internal_value=-31532339000000000), Row(i[225 char...
            1 AssertionError: Lists differ: [Row(key='0'), Row(key='1'), Row(key='10'), Ro[1439 chars]99')] != [Ro...
            1 AssertionError: Lists differ: [Row(name='Andy', age=30), Row(name='Justin', [34 chars]one)] != [Row(...
            1 AssertionError: Row(point='[1.0, 2.0]', pypoint='[3.0, 4.0]') != Row(point='(1.0, 2.0)', pypoint='[3...
            1 AssertionError: Row(res="[('personal', [('name', 'John'), ('city', 'New York')])]") != Row(res="{'pe...
            1 AssertionError: StorageLevel(False, True, True, False, 1) != StorageLevel(False, False, False, False...
            1 AssertionError: Struc[30 chars]estampType(), True), StructField('val', IntegerType(), True)]) != Str...
            1 AssertionError: Struc[32 chars]e(), False), StructField('b', DoubleType(), Fa[158 chars]ue)]) != Str...
            1 AssertionError: Struc[40 chars]ue), StructField('val', ArrayType(DoubleType(), False), True)]) != St...
            1 AssertionError: YearMonthIntervalType(0, 1) != YearMonthIntervalType(0, 0)
            1 AssertionError: [1.0, 2.0] != ExamplePoint(1.0,2.0)
            1 AssertionError: {} != {'max_age': 5}
            1 AttributeError: 'DataFrame' object has no attribute '_ipython_key_completions_'
            1 AttributeError: 'DataFrame' object has no attribute '_joinAsOf'
    (+1)    1 FileNotFoundError: [Errno 2] No such file or directory: '/tmp/tmp0lza60kf'
    (+1)    1 FileNotFoundError: [Errno 2] No such file or directory: '/tmp/tmp1v2qfndd'
            1 IllegalArgumentException: 83140 is too large to store in a Decimal128 of precision 4. Max is 9999
            1 IllegalArgumentException: invalid argument: from_unixtime with format is not supported yet
            1 IllegalArgumentException: invalid argument: sql parser error: Expected: (, found: AS at Line: 1, Col...
            1 IllegalArgumentException: invalid argument: sql parser error: Expected: (, found: EOF
            1 IllegalArgumentException: number of columns(0) must match number of fields(1) in schema
            1 KeyError: 'max'
            1 PySparkNotImplementedError: [NOT_IMPLEMENTED] foreach() is not implemented.
            1 PySparkNotImplementedError: [NOT_IMPLEMENTED] foreachPartition() is not implemented.
            1 PySparkNotImplementedError: [NOT_IMPLEMENTED] localCheckpoint() is not implemented.
            1 PySparkNotImplementedError: [NOT_IMPLEMENTED] sparkContext() is not implemented.
            1 PySparkNotImplementedError: [NOT_IMPLEMENTED] toJSON() is not implemented.
            1 PythonException: ArrowTypeError: ("Expected dict key of type str or bytes, got 'int'", 'Conversion ...
            1 PythonException: AttributeError: 'NoneType' object has no attribute 'partitionId'
            1 PythonException: AttributeError: 'list' object has no attribute 'x'
            1 PythonException: AttributeError: 'list' object has no attribute 'y'
            1 PythonException: TypeError: 'NoneType' object is not subscriptable
            1 QueryExecutionException: Compute error: concat requires input of at least one array
            1 QueryExecutionException: Json error: Not valid JSON: EOF while parsing a list at line 1 column 1
            1 QueryExecutionException: Json error: Not valid JSON: expected value at line 1 column 2
            1 SparkRuntimeException: External error: Arrow error: Invalid argument error: column types must match ...
            1 SparkRuntimeException: External error: This feature is not implemented: physical plan is not yet imp...
            1 SparkRuntimeException: Internal error: Failed to decode from base64: Invalid padding.
            1 SparkRuntimeException: Internal error: UDF returned a different number of rows than expected. Expect...
            1 SparkRuntimeException: Optimizer rule 'optimize_projections' failed
            1 SparkRuntimeException: expand_wildcard_rule
            1 SparkRuntimeException: type_coercion
    (+1)    1 UnsupportedOperationException: Aggregate can not be used as a sliding accumulator because `retract_b...
    (+1)    1 UnsupportedOperationException: Aggregate can not be used as a sliding accumulator because `retract_b...
    (+1)    1 UnsupportedOperationException: Aggregate can not be used as a sliding accumulator because `retract_b...
    (+1)    1 UnsupportedOperationException: Aggregate can not be used as a sliding accumulator because `retract_b...
            1 UnsupportedOperationException: COUNT DISTINCT with multiple arguments
            1 UnsupportedOperationException: Insert into not implemented for this table
    (+1)    1 UnsupportedOperationException: Physical plan does not support logical expression AggregateFunction(A...
            1 UnsupportedOperationException: PlanNode::IsCached
            1 UnsupportedOperationException: SQL show functions
            1 UnsupportedOperationException: Unsupported statement: SHOW DATABASES
            1 UnsupportedOperationException: bucketing
            1 UnsupportedOperationException: call function
            1 UnsupportedOperationException: deduplicate within watermark
            1 UnsupportedOperationException: function exists
            1 UnsupportedOperationException: function: aes_decrypt
            1 UnsupportedOperationException: function: aes_encrypt
            1 UnsupportedOperationException: function: array_insert
            1 UnsupportedOperationException: function: array_sort
            1 UnsupportedOperationException: function: arrays_zip
            1 UnsupportedOperationException: function: bin
            1 UnsupportedOperationException: function: bit_count
            1 UnsupportedOperationException: function: bit_get
            1 UnsupportedOperationException: function: bitmap_bucket_number
            1 UnsupportedOperationException: function: bitmap_count
            1 UnsupportedOperationException: function: bround
            1 UnsupportedOperationException: function: conv
            1 UnsupportedOperationException: function: convert_timezone
            1 UnsupportedOperationException: function: csc
            1 UnsupportedOperationException: function: date_from_unix_date
            1 UnsupportedOperationException: function: dayofmonth
            1 UnsupportedOperationException: function: dayofyear
            1 UnsupportedOperationException: function: decode
            1 UnsupportedOperationException: function: e
            1 UnsupportedOperationException: function: elt
            1 UnsupportedOperationException: function: format_string
            1 UnsupportedOperationException: function: from_utc_timestamp
            1 UnsupportedOperationException: function: getbit
            1 UnsupportedOperationException: function: hour
            1 UnsupportedOperationException: function: inline_outer
            1 UnsupportedOperationException: function: java_method
            1 UnsupportedOperationException: function: json_object_keys
            1 UnsupportedOperationException: function: json_tuple
            1 UnsupportedOperationException: function: last_day
            1 UnsupportedOperationException: function: localtimestamp
            1 UnsupportedOperationException: function: make_dt_interval
            1 UnsupportedOperationException: function: make_interval
            1 UnsupportedOperationException: function: make_timestamp
            1 UnsupportedOperationException: function: make_timestamp_ltz
            1 UnsupportedOperationException: function: make_timestamp_ntz
            1 UnsupportedOperationException: function: make_ym_interval
            1 UnsupportedOperationException: function: map_concat
            1 UnsupportedOperationException: function: map_from_entries
            1 UnsupportedOperationException: function: mask
            1 UnsupportedOperationException: function: minute
            1 UnsupportedOperationException: function: months_between
            1 UnsupportedOperationException: function: next_day
            1 UnsupportedOperationException: function: parse_url
            1 UnsupportedOperationException: function: printf
            1 UnsupportedOperationException: function: quarter
            1 UnsupportedOperationException: function: reflect
            1 UnsupportedOperationException: function: regexp_count
            1 UnsupportedOperationException: function: regexp_extract
            1 UnsupportedOperationException: function: regexp_extract_all
            1 UnsupportedOperationException: function: regexp_instr
            1 UnsupportedOperationException: function: regexp_substr
            1 UnsupportedOperationException: function: schema_of_csv
            1 UnsupportedOperationException: function: schema_of_json
            1 UnsupportedOperationException: function: second
            1 UnsupportedOperationException: function: sentences
            1 UnsupportedOperationException: function: session_window
            1 UnsupportedOperationException: function: sha
            1 UnsupportedOperationException: function: sha1
            1 UnsupportedOperationException: function: soundex
            1 UnsupportedOperationException: function: spark_partition_id
            1 UnsupportedOperationException: function: split
            1 UnsupportedOperationException: function: stack
            1 UnsupportedOperationException: function: str_to_map
            1 UnsupportedOperationException: function: timestamp_micros
            1 UnsupportedOperationException: function: timestamp_millis
            1 UnsupportedOperationException: function: to_char
            1 UnsupportedOperationException: function: to_csv
            1 UnsupportedOperationException: function: to_json
            1 UnsupportedOperationException: function: to_number
            1 UnsupportedOperationException: function: to_unix_timestamp
            1 UnsupportedOperationException: function: to_utc_timestamp
            1 UnsupportedOperationException: function: to_varchar
            1 UnsupportedOperationException: function: try_add
            1 UnsupportedOperationException: function: try_aes_decrypt
            1 UnsupportedOperationException: function: try_divide
            1 UnsupportedOperationException: function: try_element_at
            1 UnsupportedOperationException: function: try_multiply
            1 UnsupportedOperationException: function: try_subtract
            1 UnsupportedOperationException: function: try_to_binary
            1 UnsupportedOperationException: function: try_to_number
            1 UnsupportedOperationException: function: try_to_timestamp
            1 UnsupportedOperationException: function: typeof
            1 UnsupportedOperationException: function: unix_date
            1 UnsupportedOperationException: function: unix_micros
            1 UnsupportedOperationException: function: unix_millis
            1 UnsupportedOperationException: function: unix_seconds
            1 UnsupportedOperationException: function: url_decode
            1 UnsupportedOperationException: function: url_encode
            1 UnsupportedOperationException: function: weekday
            1 UnsupportedOperationException: function: weekofyear
            1 UnsupportedOperationException: function: width_bucket
            1 UnsupportedOperationException: function: xpath
            1 UnsupportedOperationException: function: xpath_boolean
            1 UnsupportedOperationException: function: xpath_double
            1 UnsupportedOperationException: function: xpath_float
            1 UnsupportedOperationException: function: xpath_int
            1 UnsupportedOperationException: function: xpath_long
            1 UnsupportedOperationException: function: xpath_number
            1 UnsupportedOperationException: function: xpath_short
            1 UnsupportedOperationException: function: xpath_string
            1 UnsupportedOperationException: handle analyze is local
            1 UnsupportedOperationException: handle analyze semantic hash
            1 UnsupportedOperationException: list functions
            1 UnsupportedOperationException: unknown aggregate function: bitmap_or_agg
            1 UnsupportedOperationException: unknown aggregate function: count_if
            1 UnsupportedOperationException: unknown aggregate function: count_min_sketch
            1 UnsupportedOperationException: unknown aggregate function: grouping_id
            1 UnsupportedOperationException: unknown aggregate function: histogram_numeric
            1 UnsupportedOperationException: unknown aggregate function: percentile
            1 UnsupportedOperationException: unknown aggregate function: try_avg
            1 UnsupportedOperationException: unknown aggregate function: try_sum
            1 UnsupportedOperationException: unknown function: distributed_sequence_id
            1 UnsupportedOperationException: unknown function: product
            1 ValueError: The column label 'struct' is not unique.
            1 internal error: unknown attribute in plan 239: a
    (-1)    0 AnalysisException: Invalid or Unsupported Configuration: could not find config namespace for key "ig...
    (-1)    0 AssertionError: "Database 'memory:58e093b4-80f2-4320-b2a6-20fbeaa400da' dropped." does not match "in...
    (-1)    0 AssertionError: "Database 'memory:aa3dd654-420a-43ce-9848-1227781d5957' dropped." does not match "in...
    (-1)    0 FileNotFoundError: [Errno 2] No such file or directory: '/tmp/tmpfss93iao'
    (-1)    0 FileNotFoundError: [Errno 2] No such file or directory: '/tmp/tmptve67jk7'
    (-1)    0 UnsupportedOperationException: Aggregate can not be used as a sliding accumulator because `retract_b...
    (-1)    0 UnsupportedOperationException: Aggregate can not be used as a sliding accumulator because `retract_b...
    (-1)    0 UnsupportedOperationException: Aggregate can not be used as a sliding accumulator because `retract_b...
    (-1)    0 UnsupportedOperationException: Aggregate can not be used as a sliding accumulator because `retract_b...
    (-1)    0 UnsupportedOperationException: Physical plan does not support logical expression AggregateFunction(A...
```

Passed Tests Diff (empty)