opensearch-project / opensearch-spark

Spark Accelerator framework; it enables secondary indices on remote data stores.
Apache License 2.0

[BUG] Add circuit breaker for FlintREPL #109

Closed YANG-DB closed 11 months ago

YANG-DB commented 1 year ago

What is the bug? In some cases Flint queries keep running indefinitely without being timed out or circuit-broken.

How can one reproduce the bug? From workbench run:

source = mys3.default.http_logs | head 5

What is the expected behavior? Jobs should be timed out / circuit-broken after a maximum running time.
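The expected behavior above amounts to bounding the query's wall-clock runtime. As a minimal, hypothetical sketch (not Flint's actual implementation — the helper name and signature are invented for illustration), a polling loop with a hard deadline could look like this:

```python
import time


def run_with_deadline(poll_once, max_runtime_s, interval_s=0.0):
    """Poll until poll_once() returns a result or the deadline passes.

    poll_once: callable returning a result, or None if not done yet.
    Raises TimeoutError once max_runtime_s seconds have elapsed,
    instead of letting the loop run forever.
    """
    deadline = time.monotonic() + max_runtime_s
    while time.monotonic() < deadline:
        result = poll_once()
        if result is not None:
            return result
        time.sleep(interval_s)
    raise TimeoutError(f"query exceeded max runtime of {max_runtime_s}s")
```

The key design point is using a monotonic clock for the deadline, so the breaker cannot be defeated by system clock adjustments.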

Do you have any screenshots?

The following job doesn't stop running:

Spark Config Params:

--class org.apache.spark.sql.FlintREPL \
--conf spark.hadoop.fs.s3.customAWSCredentialsProvider=com.amazonaws.emr.AssumeRoleAWSCredentialsProvider \
--conf spark.hadoop.aws.catalog.credentials.provider.factory.class=com.amazonaws.glue.catalog.metastore.STSAssumeRoleSessionCredentialsProviderFactory \
--conf spark.jars.packages=org.opensearch:opensearch-spark-standalone_2.12:0.1.0-SNAPSHOT,org.opensearch:opensearch-spark-sql-application_2.12:0.1.0-SNAPSHOT,org.opensearch:opensearch-spark-ppl_2.12:0.1.0-SNAPSHOT \
--conf spark.jars.repositories=https://aws.oss.sonatype.org/content/repositories/snapshots \
--conf spark.datasource.flint.host=ec2-18-237-133-156.us-west-2.compute.amazonaws.com \
--conf spark.datasource.flint.port=9200 \
--conf spark.datasource.flint.scheme=http \
--conf spark.datasource.flint.auth=noauth \
--conf spark.datasource.flint.customAWSCredentialsProvider=com.amazonaws.emr.AssumeRoleAWSCredentialsProvider \
--conf spark.sql.extensions=org.opensearch.flint.spark.FlintPPLSparkExtensions,org.opensearch.flint.spark.FlintSparkExtensions \
--conf spark.emr-serverless.driverEnv.ASSUME_ROLE_CREDENTIALS_ROLE_ARN=arn:aws:iam::924196221507:role/FlintOpensearchServiceRole \
--conf spark.executorEnv.ASSUME_ROLE_CREDENTIALS_ROLE_ARN=arn:aws:iam::924196221507:role/FlintOpensearchServiceRole \
--conf spark.hive.metastore.glue.role.arn=arn:aws:iam::924196221507:role/FlintOpensearchServiceRole \
--conf spark.sql.catalog.mys3=org.opensearch.sql.FlintDelegatingSessionCatalog \
--conf spark.flint.datasource.name=mys3 \
--conf spark.flint.job.requestIndex=.query_execution_request_mys3 \
--conf spark.flint.job.sessionId=NWpKUFBobmtiZ215czM= \
--conf spark.dynamicAllocation.enabled=false \
--conf spark.executor.instances=10 \
--conf spark.hadoop.hive.metastore.client.factory.class=com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory \
--conf spark.emr-serverless.driverEnv.JAVA_HOME=/usr/lib/jvm/java-17-amazon-corretto.x86_64/ \
--conf spark.executorEnv.JAVA_HOME=/usr/lib/jvm/java-17-amazon-corretto.x86_64/

Job error Log:

https://aws.oss.sonatype.org/content/repositories/snapshots added as a remote repository with the name: repo-1
Ivy Default Cache set to: /home/hadoop/.ivy2/cache
The jars for the packages stored in: /home/hadoop/.ivy2/jars
org.opensearch#opensearch-spark-standalone_2.12 added as a dependency
org.opensearch#opensearch-spark-sql-application_2.12 added as a dependency
org.opensearch#opensearch-spark-ppl_2.12 added as a dependency
:: resolving dependencies :: org.apache.spark#spark-submit-parent-e89ff3b7-170e-4545-bd38-2304496ffbe0;1.0
    confs: [default]
    found org.opensearch#opensearch-spark-standalone_2.12;0.1.0-SNAPSHOT in repo-1
    found org.opensearch#opensearch-spark-sql-application_2.12;0.1.0-SNAPSHOT in repo-1
    found org.opensearch#opensearch-spark-ppl_2.12;0.1.0-SNAPSHOT in repo-1
downloading https://aws.oss.sonatype.org/content/repositories/snapshots/org/opensearch/opensearch-spark-standalone_2.12/0.1.0-SNAPSHOT/opensearch-spark-standalone_2.12-0.1.0-20231025.180019-42.jar ...
    [SUCCESSFUL ] org.opensearch#opensearch-spark-standalone_2.12;0.1.0-SNAPSHOT!opensearch-spark-standalone_2.12.jar (1796ms)
downloading https://aws.oss.sonatype.org/content/repositories/snapshots/org/opensearch/opensearch-spark-sql-application_2.12/0.1.0-SNAPSHOT/opensearch-spark-sql-application_2.12-0.1.0-20231025.180031-35.jar ...
    [SUCCESSFUL ] org.opensearch#opensearch-spark-sql-application_2.12;0.1.0-SNAPSHOT!opensearch-spark-sql-application_2.12.jar (292ms)
downloading https://aws.oss.sonatype.org/content/repositories/snapshots/org/opensearch/opensearch-spark-ppl_2.12/0.1.0-SNAPSHOT/opensearch-spark-ppl_2.12-0.1.0-20231025.180026-19.jar ...
    [SUCCESSFUL ] org.opensearch#opensearch-spark-ppl_2.12;0.1.0-SNAPSHOT!opensearch-spark-ppl_2.12.jar (229ms)
:: resolution report :: resolve 7942ms :: artifacts dl 2324ms
    :: modules in use:
    org.opensearch#opensearch-spark-ppl_2.12;0.1.0-SNAPSHOT from repo-1 in [default]
    org.opensearch#opensearch-spark-sql-application_2.12;0.1.0-SNAPSHOT from repo-1 in [default]
    org.opensearch#opensearch-spark-standalone_2.12;0.1.0-SNAPSHOT from repo-1 in [default]
    ---------------------------------------------------------------------
    |                  |            modules            ||   artifacts   |
    |       conf       | number| search|dwnlded|evicted|| number|dwnlded|
    ---------------------------------------------------------------------
    |      default     |   3   |   3   |   3   |   0   ||   3   |   3   |
    ---------------------------------------------------------------------
:: retrieving :: org.apache.spark#spark-submit-parent-e89ff3b7-170e-4545-bd38-2304496ffbe0
    confs: [default]
    3 artifacts copied, 0 already retrieved (37949kB/39ms)
23/10/27 06:12:44 INFO HiveConf: Found configuration file file:/etc/spark/conf/hive-site.xml
23/10/27 06:12:44 INFO SparkContext: Running Spark version 3.3.2-amzn-0
23/10/27 06:12:44 INFO ResourceUtils: ==============================================================
23/10/27 06:12:44 INFO ResourceUtils: No custom resources configured for spark.driver.
23/10/27 06:12:44 INFO ResourceUtils: ==============================================================
23/10/27 06:12:44 INFO SparkContext: Submitted application: FlintREPL$
23/10/27 06:12:44 INFO ResourceProfile: Default ResourceProfile created, executor resources: Map(cores -> name: cores, amount: 4, script: , vendor: , memory -> name: memory, amount: 14336, script: , vendor: , offHeap -> name: offHeap, amount: 0, script: , vendor: ), task resources: Map(cpus -> name: cpus, amount: 1.0)
23/10/27 06:12:44 INFO ResourceProfile: Limiting resource is cpus at 4 tasks per executor
23/10/27 06:12:44 INFO ResourceProfileManager: Added ResourceProfile id: 0
23/10/27 06:12:44 INFO SecurityManager: Changing view acls to: hadoop
23/10/27 06:12:44 INFO SecurityManager: Changing modify acls to: hadoop
23/10/27 06:12:44 INFO SecurityManager: Changing view acls groups to: 
23/10/27 06:12:44 INFO SecurityManager: Changing modify acls groups to: 
23/10/27 06:12:44 INFO SecurityManager: SecurityManager: authentication enabled; ui acls disabled; users  with view permissions: Set(hadoop); groups with view permissions: Set(); users  with modify permissions: Set(hadoop); groups with modify permissions: Set()
23/10/27 06:12:44 INFO Utils: Successfully started service 'sparkDriver' on port 42815.
23/10/27 06:12:44 INFO SparkEnv: Registering MapOutputTracker
23/10/27 06:12:44 INFO SparkEnv: Registering BlockManagerMaster
23/10/27 06:12:44 INFO BlockManagerMasterEndpoint: Using org.apache.spark.storage.DefaultTopologyMapper for getting topology information
23/10/27 06:12:44 INFO BlockManagerMasterEndpoint: BlockManagerMasterEndpoint up
23/10/27 06:12:44 INFO SparkEnv: Registering BlockManagerMasterHeartbeat
23/10/27 06:12:44 INFO DiskBlockManager: Created local directory at /tmp/blockmgr-0cc5defb-dc15-4c7a-b4dd-392a8b830b7d
23/10/27 06:12:44 INFO MemoryStore: MemoryStore started with capacity 8.2 GiB
23/10/27 06:12:44 INFO SparkEnv: Registering OutputCommitCoordinator
23/10/27 06:12:44 INFO SubResultCacheManager: Sub-result caches are disabled.
23/10/27 06:12:45 INFO Utils: Successfully started service 'SparkUI' on port 4040.
23/10/27 06:12:45 INFO SparkContext: Added JAR file:///home/hadoop/.ivy2/jars/org.opensearch_opensearch-spark-standalone_2.12-0.1.0-SNAPSHOT.jar at spark://[2a05:d018:107:5a00:c2e7:cd6:3e75:ab16]:42815/jars/org.opensearch_opensearch-spark-standalone_2.12-0.1.0-SNAPSHOT.jar with timestamp 1698387164418
23/10/27 06:12:45 INFO SparkContext: Added JAR file:///home/hadoop/.ivy2/jars/org.opensearch_opensearch-spark-sql-application_2.12-0.1.0-SNAPSHOT.jar at spark://[2a05:d018:107:5a00:c2e7:cd6:3e75:ab16]:42815/jars/org.opensearch_opensearch-spark-sql-application_2.12-0.1.0-SNAPSHOT.jar with timestamp 1698387164418
23/10/27 06:12:45 INFO SparkContext: Added JAR file:///home/hadoop/.ivy2/jars/org.opensearch_opensearch-spark-ppl_2.12-0.1.0-SNAPSHOT.jar at spark://[2a05:d018:107:5a00:c2e7:cd6:3e75:ab16]:42815/jars/org.opensearch_opensearch-spark-ppl_2.12-0.1.0-SNAPSHOT.jar with timestamp 1698387164418
23/10/27 06:12:45 INFO SparkContext: The JAR file:///home/hadoop/.ivy2/jars/org.opensearch_opensearch-spark-sql-application_2.12-0.1.0-SNAPSHOT.jar at spark://[2a05:d018:107:5a00:c2e7:cd6:3e75:ab16]:42815/jars/org.opensearch_opensearch-spark-sql-application_2.12-0.1.0-SNAPSHOT.jar has been added already. Overwriting of added jar is not supported in the current version.
23/10/27 06:12:45 INFO ExecutorContainerAllocator: Set total expected execs to {0=10}
23/10/27 06:12:45 INFO Utils: Successfully started service 'org.apache.spark.network.netty.NettyBlockTransferService' on port 34899.
23/10/27 06:12:45 INFO NettyBlockTransferService: Server created on [2a05:d018:107:5a00:c2e7:cd6:3e75:ab16]:34899
23/10/27 06:12:45 INFO BlockManager: Using org.apache.spark.storage.RandomBlockReplicationPolicy for block replication policy
23/10/27 06:12:45 INFO BlockManagerMaster: Registering BlockManager BlockManagerId(driver, [2a05:d018:107:5a00:c2e7:cd6:3e75:ab16], 34899, None)
23/10/27 06:12:45 INFO BlockManagerMasterEndpoint: Registering block manager [2a05:d018:107:5a00:c2e7:cd6:3e75:ab16]:34899 with 8.2 GiB RAM, BlockManagerId(driver, [2a05:d018:107:5a00:c2e7:cd6:3e75:ab16], 34899, None)
23/10/27 06:12:45 INFO ExecutorContainerAllocator: Going to request 10 executors for ResourceProfile Id: 0, target: 10 already provisioned: 0.
23/10/27 06:12:45 INFO BlockManagerMaster: Registered BlockManager BlockManagerId(driver, [2a05:d018:107:5a00:c2e7:cd6:3e75:ab16], 34899, None)
23/10/27 06:12:45 INFO BlockManager: Initialized BlockManager: BlockManagerId(driver, [2a05:d018:107:5a00:c2e7:cd6:3e75:ab16], 34899, None)
23/10/27 06:12:45 INFO TimeBasedRotatingEventLogFilesWriter: rotationIntervalInSeconds = 300, eventFileMinSize = 1048576, maxFilesToRetain = 2
23/10/27 06:12:45 INFO DefaultEmrServerlessRMClient: Creating containers with container role SPARK_EXECUTOR and keys: Set(8, 4, 9, 5, 10, 6, 1, 2, 7, 3)
23/10/27 06:12:45 INFO TimeBasedRotatingEventLogFilesWriter: Logging events to file:/var/log/spark/apps/eventlog_v2_00fe9epc1cjpq00q/00fe9epc1cjpq00q.inprogress
23/10/27 06:12:46 INFO DefaultEmrServerlessRMClient: Containers created with container role SPARK_EXECUTOR. key to container id map: Map(2 -> 08c5b7e1-1b1b-dd4b-62c2-3a7ab4c18ac8, 5 -> 26c5b7e1-1b13-e4bb-8cf6-0a60d7c7f5cc, 8 -> 84c5b7e1-1b22-e778-85e7-3b8e950e0a21, 7 -> 62c5b7e1-1b45-344a-45cf-d3deb353616f, 1 -> 78c5b7e1-1b2f-5103-7308-a83aefb136f8, 4 -> a6c5b7e1-1b3d-23b5-549e-bec6bb731d45, 6 -> 86c5b7e1-1b0c-2433-e7f0-24c75089ba24, 9 -> 12c5b7e1-1b35-dcd6-7cc4-10346fd83dfc, 3 -> 18c5b7e1-1b4c-740e-8965-7a9b4915f5e6, 10 -> 8cc5b7e1-1b05-49e2-c3a3-c6870e7b2340)
23/10/27 06:12:50 INFO EmrServerlessClusterSchedulerBackend$EmrServerlessDriverEndpoint: Registered executor NettyRpcEndpointRef(spark-client://Executor) (2a05:d018:107:5a00:69d4:5763:1571:e8f5:33610) with ID 7,  ResourceProfileId 0
23/10/27 06:12:50 INFO BlockManagerMasterEndpoint: Registering block manager [2a05:d018:107:5a00:69d4:5763:1571:e8f5]:41483 with 7.9 GiB RAM, BlockManagerId(7, [2a05:d018:107:5a00:69d4:5763:1571:e8f5], 41483, None)
23/10/27 06:12:50 INFO EmrServerlessClusterSchedulerBackend$EmrServerlessDriverEndpoint: Registered executor NettyRpcEndpointRef(spark-client://Executor) (2a05:d018:107:5a00:3855:6e2e:5ec2:9670:56494) with ID 2,  ResourceProfileId 0
23/10/27 06:12:50 INFO EmrServerlessClusterSchedulerBackend$EmrServerlessDriverEndpoint: Registered executor NettyRpcEndpointRef(spark-client://Executor) (2a05:d018:107:5a00:dc42:d9ab:6f13:a7f4:50894) with ID 8,  ResourceProfileId 0
23/10/27 06:12:50 INFO EmrServerlessClusterSchedulerBackend$EmrServerlessDriverEndpoint: Registered executor NettyRpcEndpointRef(spark-client://Executor) (2a05:d018:107:5a00:7ccc:70a0:84d8:7e13:42610) with ID 5,  ResourceProfileId 0
23/10/27 06:12:50 INFO EmrServerlessClusterSchedulerBackend$EmrServerlessDriverEndpoint: Registered executor NettyRpcEndpointRef(spark-client://Executor) (2a05:d018:107:5a00:b57:4798:8281:3de4:48974) with ID 6,  ResourceProfileId 0
23/10/27 06:12:50 INFO EmrServerlessClusterSchedulerBackend$EmrServerlessDriverEndpoint: Registered executor NettyRpcEndpointRef(spark-client://Executor) (2a05:d018:107:5a00:ee9a:1854:d167:471b:32842) with ID 10,  ResourceProfileId 0
23/10/27 06:12:50 INFO BlockManagerMasterEndpoint: Registering block manager [2a05:d018:107:5a00:3855:6e2e:5ec2:9670]:46183 with 7.9 GiB RAM, BlockManagerId(2, [2a05:d018:107:5a00:3855:6e2e:5ec2:9670], 46183, None)
23/10/27 06:12:50 INFO BlockManagerMasterEndpoint: Registering block manager [2a05:d018:107:5a00:dc42:d9ab:6f13:a7f4]:33559 with 7.9 GiB RAM, BlockManagerId(8, [2a05:d018:107:5a00:dc42:d9ab:6f13:a7f4], 33559, None)
23/10/27 06:12:50 INFO BlockManagerMasterEndpoint: Registering block manager [2a05:d018:107:5a00:7ccc:70a0:84d8:7e13]:35807 with 7.9 GiB RAM, BlockManagerId(5, [2a05:d018:107:5a00:7ccc:70a0:84d8:7e13], 35807, None)
23/10/27 06:12:50 INFO BlockManagerMasterEndpoint: Registering block manager [2a05:d018:107:5a00:b57:4798:8281:3de4]:37141 with 7.9 GiB RAM, BlockManagerId(6, [2a05:d018:107:5a00:b57:4798:8281:3de4], 37141, None)
23/10/27 06:12:50 INFO BlockManagerMasterEndpoint: Registering block manager [2a05:d018:107:5a00:ee9a:1854:d167:471b]:38219 with 7.9 GiB RAM, BlockManagerId(10, [2a05:d018:107:5a00:ee9a:1854:d167:471b], 38219, None)
23/10/27 06:12:51 INFO EmrServerlessClusterSchedulerBackend$EmrServerlessDriverEndpoint: Registered executor NettyRpcEndpointRef(spark-client://Executor) (2a05:d018:107:5a00:d2d1:d67e:bfaf:4be6:60494) with ID 4,  ResourceProfileId 0
23/10/27 06:12:51 INFO BlockManagerMasterEndpoint: Registering block manager [2a05:d018:107:5a00:d2d1:d67e:bfaf:4be6]:43541 with 7.9 GiB RAM, BlockManagerId(4, [2a05:d018:107:5a00:d2d1:d67e:bfaf:4be6], 43541, None)
23/10/27 06:12:51 INFO EmrServerlessClusterSchedulerBackend$EmrServerlessDriverEndpoint: Registered executor NettyRpcEndpointRef(spark-client://Executor) (2a05:d018:107:5a00:836b:42d:bf25:96c4:53088) with ID 9,  ResourceProfileId 0
23/10/27 06:12:51 INFO EmrServerlessClusterSchedulerBackend: SchedulerBackend is ready for scheduling beginning after reached minRegisteredResourcesRatio: 0.8
23/10/27 06:12:51 INFO EmrServerlessClusterSchedulerBackend$EmrServerlessDriverEndpoint: Registered executor NettyRpcEndpointRef(spark-client://Executor) (2a05:d018:107:5a00:ce83:e099:9129:7f2a:56734) with ID 3,  ResourceProfileId 0
23/10/27 06:12:51 INFO SharedState: Setting hive.metastore.warehouse.dir ('null') to the value of spark.sql.warehouse.dir.
23/10/27 06:12:51 INFO SharedState: Warehouse path is 'file:/home/hadoop/spark-warehouse'.
23/10/27 06:12:51 INFO BlockManagerMasterEndpoint: Registering block manager [2a05:d018:107:5a00:836b:42d:bf25:96c4]:44279 with 7.9 GiB RAM, BlockManagerId(9, [2a05:d018:107:5a00:836b:42d:bf25:96c4], 44279, None)
23/10/27 06:12:51 INFO BlockManagerMasterEndpoint: Registering block manager [2a05:d018:107:5a00:ce83:e099:9129:7f2a]:43411 with 7.9 GiB RAM, BlockManagerId(3, [2a05:d018:107:5a00:ce83:e099:9129:7f2a], 43411, None)
23/10/27 06:12:51 INFO EmrServerlessClusterSchedulerBackend$EmrServerlessDriverEndpoint: Registered executor NettyRpcEndpointRef(spark-client://Executor) (2a05:d018:107:5a00:d9d9:ced2:7d72:1aed:37034) with ID 1,  ResourceProfileId 0
23/10/27 06:12:52 INFO BlockManagerMasterEndpoint: Registering block manager [2a05:d018:107:5a00:d9d9:ced2:7d72:1aed]:42027 with 7.9 GiB RAM, BlockManagerId(1, [2a05:d018:107:5a00:d9d9:ced2:7d72:1aed], 42027, None)
23/10/27 06:12:53 INFO FlintREPL: Updated job: {"jobid": 00fe9epc1cjpq00q, "sessionId": NWpKUFBobmtiZ215czM=} from .query_execution_request_mys3
23/10/27 06:12:53 INFO FlintREPL: read from .query_execution_request_mys3
23/10/27 06:12:53 INFO FlintREPL: read from .query_execution_request_mys3
...
......
.........
23/10/27 06:27:48 INFO FlintREPL: read from .query_execution_request_mys3
23/10/27 06:27:48 INFO FlintREPL: read from .query_execution_request_mys3
23/10/27 06:27:49 INFO FlintREPL: read from .query_execution_request_mys3
23/10/27 06:27:49 INFO FlintREPL: read from .query_execution_request_mys3
23/10/27 06:27:50 INFO FlintREPL: read from .query_execution_request_mys3
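The log excerpt above shows the REPL's `read from .query_execution_request_mys3` poll repeating for over 15 minutes with no progress — exactly the loop a circuit breaker would interrupt. As a hypothetical sketch of the concept (the class name and threshold are invented for illustration, not Flint's actual code), a breaker could trip after too many consecutive idle polls:

```python
class PollCircuitBreaker:
    """Trips after max_idle_polls consecutive polls that make no progress."""

    def __init__(self, max_idle_polls):
        self.max_idle_polls = max_idle_polls
        self.idle = 0          # consecutive polls with no progress
        self.tripped = False   # once True, the caller should stop the loop

    def record(self, made_progress):
        """Record one poll's outcome; return True if the breaker has tripped."""
        if made_progress:
            self.idle = 0      # any progress resets the idle streak
        else:
            self.idle += 1
            if self.idle >= self.max_idle_polls:
                self.tripped = True
        return self.tripped
```

Unlike a pure wall-clock deadline, this trips on sustained inactivity, so a session that is still making progress is never killed mid-work.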
dai-chen commented 11 months ago

I think @kaituo already added a default timeout for the REPL job. Feel free to reopen if you still see the issue. Thanks!