RJ0222 opened this issue 3 years ago (status: Open)
I have the same question. Is there any progress so far?
Based on previous experience, it is just a matter of giving enough memory to Spark.
@kidrocknroll were you able to solve the problem? How much memory did you allocate?
A possible workaround is to run the job more frequently on smaller chunks. With roughly 15Gi of spans per day, running the job every 4h works using the spec below.
apiVersion: batch/v1beta1
kind: CronJob
metadata:
  labels:
    app: jaeger
    component: spark-dependencies
  name: jaeger-spark-5a46
  namespace: jaeger
spec:
  concurrencyPolicy: Allow
  failedJobsHistoryLimit: 5
  jobTemplate:
    metadata:
    spec:
      template:
        metadata:
          labels:
            app: jaeger
            component: spark-dependencies
        spec:
          containers:
          - env:
            - name: STORAGE
              value: elasticsearch
            - name: ES_NODES
              value: *****
            - name: JAVA_OPTS
              value: -XX:MaxRAMPercentage=75.0
            - name: ES_TIME_RANGE
              value: 4h
            - name: ES_NODES_WAN_ONLY
              value: "true"
            image: *****/jaeger-spark-deps:0.0.1-2
            name: jaeger-spark-5a46
            resources:
              limits:
                cpu: "1"
                memory: 8Gi
              requests:
                cpu: 200m
                memory: 8Gi
          enableServiceLinks: false
  schedule: 15 */4 * * *
  startingDeadlineSeconds: 300
  successfulJobsHistoryLimit: 1
  suspend: false
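The design choice here, as I read it, is that the query window matches the cron period: assuming ES_TIME_RANGE limits how far back the job reads spans from Elasticsearch, each run only has to process roughly a 4h slice (about 2.5Gi of the ~15Gi day) instead of the whole index. The two coupled settings from the spec above, annotated:

# ES_TIME_RANGE and the cron schedule move together: shrink one, shrink the other.
- name: ES_TIME_RANGE
  value: 4h              # only read spans from the last 4 hours
# ...
schedule: 15 */4 * * *   # run once per 4-hour window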
In my Kubernetes cluster, jaeger-spark runs every 8 hours: the first run at night, the second during the day, the third in the evening. Only the night job completes successfully. Taints/tolerations are configured for the jaeger-spark job, so the pod is scheduled only on a dedicated node.
The following resources have been allocated to the pod:
resources:
  limits:
    cpu: 8192m
    memory: 100Gi
  requests:
    cpu: 4096m
    memory: 100Gi
And the heap size is set as follows:
- name: JAVA_OPTS
  value: "-Xms100g -Xmx100g"
The span data is around 200G, but the pod gets OOM-killed. Span sizes and memory metrics are shown in the screenshots. Can you please tell me what the problem is with this memory consumption, and how much memory the pod needs in this case?
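One thing that stands out in this setup: -Xmx100g equals the 100Gi container limit, so the JVM's non-heap memory (metaspace, thread stacks, direct buffers) has no headroom, and the kernel can OOM-kill the pod even when the heap itself still fits. A minimal sketch of a percentage-based alternative inside the same limit (illustrative value, not a verified fix for ~200G of spans):

# Sketch: size the heap as a fraction of the container limit instead of pinning
# -Xmx to the full 100Gi, leaving room for non-heap JVM memory.
- name: JAVA_OPTS
  value: "-XX:MaxRAMPercentage=75.0"   # roughly a 75Gi heap under a 100Gi limit

Note that MaxRAMPercentage only respects the container limit on JVMs recent enough to have container support, which ties into the next comment.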
We encountered the same issue.
Using the latest container image (not available on Docker Hub, only on ghcr.io) fixed the issue for us, possibly because it ships JRE 11 instead of JRE 8, and JRE 11 enables -XX:+UseContainerSupport by default.
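For anyone applying that to the CronJob spec above, the change is just the image reference; the exact ghcr.io path and tag below are assumptions, so check the published packages of the jaegertracing/spark-dependencies project for the current one:

# Sketch: pull the newer (JRE 11-based) build from GitHub Container Registry
# instead of the old Docker Hub image. Path and tag are assumptions; verify them.
image: ghcr.io/jaegertracing/spark-dependencies/spark-dependencies:latest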
I've encountered the same problem. My environment is an Ubuntu virtual machine with 32 GB of RAM and 250 GB of storage, so I redirected Spark's temp files to part of the disk: the disk is split into two partitions, and the second one, mounted at /data/, holds more than 80% of the 250 GB. I created a /data/tmp/ directory there and assigned it to the spark.local.dir variable, i.e. "spark.local.dir=/data/tmp". Here's how I solved it:
I run Spark with this configuration:
pyspark --packages io.delta:delta-core_2.12:2.3.0 \
  --conf "spark.sql.extensions=io.delta.sql.DeltaSparkSessionExtension" \
  --conf "spark.sql.catalog.spark_catalog=org.apache.spark.sql.delta.catalog.DeltaCatalog" \
  --conf "spark.executor.instances=10" \
  --conf "spark.driver.memory=32g" \
  --conf "spark.executor.memory=32g" \
  --conf "spark.memory.fraction=0.9" \
  --conf "spark.executor.heartbeatInterval=30s" \
  --conf "spark.network.timeout=600s" \
  --conf "spark.task.maxFailures=10" \
  --conf "spark.sql.files.maxPartitionBytes=512m" \
  --conf "spark.sql.debug.maxToStringFields=1000" \
  --conf "spark.sql.parquet.int96RebaseModeInWrite=LEGACY" \
  --conf "spark.executor.extraJavaOptions=-XX:+UseG1GC -XX:InitiatingHeapOccupancyPercent=35 -XX:G1HeapRegionSize=32M" \
  --conf "spark.driver.extraJavaOptions=-XX:+UseG1GC -XX:InitiatingHeapOccupancyPercent=35 -XX:G1HeapRegionSize=32M" \
  --conf "spark.network.timeout=300s" \
  --conf "spark.driver.cores=8" \
  --conf "spark.local.dir=/data/tmp"
And in my Jupyter notebook, I use this configuration:
import pyspark

spark_conf = pyspark.SparkConf() \
    .setAppName("myApp") \
    .set("spark.driver.maxResultSize", "32g") \
    .set("spark.sql.debug.maxToStringFields", "1000") \
    .set("spark.jars", "postgresql-42.6.0.jar") \
    .set("spark.driver.extraClassPath", "./postgresql-42.6.0.jar") \
    .set("spark.sql.autoBroadcastJoinThreshold", "-1") \
    .set("spark.ui.showConsoleProgress", "false") \
    .set("spark.executor.memoryOverhead", "600") \
    .set("spark.executor.heartbeatInterval", "120s") \
    .set("spark.sql.adaptive.enabled", "true") \
    .set("spark.sql.adaptive.skewJoin.enabled", "true") \
    .set("spark.sql.adaptive.skewJoin.skewedPartitionFactor", "3") \
    .set("spark.memory.fraction", "0.9") \
    .set("spark.driver.memory", "32g") \
    .set("spark.executor.memory", "32g") \
    .set("spark.task.maxFailures", "10") \
    .set("spark.sql.files.maxPartitionBytes", "512m") \
    .set("spark.sql.parquet.int96RebaseModeInWrite", "LEGACY") \
    .set("spark.rpc.numRetries", "5") \
    .set("spark.executor.extraJavaOptions", "-XX:+UseG1GC -XX:InitiatingHeapOccupancyPercent=35 -XX:G1HeapRegionSize=32M") \
    .set("spark.driver.extraJavaOptions", "-XX:+UseG1GC -XX:InitiatingHeapOccupancyPercent=35 -XX:G1HeapRegionSize=32M")
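As far as I understand it, spark.local.dir is where Spark writes its scratch data (shuffle map outputs and blocks spilled to disk), so pointing it at the larger /data partition mainly prevents the job from filling up the default /tmp during heavy shuffles; it does not by itself lower heap usage, but it lets spills that would otherwise fail complete.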
Are there any plans to optimize resource usage? I am unable to process 20GB of spans with 128GB of memory.
Problem
How much memory does a spark-dependencies job need to handle a data index of about 12Gb?
I am totally new to the Spark project, and I have tried several times to run a spark-dependencies job to create the DAG.
It always fails with the error below, even though I have increased the memory limit to about 28Gi.
Sometimes even a copyOfRange error occurs.
Environment
spark job configuration
ES data size
Is there a way to solve this problem other than increasing the memory limit, or is it just a usage problem on my side?
Any suggestions or tips would be greatly appreciated.