Open yatinece opened 3 years ago
Hi, your application hit memory limitation set for Yarn container. If this memory usage is correct, then you can increase max container memory on Yarn or remove memory limitation check. If it used more memory than expected, there maybe memory leak or other exceptions.
@yangw1234 for TFpark related memory consumption.
@qiyuangong , I think the container is taking a lot more memory than expected. I think it is not clearing the old GC when training is running and hence it run's out of memory. Any way I can test this or solve this?
@qiyuangong , I think the container is taking a lot more memory than expected. I think it is not clearing the old GC when training is running and hence it run's out of memory. Any way I can test this or solve this?
Hi @yatinece . I don't think Yarn container took too much memory. Yarn container is only a resource unit for application. According to my experience, your training application is trying to consume more memory than it applied from Yarn, i.e., 39GB. So, Nodemanager killed this application by killing this container.
@qiyuangong : Thanks, I increased the container memory to 79G also. We have a cap of 80GB, the only thing changed was that the model trained for 1 more epoch before the same error. So I think continuous training keep on increasing memory requirement. New to this , so may be using very generic terms.
@yatinece I wonder if there is a way for you to provide some sample code as well as the parameters (e.g., model size, data size, job parameters and configurations, etc.), so that we can try to replicate it on our side to understand why it needs large memory size.
In addition, an alternative is to try to restart the training from check-point if the job fails (instead of from scratch). @yangw1234 please provide some examples on how to do this.
@jason-dai Following are details. Train data has 12.4 M rows.
Some parts of code @yangw1234 : if you can share the checkpoint logic or suggest another thing, I can test that as well.
############Library################
import argparse
import csv
import os
import time
import datetime
import numpy as np
import importlib
import tensorflow as tf
import pandas as pd
import pyspark.sql.functions as F
from bigdl.optim.optimizer import TrainSummary, ValidationSummary
from bigdl.optim.optimizer import *
from bigdl.optim.optimizer import *
from pyspark.sql.functions import when
from bigdl.util.common import init_engine
from bigdl.util.engine import prepare_env
from pyspark.sql import HiveContext, Row, SparkSession
from pyspark.sql.functions import concat, col, udf, lit,to_timestamp
from pyspark.sql.types import FloatType,DoubleType,ArrayType, IntegerType
from pyspark.ml.feature import StringIndexer,VectorAssembler,CountVectorizer
from zoo import init_nncontext, init_spark_conf
from zoo.orca.learn.tf.estimator import Estimator
from zoo.tfpark import TFOptimizer,KerasModel, TFDataset, TFPredictor
from bigdl.optim.optimizer import MaxEpoch, EveryEpoch
from zoo.pipeline.api.keras.metrics import Accuracy, BinaryAccuracy
from pyspark.ml.evaluation import BinaryClassificationEvaluator
import pandas as pd
import numpy as np
import os
import time
import sys
from zoo.tfpark import KerasModel, TFDataset, TFPredictor
import tensorflow as tf
from keras import backend as K
############Embedding shape################
dict_size={'Feature1': (68, 34),
'Feature2': (3, 2),
'Feature3': (7, 4),
'Feature4': (74, 37),
'Feature5': (14, 7),
'Feature6': (37, 19),
'Feature7': (2, 1),
'Feature8': (2, 1),
'Feature9': (11, 6),
'Feature10': (5, 3),
'Feature11': (13, 7),
'Feature12': (2, 1),
'Feature13': (41, 21),
'Feature14': (6, 3),
'Feature15': (27, 14),
'Feature16': (7, 4),
'Feature17': (372, 50),
'Feature18': (6690, 50),
'Feature19': (3, 2),
'Feature20': (3, 2),
'Feature21': (2, 1),
'Feature22': (3, 2),
'Feature23': (2, 1),
'Feature24': (2, 1),
'Feature25': (4, 2),
'Feature26': (2, 1),
'Feature27': (2, 1),
'Feature28': (2, 1)}
features = ['Feature1', 'Feature2',
'Feature3', 'Feature4', 'Feature5', 'Feature6','Feature7',
'Feature8', 'Feature9', 'Feature10',
'Feature11', 'Feature12','Feature13','Feature14',
'Feature15','Feature16','Feature17',
'Feature18', 'Feature19','Feature20',
'Feature21','Feature22','Feature23','Feature24','Feature25',
'Feature26','Feature27' ,'Feature28']
dense_features = ['Feature29','Feature30']
target_feature= 'label'
############model################
def create_model( catcols, dense_features_n, m):
inputs = []
outputs = []
outputs_dense = []
categorical_output = []
layers_dict={}
iic=0
for c in catcols:
iic=iic+1
num_unique_values,embed_dim = dict_size[c]
layers_dict[c] = layers.Input(shape=(1,),name='input_'+str(iic))
out = layers.Embedding(num_unique_values + 1, embed_dim)(layers_dict[c])
out = layers.SpatialDropout1D(0.3)(out)
out = layers.Reshape(target_shape=(embed_dim, ))(out)
out_c = layers.Embedding(num_unique_values + 1, m)(layers_dict[c])
out_c = layers.SpatialDropout1D(0.3)(out_c)
out_c = layers.Reshape(target_shape=(m, ))(out_c)
inputs.append(layers_dict[c])
outputs.append(out)
categorical_output.append(out_c)
for c in (dense_features_n):
iic=iic+1
layers_dict[c] = layers.Input(shape=(1,),name='input_'+str(iic))
inputs.append(layers_dict[c])
outputs.append(layers_dict[c])
x = layers.Concatenate()(outputs)
x = layers.BatchNormalization()(x)
x = layers.Dense(256, activation="relu")(x)
x = layers.Dropout(0.3)(x)
x = layers.BatchNormalization()(x)
x = layers.Dense(256, activation="relu")(x)
x = layers.Dropout(0.3)(x)
x = layers.BatchNormalization()(x)
x = layers.Dense(128, activation="relu")(x)
x = layers.Dropout(0.3)(x)
x = layers.BatchNormalization()(x)
deep_out = layers.Dropout(0.15)(x)
first_order = [emb1 for emb1 in categorical_output]
second_order = []
for emb1, emb2 in combinations(categorical_output, 2):
dot_layer = dot([emb1, emb2], axes=1, normalize=False)
second_order.append(dot_layer)
first_order = Add()(first_order)
second_order = Add()(second_order)
total_out = layers.Concatenate()([deep_out,first_order,second_order])
total_out = layers.Dense(100, activation="relu")(total_out)
y = layers.Dense(1, activation="sigmoid")(total_out)
model = Model(inputs=inputs, outputs=y)
model.compile(loss= 'binary_crossentropy' , optimizer=tf.keras.optimizers.Adam(learning_rate=1e-5))
return model
############Model size################
#Total params: 585,839
#Trainable params: 583,999
#Non-trainable params: 1,840
############Read data################
train_df=spark.read.parquet(loc)
test_df=spark.read.parquet(loc)
train_df=train_df.select(features+dense_features+[ target_feature])
test_df=test_df.select(features+dense_features+[ target_feature])
from pyspark.sql.functions import col
train_df=train_df.select(*(col(c).cast("float").alias(c) for c in train_df.columns))
test_df=test_df.select(*(col(c).cast("float").alias(c) for c in test_df.columns))
from pyspark.ml.feature import VectorAssembler
features_list=features+dense_features
ik=0
for col in features_list:
ik=ik+1
print(str(ik)+ ' print for vector')
vectorAssembler_input_2 = VectorAssembler(inputCols=[col], outputCol='input_'+str(ik))
train_df = vectorAssembler_input_2.transform(train_df)
ik=0
for col in features_list:
ik=ik+1
print(str(ik)+ ' print for vector')
vectorAssembler_input_2 = VectorAssembler(inputCols=[col], outputCol='input_'+str(ik))
test_df = vectorAssembler_input_2.transform(test_df)
print("start modeling train")
print("model name is "+model_name)
# create instance of model
print("remote model save path is "+ save_to_remote_dir)
print("*"*80)
os.system('hdfs dfs -rm -r ' + str(Final_model_save_to_remote_dir))
print(create_model( features, dense_features, 5).summary())
est_ck = Estimator.from_keras(create_model( features, dense_features, 5),model_dir=save_to_remote_dir)
est_ck.fit(data=train_df,batch_size=batch_size,epochs=max_epoch,feature_cols=[ 'input_'+str(k+1) for k in range(ik)],labels_cols=[ target_feature],validation_data= test_df,checkpoint_trigger=EveryEpoch())
hi @yatinece here is an example on loading checkpoint
@yangw1234 : Let me try and let you know. Any other suggestion on why memory is being occupied?
@yatinece we are still looking into the issue. Would you mind sharing your application resource configuration, like the number of nodes, the number of cores per node, node memory, etc?
@yatinece it seems the problem is caused by a memory leak issue in intel optimized tensorflow. Can you try setting the TF_DISABLE_MKL
environment variable to "1" on each spark exeuctor, to disable it?
Here is the code if you are using init_orca_context
init_orca_context(..., conf={"spark.executorEnv.TF_DISABLE_MKL": "1"})
@yangw1234 : Thanks for this. i am using this to initiate the session :- ( will update once the job runs)
sparkConf = init_spark_conf()
sc = init_nncontext(sparkConf)
spark = SparkSession \
.builder \
.appName(app_name) \
.getOrCreate()
init_engine()
I have now changed init_spark_conf() to init_spark_conf(conf={"spark.executorEnv.TF_DISABLE_MKL": "1"})
i have been using following settings. i am changing num-executors ,batch size and XX:ParallelGCThreads to see , if it can run .
spark-submit \ --conf spark.yarn.dist.archives=hdfs://xxx/venv.zip \ --conf spark.yarn.appMasterEnv.PYSPARK_PYTHON=venv.zip/venv/bin/python \ --conf spark.pyspark.python=venv.zip/venv/bin/python \ --conf spark.pyspark.driver.python=venv.zip/venv/bin/python \ --conf spark.executor.memoryOverhead=6096 \ --conf spark.driver.memoryOverhead=6096 \ --master yarn \ --deploy-mode cluster \ --conf spark.port.maxRetries=100 \ --conf spark.yarn.executor.memoryOverhead=5000000000 \ --executor-memory 33g \ --driver-memory 33g \ --executor-cores 2 \ --num-executors 160 \ --queue root.user.yyy \ --conf spark.executor.extraJavaOptions="-Xss512m" \ --conf spark.driver.extraJavaOptions="-Xss512m" \ --driver-java-options "-XX:MaxPermSize=2G -XX:+UseG1GC" \ --conf spark.driver.extraJavaOptions="-XX:ParallelGCThreads=28 -XX:+UseParallelGC -XX:+UseParallelOldGC " \ --conf spark.executor.extraJavaOptions="-XX:ParallelGCThreads=28 -XX:+UseParallelGC -XX:+UseParallelOldGC" \ --conf spark.sql.catalogImplementation=hive \ --conf spark.sql.shuffle.partitions=200 \ --conf spark.dynamicAllocation.enabled=false \ --conf spark.ui.showConsoleProgress=true \ --conf spark.yarn.maxAppAttempts=1 \ --conf spark.locality.wait=2s \ --conf spark.driver.maxResultSize=10g \ --conf spark.sql.parquet.binaryAsString=true \ --conf spark.shuffle.consolidateFiles=false \ --conf spark.rdd.compress=true \ --conf spark.stage.maxConsecutiveAttempts=30 \ --jars hdfs://xxx/analytics-zoo-bigdl_0.10.0-spark_2.4.3-0.9.0.jar \ --files hdfs://xxx/spark-analytics-zoo.conf \ --py-files hdfs://xxx/analytics-zoo-bigdl_0.10.0-spark_2.4.3-0.9.0-python-api.zip hdfs://xxx4/abc_v1.py
@yangw1234 :- The suggested setting is working. Thanks for this.
For checkpoint suggestion:- i implemented a code like this : -
max_epoch=15
step=5
for k in range(max_epoch//step):
tf.reset_default_graph()
model=create_model( features, dense_features, 5)
est_ck = Estimator.from_keras(keras_model=model,model_dir=save_to_remote_dir)
if k >0 : est_ck.load_latest_orca_checkpoint(save_to_remote_dir)
est_ck.fit(data=train_df,batch_size=batch_size,epochs=step,feature_cols=[ 'input_'+str(k+1) for k in range(ik)],labels_cols=[ target_feature],validation_data= test_df,checkpoint_trigger=SeveralIteration(4))
eval_result = est_ck.evaluate(test_df,feature_cols=[ 'input_'+str(k+1) for k in range(ik)],labels_cols=[ target_feature],hard_code_batch_size=False)
print(eval_result)
print('scoring the model' +"_epoch_"+str(k))
output :-
{'loss': 0.5859652161598206}
scoring the model_epoch_0
Traceback (most recent call last):
File "xxxx.py", line 323, in <module>
if k >0 : est_ck.load_latest_orca_checkpoint(save_to_remote_dir)
File "/xxxx/analytics-zoo-bigdl_0.10.0-spark_2.4.3-0.9.0-python-api.zip/zoo/orca/learn/tf/estimator.py", line 45, in load_latest_orca_checkpoint
Exception: Cannot find checkpoint
It is not saving the checkpoint to be loaded. Do you see any mistake : -
@yatinece Sorry for the late response. I was on leave for the last few days.
The save_to_remte_dir
variable, is it a remote path like “hdfs://xxxx” or a local file system path?
load_latest_orca_checkpoint
does not support "hdfs://" yet, we are currently working on it.
@yangw1234 :- I am running the codes via cluster mode (--deploy-mode cluster)in spark. So I think in this case the checkpoint will not work. Do I need to move this to client mode to save the file in the local system?
@yatinece we are working on the hdfs support https://github.com/intel-analytics/analytics-zoo/pull/3091
@yangw1234 since intel-analytics/analytics-zoo#3091 is already merged, any suggestion for next step?
this issue has been fixed
@yangw1234 @jenniew This is the previous memory leak issue; is it fixed?
I'm still verifying if the intel mkl patch can fix memory leak issue.
Hi ,
I am trying to train the implementation of Factorization machine using analytics zoo orca estimator. The problem I face is I get the following error when I run it for big epochs
2020-10-22 06:19:25 ERROR YarnClusterScheduler:73 - Lost executor 8 on hdp2stl020386.mastercard.int: Container killed by YARN for exceeding memory limits. 39.0 GB of 39 GB physical memory used. Consider boosting spark.yarn.executor.memoryOverhead or disabling yarn.nodemanager.vmem-check-enabled because of YARN-4714. 2020-10-22 06:19:46 ERROR TaskSetManager:73 - Task 16 in stage 63971.0 failed 4 times; aborting job Traceback (most recent call last): File "digital_standard_score.py", line 311, in
est_ck.fit(data=train_df,batch_size=batch_size,epochs=max_epoch,featurecols=[ 'input'+str(k+1) for k in range(ik)],labels_cols=[ target_feature],validation_data= test_df,checkpoint_trigger=EveryEpoch())
File "/dfs/2/yarn/nm/usercache/e074454/appcache/application_1602667222169_23829/container_e20_1602667222169_23829_01_000007/analytics-zoo-bigdl_0.10.0-spark_2.4.3-0.9.0-python-api.zip/zoo/orca/learn/tf/estimator.py", line 476, in fit
File "/dfs/2/yarn/nm/usercache/e074454/appcache/application_1602667222169_23829/container_e20_1602667222169_23829_01_000007/analytics-zoo-bigdl_0.10.0-spark_2.4.3-0.9.0-python-api.zip/zoo/tfpark/tf_optimizer.py", line 744, in optimize
File "/dfs/2/yarn/nm/usercache/e074454/appcache/application_1602667222169_23829/container_e20_1602667222169_23829_01_000007/analytics-zoo-bigdl_0.10.0-spark_2.4.3-0.9.0-python-api.zip/zoo/pipeline/estimator/estimator.py", line 168, in train_minibatch
File "/dfs/2/yarn/nm/usercache/e074454/appcache/application_1602667222169_23829/container_e20_1602667222169_23829_01_000007/analytics-zoo-bigdl_0.10.0-spark_2.4.3-0.9.0-python-api.zip/zoo/common/utils.py", line 133, in callZooFunc
File "/dfs/2/yarn/nm/usercache/e074454/appcache/application_1602667222169_23829/container_e20_1602667222169_23829_01_000007/analytics-zoo-bigdl_0.10.0-spark_2.4.3-0.9.0-python-api.zip/zoo/common/utils.py", line 127, in callZooFunc
File "/opt/cloudera/parcels/CDH-6.3.3-1.cdh6.3.3.p4204.5870739/lib/spark/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py", line 1257, in call
File "/opt/cloudera/parcels/CDH-6.3.3-1.cdh6.3.3.p4204.5870739/lib/spark/python/lib/pyspark.zip/pyspark/sql/utils.py", line 63, in deco
File "/opt/cloudera/parcels/CDH-6.3.3-1.cdh6.3.3.p4204.5870739/lib/spark/python/lib/py4j-0.10.7-src.zip/py4j/protocol.py", line 328, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o118.estimatorTrainMiniBatch.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 16 in stage 63971.0 failed 4 times, most recent failure: Lost task 16.3 in stage 63971.0 (TID 639718, hdp2stl020395.mastercard.int, executor 21): java.util.concurrent.ExecutionException: java.lang.RuntimeException: Didn't find weight block test_0weightBytes14 in the block manager. Did you initialize this AllReduceParameter on every executor?
This is the discussion on the error in analytics zoo :- https://github.com/intel-analytics/analytics-zoo-internal/issues/860 which say this is a memory issue. I have increased the executor memory and reduce the batch size, but it will delay the memory exception. If it was failing in 2 epoch, now it will fail in 3rd epoch.
Is it an issue that the job keeps on capturing more memory as it runs?. Can we do something to reduce memory consumption?.