smartguo opened this issue 3 years ago
There is an environment variable, MARS_USE_PROCESS_STAT, which tells Mars to collect CPU and memory usage statistics for Mars processes only. Could you give it a try and see if it works?
I tried starting the worker with `MARS_USE_PROCESS_STAT=1 mars-worker -a <ip> -p <port> -s <scheduler_ip:scheduler_port>` and printed the environment variable using `mr.remote` to confirm it was set, but the problem persists.
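For reference, a minimal sketch of how the variable can be checked on the workers, assuming the `mars.remote` spawn API; `check_env` is just an illustrative helper:

```python
import os
import mars.remote as mr

def check_env():
    # Runs inside a worker process and reports what that process sees.
    return os.environ.get('MARS_USE_PROCESS_STAT')

# Spawn the function on the cluster and fetch its result back.
print(mr.spawn(check_env).execute().fetch())
```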
@qinxuye By the way, besides memory detection, CPU detection may need the same treatment. Take deploying on YARN for example: if I give the scheduler just 1 CPU core, tasks will wait to be scheduled and hang whenever other processes are running on the machine.
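To illustrate the distinction being discussed, here is my own sketch using psutil, not Mars's actual implementation: machine-wide stats count everything on the host, while process-only stats restrict the measurement to one process tree:

```python
import psutil

# Machine-wide view: counts every process on the host, including
# anything running alongside Mars.
print(psutil.cpu_percent(interval=1))    # host CPU usage, percent
print(psutil.virtual_memory().percent)   # host memory usage, percent

# Process-only view: restrict the measurement to this process and
# its children, approximating "Mars processes only".
root = psutil.Process()
family = [root] + root.children(recursive=True)
cpu = sum(p.cpu_percent(interval=0.1) for p in family)
rss = sum(p.memory_info().rss for p in family)
print(cpu, rss // 2**20, 'MiB')
```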
Yes, MARS_USE_PROCESS_STAT covers both the CPU and the memory part; we will check what happened.
I've tested the MARS_USE_PROCESS_STAT variable as follows:
```python
import os
from mars.deploy.yarn import new_cluster
import mars.tensor as mt

os.environ['JAVA_HOME'] = '/usr/java/jdk1.8.0_191-amd64'
os.environ['HADOOP_HOME'] = "/opt/cloudera/parcels/CDH/lib/hadoop/"
os.environ['ARROW_LIBHDFS_DIR'] = "/opt/cloudera/parcels/CDH/lib64/"

cluster = new_cluster(
    environment='python:///opt/anaconda3/envs/pymodel/bin/python',
    scheduler_num=1,
    web_num=1,
    app_name="mars-app-test",
    worker_num=4,
    worker_cpu=8,
    worker_mem='16g',
    min_worker_num=2,
    worker_extra_env={
        # used by pyarrow to read HDFS parquet files
        "ARROW_LIBHDFS_DIR": "/opt/cloudera/parcels/CDH/lib64/",
        "MARS_USE_PROCESS_STAT": "1"
    },
    scheduler_extra_env={
        # used by pyarrow to read HDFS parquet files
        "ARROW_LIBHDFS_DIR": "/opt/cloudera/parcels/CDH/lib64/",
        "MARS_USE_PROCESS_STAT": "1"
    },
    worker_cache_mem='3g')

a = mt.random.rand(10, 10)
print(a.dot(a.T).execute())
```
With this code, in particular the explicit `worker_cache_mem='3g'`, it works well. The web UI is shown below:
But when I drop `worker_cache_mem='3g'`, it hangs, and the web UI then looks as follows:
FYI, my worker server's memory info when idle is shown below:
```
smartguo@myhost ~/mars> free -h
              total        used        free      shared  buff/cache   available
Mem:            62G        8.6G         22G        1.2G         31G         52G
Swap:            0B          0B          0B
```
**Describe the bug**
When deploying on YARN, or even when just starting a Mars cluster from the command line, the Mars web UI shows high memory usage even without any task. Memory usage detection is currently based on the whole machine, but on YARN it should only count the memory used by Mars itself. Otherwise the error

```
w:0:MemQuotaActor met hard memory limitation: request 0, available -23158964224, hard limit 121388728320
```

will be raised by mistake.

**To Reproduce**
To help us reproduce this bug, please provide information below:

**Expected behavior**
Calculate memory usage from Mars processes rather than the whole machine.
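As a side note on the numbers in that error (my own back-of-the-envelope reading, not from the issue): the values are byte counts, and a negative "available" means machine-wide accounting already sees more memory in use than the quota allows, so even a zero-byte request is rejected:

```python
# Interpreting the byte counts from the MemQuotaActor error (illustrative only).
hard_limit = 121388728320   # hard memory limit, bytes
available = -23158964224    # "available" memory as Mars computed it, bytes

print(hard_limit / 2**30)   # ~113.1 GiB quota derived from whole-machine stats
print(available / 2**30)    # ~-21.6 GiB: the host looks over-committed, so a
                            # request of 0 bytes still trips the hard limit
```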