mars-project / mars

Mars is a tensor-based unified framework for large-scale data computation which scales numpy, pandas, scikit-learn and Python functions.
https://mars-project.readthedocs.io
Apache License 2.0

[BUG] Wrong memory usage detection on yarn #1793

Open smartguo opened 3 years ago

smartguo commented 3 years ago

Describe the bug
When deploying on yarn, or just starting a Mars cluster from the command line, the Mars web UI shows high memory usage even when no task is running. Memory usage detection is currently based on the whole machine, but on yarn it should only count the memory used by Mars itself. Otherwise the error w:0:MemQuotaActor met hard memory limitation: request 0, available -23158964224, hard limit 121388728320 is raised by mistake.

To Reproduce
To help us reproduce this bug, please provide the information below:

  1. Your Python version: 3.7.9
  2. The version of Mars you use: pymars[distributed]==0.6.0
  3. Versions of crucial packages, such as numpy, scipy and protobuf: numpy==1.19.4, scipy==1.5.4, protobuf==3.14.0, pyarrow==2.0.0

Expected behavior
Calculate memory usage for Mars processes only, rather than for the whole machine.
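
To illustrate the difference, here is a rough psutil sketch (not Mars's actual code) of a machine-wide memory reading versus a reading limited to one process tree, which is roughly what a yarn container should be charged for:

import os
import psutil

# Machine-wide view: what a whole-machine stat reports
vm = psutil.virtual_memory()
print("machine used:", vm.used, "available:", vm.available)

# Process-tree view: RSS of the current process plus its children,
# roughly what a yarn container should be accounted for
proc = psutil.Process(os.getpid())
rss = proc.memory_info().rss
rss += sum(c.memory_info().rss for c in proc.children(recursive=True))
print("process tree rss:", rss)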

qinxuye commented 3 years ago

There is an environment variable, MARS_USE_PROCESS_STAT, which tells Mars to stat the CPU and memory usage of Mars processes only. Could you give it a try and see if it works?

smartguo commented 3 years ago

> There is an environment variable, MARS_USE_PROCESS_STAT, which tells Mars to stat the CPU and memory usage of Mars processes only. Could you give it a try and see if it works?

I tried starting MARS_USE_PROCESS_STAT=1 mars-worker -a <ip> -p <port> -s <scheduler_ip:scheduler_port> and printed the environment variable using mr.remote, but the problem persists.
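
The check looked roughly like this (a sketch of the kind of verification described above, assuming mr is mars.remote and that mr.spawn is available):

import os
import mars.remote as mr

def get_env():
    # executed on a worker, so this shows what the worker process sees
    return os.environ.get('MARS_USE_PROCESS_STAT')

print(mr.spawn(get_env).execute())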

smartguo commented 3 years ago

@qinxuye By the way, besides memory detection, CPU detection may need to be considered as well. Take deploying on yarn for example: if I give the scheduler just 1 CPU core, tasks will wait to be scheduled and hang while other processes are running on the machine.

qinxuye commented 3 years ago

> @qinxuye By the way, besides memory detection, CPU detection may need to be considered as well. Take deploying on yarn for example: if I give the scheduler just 1 CPU core, tasks will wait to be scheduled and hang while other processes are running on the machine.

Yes, MARS_USE_PROCESS_STAT covers both the CPU and memory parts; we will check what happened.
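
For context, the machine-wide CPU reading and the per-process-tree reading can also be compared with psutil (a sketch, not Mars internals):

import os
import time
import psutil

# Machine-wide CPU percentage over one second
print("machine:", psutil.cpu_percent(interval=1))

# CPU percentage of this process and its children only
procs = [psutil.Process(os.getpid())]
procs += procs[0].children(recursive=True)
for p in procs:
    p.cpu_percent(None)  # prime the per-process counters
time.sleep(1)
print("process tree:", sum(p.cpu_percent(None) for p in procs))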

smartguo commented 3 years ago

I've tested the MARS_USE_PROCESS_STAT variable as follows:

import os
from mars.deploy.yarn import new_cluster
import mars.tensor as mt

os.environ['JAVA_HOME'] = '/usr/java/jdk1.8.0_191-amd64'
os.environ['HADOOP_HOME'] = "/opt/cloudera/parcels/CDH/lib/hadoop/"
os.environ['ARROW_LIBHDFS_DIR'] = "/opt/cloudera/parcels/CDH/lib64/"
cluster = new_cluster(
    environment='python:///opt/anaconda3/envs/pymodel/bin/python',
    scheduler_num=1,
    web_num=1,
    app_name="mars-app-test",
    worker_num=4,
    worker_cpu=8,
    worker_mem='16g',
    min_worker_num=2,
    worker_extra_env={
        "ARROW_LIBHDFS_DIR": "/opt/cloudera/parcels/CDH/lib64/", # used by pyarrow to read hdfs parquet file
        "MARS_USE_PROCESS_STAT": "1"
    },
    scheduler_extra_env={
        "ARROW_LIBHDFS_DIR": "/opt/cloudera/parcels/CDH/lib64/", # used by pyarrow to read hdfs parquet file
        "MARS_USE_PROCESS_STAT": "1"
    },
    worker_cache_mem='3g')

a = mt.random.rand(10, 10)
print(a.dot(a.T).execute())

With the code above, especially worker_cache_mem='3g', it works well. The web UI is shown below: [screenshot: Mars web UI with worker_cache_mem='3g']

But when I drop worker_cache_mem='3g', it hangs, and the web UI then looks like this: [screenshot: Mars web UI without worker_cache_mem]

FYI, my worker server's memory info when idle is shown below:

smartguo@myhost ~/mars> free -h
              total        used        free      shared  buff/cache   available
Mem:            62G        8.6G         22G        1.2G         31G         52G
Swap:            0B          0B          0B