microsoft / pai

Resource scheduling and cluster management for AI
https://openpai.readthedocs.io
MIT License

Failed to run job #1819

Closed jnpr-yhzhan closed 5 years ago

jnpr-yhzhan commented 5 years ago

Start Time: --

Finish Time: 2018/12/4 1:53:08 PM

Exit Diagnostics:

Failed to submit application due to non-transient error, maybe application is non-compliant. Exception:
org.apache.hadoop.yarn.exceptions.InvalidResourceRequestException: Invalid resource request, requested memory < 0, or requested memory > max configured, requestedMemory=4096, maxMemory=3072
    at org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerUtils.validateResourceRequest(SchedulerUtils.java:266)
    at org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerUtils.normalizeAndValidateRequest(SchedulerUtils.java:231)
    at org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerUtils.normalizeAndValidateRequest(SchedulerUtils.java:197)
    at org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.validateAndCreateResourceRequest(RMAppManager.java:497)
    at org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.createAndPopulateNewRMApp(RMAppManager.java:360)
    at org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.submitApplication(RMAppManager.java:300)
    at org.apache.hadoop.yarn.server.resourcemanager.ClientRMService.submitApplication(ClientRMService.java:636)
    at org.apache.hadoop.yarn.api.impl.pb.service.ApplicationClientProtocolPBServiceImpl.submitApplication(ApplicationClientProtocolPBServiceImpl.java:262)
    at org.apache.hadoop.yarn.proto.ApplicationClientProtocol$ApplicationClientProtocolService$2.callBlockingMethod(ApplicationClientProtocol.java:515)
    at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:503)
    at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:989)
    at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:868)
    at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:814)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:422)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1886)
    at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2603)

    at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
    at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
    at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
    at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
    at org.apache.hadoop.yarn.ipc.RPCUtil.instantiateException(RPCUtil.java:53)
    at org.apache.hadoop.yarn.ipc.RPCUtil.instantiateYarnException(RPCUtil.java:75)
    at org.apache.hadoop.yarn.ipc.RPCUtil.unwrapAndThrowException(RPCUtil.java:116)
    at org.apache.hadoop.yarn.api.impl.pb.client.ApplicationClientProtocolPBClientImpl.submitApplication(ApplicationClientProtocolPBClientImpl.java:280)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:422)
    at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeMethod(RetryInvocationHandler.java:165)
    at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invoke(RetryInvocationHandler.java:157)
    at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeOnce(RetryInvocationHandler.java:95)
    at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:359)
    at com.sun.proxy.$Proxy14.submitApplication(Unknown Source)
    at org.apache.hadoop.yarn.client.api.impl.YarnClientImpl.submitApplication(YarnClientImpl.java:270)
    at com.microsoft.frameworklauncher.common.utils.HadoopUtils.lambda$submitApplication$0(HadoopUtils.java:188)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:422)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1886)
    at com.microsoft.frameworklauncher.common.utils.HadoopUtils.submitApplication(HadoopUtils.java:184)
    at com.microsoft.frameworklauncher.service.Service.launchApplication(Service.java:577)
    at com.microsoft.frameworklauncher.service.Service.lambda$setupApplicationContext$1(Service.java:409)
    at com.microsoft.frameworklauncher.common.service.SystemTaskQueue.lambda$setupTaskExceptionHandler$0(SystemTaskQueue.java:81)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)
Caused by: org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.yarn.exceptions.InvalidResourceRequestException): Invalid resource request, requested memory < 0, or requested memory > max configured, requestedMemory=4096, maxMemory=3072
    at org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerUtils.validateResourceRequest(SchedulerUtils.java:266)
    at org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerUtils.normalizeAndValidateRequest(SchedulerUtils.java:231)
    at org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerUtils.normalizeAndValidateRequest(SchedulerUtils.java:197)
    at org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.validateAndCreateResourceRequest(RMAppManager.java:497)
    at org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.createAndPopulateNewRMApp(RMAppManager.java:360)
    at org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.submitApplication(RMAppManager.java:300)
    at org.apache.hadoop.yarn.server.resourcemanager.ClientRMService.submitApplication(ClientRMService.java:636)
    at org.apache.hadoop.yarn.api.impl.pb.service.ApplicationClientProtocolPBServiceImpl.submitApplication(ApplicationClientProtocolPBServiceImpl.java:262)
    at org.apache.hadoop.yarn.proto.ApplicationClientProtocol$ApplicationClientProtocolService$2.callBlockingMethod(ApplicationClientProtocol.java:515)
    at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:503)
    at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:989)
    at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:868)
    at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:814)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:422)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1886)
    at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2603)

    at org.apache.hadoop.ipc.Client.getRpcResponse(Client.java:1493)
    at org.apache.hadoop.ipc.Client.call(Client.java:1439)
    at org.apache.hadoop.ipc.Client.call(Client.java:1349)
    at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:227)
    at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:116)
    at com.sun.proxy.$Proxy13.submitApplication(Unknown Source)
    at org.apache.hadoop.yarn.api.impl.pb.client.ApplicationClientProtocolPBClientImpl.submitApplication(ApplicationClientProtocolPBClientImpl.java:277)
    ... 26 more

fanyangCS commented 5 years ago

" Invalid resource request, requested memory < 0, or requested memory > max configured, requestedMemory=4096, maxMemory=3072" Please check your job submission configuration to reduce the required memory.

jnpr-yhzhan commented 5 years ago

I tried to reduce the job to use 1024 MB of memory, but it doesn't take effect; the log always prints 'requestedMemory=4096, maxMemory=3072'. Is there any way to change both parameters to make it work?

mzmssg commented 5 years ago

@fanyangCS I suspect the 4 GB memory request is for the AM, so changing the job description doesn't help.

jnpr-yhzhan commented 5 years ago

@mzmssg Thanks for the quick response. Is there any quick fix or workaround? Otherwise, is there any way to increase the maxMemory so that I can bring the job up?

mzmssg commented 5 years ago

@jnpr-yhzhan Could you provide more hardware details? And please provide the YARN metrics at host_ip:8088.
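
For reference, the same metrics can also be pulled from the standard YARN ResourceManager REST API (host_ip below is the master node; these are stock Hadoop endpoints, not PAI-specific):

# Cluster-wide memory/vcore totals and usage
curl http://host_ip:8088/ws/v1/cluster/metrics
# Scheduler and queue information
curl http://host_ip:8088/ws/v1/cluster/scheduler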

Currently, I think there are two options for a quick evaluation; please don't use them in a production environment:

  1. Reduce the AM memory request, build yarn-frameworklauncher yourself, and then deploy it with your own image. https://github.com/Microsoft/pai/blob/c697d14ff675ebf6322850d0a2277e1f35262a8c/subprojects/frameworklauncher/yarn/src/main/java/com/microsoft/frameworklauncher/common/model/PlatformSpecificParametersDescriptor.java#L35

  2. Reduce the mem_reserved here, then redeploy. This "tricks" the system about the available memory, so you can request resources successfully, but we can't guarantee the job will actually get that much real memory. https://github.com/Microsoft/pai/blob/c697d14ff675ebf6322850d0a2277e1f35262a8c/src/hadoop-node-manager/deploy/hadoop-node-manager-configuration/nodemanager-generate-script.sh#L69

Option 2 seems to be the simpler method. Hope it helps.
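
For illustration only, the idea behind option 2 looks roughly like this; the variable names and values below are assumptions, not the actual contents of nodemanager-generate-script.sh linked above:

# Hypothetical sketch of option 2, not the real PAI script.
# The node manager's usable memory is derived from physical memory minus a reserve,
# so shrinking the reserve raises the memory YARN is allowed to allocate on small nodes.
total_mem_mb=$(free -m | awk '/^Mem:/{print $2}')
mem_reserved_mb=4096   # assumed value; the default reserve for OS + PAI services is larger
yarn_mem_mb=$((total_mem_mb - mem_reserved_mb))
echo "memory available to YARN containers: ${yarn_mem_mb} MB"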

jnpr-yhzhan commented 5 years ago

@mzmssg The master has 8 cores and 32 GB of memory; the two worker nodes each have 4 cores and 16 GB of memory.

jnpr-yhzhan commented 5 years ago

@mzmssg Actually, I'm not sure where the "maxMemory=3072" setting comes from. Is it hard-coded? There is no such setting in the yarn-site configuration.

Scheduler Type: Capacity Scheduler
Scheduling Resource Type: [MEMORY, CPU, GPU]
Minimum Allocation: <memory:1024, vCores:1, GPUs:0>
Maximum Allocation: <memory:3072, vCores:4, GPUs:0>

jnpr-yhzhan commented 5 years ago

It seems there is no yarn-site.xml on the node manager:

root@iZ8vb7msyiau0hs41num3oZ:~# docker exec -it 26515032eeb4 /bin/bash
root@iZ8vb7msyiau0hs41num3oZ:/# cd hadoop-configuration/
root@iZ8vb7msyiau0hs41num3oZ:/hadoop-configuration# ll
total 12
drwxrwxrwx 3 root root 4096 Dec 4 05:39 ./
drwxr-xr-x 1 root root 4096 Dec 4 05:39 ../
drwxr-xr-x 2 root root 4096 Dec 4 05:39 ..2018_12_04_05_39_03.806503402/
lrwxrwxrwx 1 root root 31 Dec 4 05:39 ..data -> ..2018_12_04_05_39_03.806503402/
lrwxrwxrwx 1 root root 20 Dec 4 05:39 core-site.xml -> ..data/core-site.xml
lrwxrwxrwx 1 root root 20 Dec 4 05:39 hadoop-env.sh -> ..data/hadoop-env.sh
lrwxrwxrwx 1 root root 20 Dec 4 05:39 hdfs-site.xml -> ..data/hdfs-site.xml
lrwxrwxrwx 1 root root 22 Dec 4 05:39 mapred-site.xml -> ..data/mapred-site.xml
lrwxrwxrwx 1 root root 34 Dec 4 05:39 namenode-generate-script.sh -> ..data/namenode-generate-script.sh
lrwxrwxrwx 1 root root 32 Dec 4 05:39 namenode-start-service.sh -> ..data/namenode-start-service.sh
lrwxrwxrwx 1 root root 18 Dec 4 05:39 yarn-env.sh -> ..data/yarn-env.sh
root@iZ8vb7msyiau0hs41num3oZ:/hadoop-configuration#

yqwang-ms commented 5 years ago

You can submit a framework with amResource specified to control the AM resource, for example:

Note this part

  "platformSpecificParameters": {
    "amResource": {
      "cpuNumber": 1,
      "memoryMB": 1024,
      "diskType": 0,
      "diskMB": 0
    },

inside:

{
  "description": "HBaseOnYarn",
  "version": 10,
  "retryPolicy": {
    "maxRetryCount": -1,
    "fancyRetryPolicy": false
  },
  "taskRoles": {
    "HBaseMaster": {
      "taskNumber": 14,
      "priority": 15,
      "scaleUnitNumber": 16,
      "scaleUnitTimeoutSec": 17,
      "taskRetryPolicy": {
        "maxRetryCount": -2,
        "fancyRetryPolicy": false
      },
      "taskService": {
        "version": 23,
        "entryPoint": "HBaseMaster\\start.bat -Port \"10241\" -NodeNumber \"50\" -ThreadNumber \"10\" -Transport \"tcp\" -InData \"https://cosmos08.osdinfra.net/cosmos/bingads.algo.VC1/local/Aether/_5/huke/564d73f6-2647-4b15-9aae-945554856601@@@-Prod-_WoodBlocks_Feature_Extraction_V2@@@85545fbe@@@3-15-2017_03-39-27_AM/OutputPath/\" -InModel \"\" -HdfsOutDir \"hdfs://BN2:8020/YarnLauncher/output/ChanaOnYarn_LR_FTRL_application_1489560926060_6177/\" -data_format \"binary\" -max_feature_id \"1\" -time_series \"true\" -time_series_num \"1\" -minibatch \"1000\" -num_data_pass \"1\" -concurrent_sessions \"2\" -alpha \"0.05\" -beta \"0.1\" -l1 \"2\" -l2 \"0\" -output_zn \"true\" -print_status_interval_in_second \"300\" -separate_L1 \"00000000000000000000#0000FFFFFFFFFFFFFFFF#5.096858#0$00820000000000000000#0082FFFFFFFFFFFFFFFF#2.8633378#0$008A0000000000000000#008AFFFFFFFFFFFFFFFF#4.0363359#0$00910000000000000000#0091FFFFFFFFFFFFFFFF#3.06106#0$00A00000000000000000#00A0FFFFFFFFFFFFFFFF#0.49991453#0$00A50000000000000000#00A5FFFFFFFFFFFFFFFF#0.48833743#0$00A60000000000000000#00A6FFFFFFFFFFFFFFFF#0.49788499#0$00B40000000000000000#00B4FFFFFFFFFFFFFFFF#6.069901#0$00B50000000000000000#00B5FFFFFFFFFFFFFFFF#4.1982675#0$00B60000000000000000#00B6FFFFFFFFFFFFFFFF#0.7419309#0$00B70000000000000000#00B7FFFFFFFFFFFFFFFF#5.925684#0$00FF0000000000000000#00FFFFFFFFFFFFFFFFFF#4.8863711#0$04A70000000000000000#04A7FFFFFFFFFFFFFFFF#0.080657169#0$04B90000000000000000#04B9FFFFFFFFFFFFFFFF#4.6246343#0$107A0000000000000000#107AFFFFFFFFFFFFFFFF#0.49868357#0$107B0000000000000000#107BFFFFFFFFFFFFFFFF#0.45628375#0$10900000000000000000#1090FFFFFFFFFFFFFFFF#4.0900931#0$10A10000000000000000#10A1FFFFFFFFFFFFFFFF#0.49990368#0$10B30000000000000000#10B3FFFFFFFFFFFFFFFF#9.1831741#0$11050000000000000000#1105FFFFFFFFFFFFFFFF#5.2549124#0$11060000000000000000#1106FFFFFFFFFFFFFFFF#5.3109674#0$11080000000000000000#1108FFFFFFFFFFFFFFFF#5.2125664#0$110A0000000000000000#110AFFFFFFFFFFFFFFFF#4.1604204#0$110B0000000000000000#110BFFFFFFFFFFFFFFFF#5.6908121#0$110E0000000000000000#110EFFFFFFFFFFFFFFFF#6.1170983#0$11130000000000000000#1113FFFFFFFFFFFFFFFF#4.7748647#0$11160000000000000000#1116FFFFFFFFFFFFFFFF#5.2162075#0$11300000000000000000#1130FFFFFFFFFFFFFFFF#5.5385509#0$11310000000000000000#1131FFFFFFFFFFFFFFFF#9.6901684#0$11500000000000000000#1150FFFFFFFFFFFFFFFF#4.614161#0$11550000000000000000#1155FFFFFFFFFFFFFFFF#1.7929199#0$115D0000000000000000#115DFFFFFFFFFFFFFFFF#4.7363648#0$115E0000000000000000#115EFFFFFFFFFFFFFFFF#2.5856776#0$115F0000000000000000#115FFFFFFFFFFFFFFFFF#6.9741712#0$11860000000000000000#1186FFFFFFFFFFFFFFFF#4.3861442#0$11880000000000000000#1188FFFFFFFFFFFFFFFF#6.9425874#0$118B0000000000000000#118BFFFFFFFFFFFFFFFF#8.5497856#0$118F0000000000000000#118FFFFFFFFFFFFFFFFF#10.423334#0$11A20000000000000000#11A2FFFFFFFFFFFFFFFF#2.1936097#0$11A30000000000000000#11A3FFFFFFFFFFFFFFFF#4.6849184#0$11A40000000000000000#11A4FFFFFFFFFFFFFFFF#0.38887253#0$12050000000000000000#1205FFFFFFFFFFFFFFFF#5.2201838#0$12060000000000000000#1206FFFFFFFFFFFFFFFF#5.0993829#0$12080000000000000000#1208FFFFFFFFFFFFFFFF#5.109931#0$120A0000000000000000#120AFFFFFFFFFFFFFFFF#4.5182581#0$120B0000000000000000#120BFFFFFFFFFFFFFFFF#5.2759089#0$120E0000000000000000#120EFFFFFFFFFFFFFFFF#9.391984#0$12130000000000000000#1213FFFFFFFFFFFFFFFF#5.3143868#0$12160000000000000000#1216FFFFFFFFFFFFFFFF#8.7217474#0$12300000000000000000#1230FFFFFFFFFFFFFFFF#6.9777527#0$12310000000000000000#1231FFFFFFFFFFFFFFFF#8.714057#0$12500000000000000000#1250FFFFFFFFFFFFFFFF#5.4010749#0$1255000000
0000000000#1255FFFFFFFFFFFFFFFF#7.8295522#0$125D0000000000000000#125DFFFFFFFFFFFFFFFF#5.8567038#0$125E0000000000000000#125EFFFFFFFFFFFFFFFF#4.3967152#0$125F0000000000000000#125FFFFFFFFFFFFFFFFF#8.3586245#0$12860000000000000000#1286FFFFFFFFFFFFFFFF#6.2377419#0$128B0000000000000000#128BFFFFFFFFFFFFFFFF#10.007208#0$128C0000000000000000#128CFFFFFFFFFFFFFFFF#5.6566091#0$12A20000000000000000#12A2FFFFFFFFFFFFFFFF#6.605001#0$12A30000000000000000#12A3FFFFFFFFFFFFFFFF#0.55846989#0$12A40000000000000000#12A4FFFFFFFFFFFFFFFF#0.49973476#0$13050000000000000000#1305FFFFFFFFFFFFFFFF#5.2252655#0$13060000000000000000#1306FFFFFFFFFFFFFFFF#5.2523742#0$13080000000000000000#1308FFFFFFFFFFFFFFFF#5.2337542#0$130A0000000000000000#130AFFFFFFFFFFFFFFFF#4.5780139#0$130B0000000000000000#130BFFFFFFFFFFFFFFFF#4.0439329#0$130E0000000000000000#130EFFFFFFFFFFFFFFFF#2.1293952#0$13130000000000000000#1313FFFFFFFFFFFFFFFF#3.6117384#0$13160000000000000000#1316FFFFFFFFFFFFFFFF#2.5136757#0$13300000000000000000#1330FFFFFFFFFFFFFFFF#6.9141841#0$13310000000000000000#1331FFFFFFFFFFFFFFFF#6.6093621#0$13320000000000000000#1332FFFFFFFFFFFFFFFF#2.2691624#0$13500000000000000000#1350FFFFFFFFFFFFFFFF#6.6277962#0$13550000000000000000#1355FFFFFFFFFFFFFFFF#5.8656178#0$135D0000000000000000#135DFFFFFFFFFFFFFFFF#5.1534843#0$135E0000000000000000#135EFFFFFFFFFFFFFFFF#0.49973911#0$135F0000000000000000#135FFFFFFFFFFFFFFFFF#9.0955544#0$138B0000000000000000#138BFFFFFFFFFFFFFFFF#8.7416525#0$13A20000000000000000#13A2FFFFFFFFFFFFFFFF#0.47592476#0$13A30000000000000000#13A3FFFFFFFFFFFFFFFF#8.8548536#0$13A40000000000000000#13A4FFFFFFFFFFFFFFFF#6.9656701#0$14050000000000000000#1405FFFFFFFFFFFFFFFF#5.0867486#0$14060000000000000000#1406FFFFFFFFFFFFFFFF#5.03652#0$14080000000000000000#1408FFFFFFFFFFFFFFFF#5.1119409#0$140A0000000000000000#140AFFFFFFFFFFFFFFFF#4.6629124#0$140B0000000000000000#140BFFFFFFFFFFFFFFFF#7.1706429#0$140E0000000000000000#140EFFFFFFFFFFFFFFFF#3.07722#0$14110000000000000000#1411FFFFFFFFFFFFFFFF#5.553031#0$14130000000000000000#1413FFFFFFFFFFFFFFFF#7.2936106#0$14160000000000000000#1416FFFFFFFFFFFFFFFF#4.366282#0$14300000000000000000#1430FFFFFFFFFFFFFFFF#3.4097433#0$14310000000000000000#1431FFFFFFFFFFFFFFFF#4.5255518#0$14500000000000000000#1450FFFFFFFFFFFFFFFF#5.8377056#0$14550000000000000000#1455FFFFFFFFFFFFFFFF#1.0593816#0$145D0000000000000000#145DFFFFFFFFFFFFFFFF#0.56536752#0$145E0000000000000000#145EFFFFFFFFFFFFFFFF#5.5715914#0$145F0000000000000000#145FFFFFFFFFFFFFFFFF#5.1852665#0$14890000000000000000#1489FFFFFFFFFFFFFFFF#4.7883887#0$148B0000000000000000#148BFFFFFFFFFFFFFFFF#4.0083427#0$148D0000000000000000#148DFFFFFFFFFFFFFFFF#4.2748389#0$14A20000000000000000#14A2FFFFFFFFFFFFFFFF#2.5335326#0$14A30000000000000000#14A3FFFFFFFFFFFFFFFF#0.49951631#0$14A40000000000000000#14A4FFFFFFFFFFFFFFFF#0.20239893#0$15050000000000000000#1505FFFFFFFFFFFFFFFF#5.1517859#0$15060000000000000000#1506FFFFFFFFFFFFFFFF#5.0782852#0$15080000000000000000#1508FFFFFFFFFFFFFFFF#5.1046596#0$150A0000000000000000#150AFFFFFFFFFFFFFFFF#5.956018#0$150B0000000000000000#150BFFFFFFFFFFFFFFFF#5.5131192#0$150E0000000000000000#150EFFFFFFFFFFFFFFFF#7.9094634#0$15130000000000000000#1513FFFFFFFFFFFFFFFF#2.4158859#0$15160000000000000000#1516FFFFFFFFFFFFFFFF#6.2721028#0$15300000000000000000#1530FFFFFFFFFFFFFFFF#5.1649842#0$15310000000000000000#1531FFFFFFFFFFFFFFFF#10.021214#0$15500000000000000000#1550FFFFFFFFFFFFFFFF#5.9148378#0$15550000000000000000#1555FFFFFFFFFFFFFFFF#6.23703#0$155D0000000000000000#155DFFFFFFFFFFFFFFFF#5.7262163#0$155E0000000000000000#155EFFFFFFFFFFFFFFFF#5.3999515#0$155F
0000000000000000#155FFFFFFFFFFFFFFFFF#6.4454427#0$15A20000000000000000#15A2FFFFFFFFFFFFFFFF#5.5776134#0$15A30000000000000000#15A3FFFFFFFFFFFFFFFF#0.41121209#0$15A40000000000000000#15A4FFFFFFFFFFFFFFFF#7.1952128#0\"  -separate_alpha \"\" -possion_select_ratio \"1\"",
        "sourceLocations": [
          "hdfs://NameNode2-VIP.Yarn-Prod-CO4.CO4.ap.gbl:443/HBaseMaster",
          "/HBaseCommon"
        ],
        "resource": {
          "cpuNumber": 18,
          "memoryMB": 19,
          "diskType": 1,
          "diskMB": 21
        }
      }
    },
    "HBaseRegionServer": {
      "taskNumber": 24,
      "priority": 25,
      "scaleUnitNumber": 26,
      "scaleUnitTimeoutSec": 27,
      "taskRetryPolicy": {
        "maxRetryCount": -2,
        "fancyRetryPolicy": false
      },
      "taskService": {
        "version": 33,
        "entryPoint": "HBaseRegionServer/start.bat",
        "sourceLocations": [
          "hdfs://NameNode2-VIP.Yarn-Prod-CO4.CO4.ap.gbl:443/HBaseRegionServer",
          "/HBaseCommon"
        ],
        "resource": {
          "cpuNumber": 28,
          "memoryMB": 29,
          "gpuNumber": 2,
          "gpuAttribute": 3,
          "portDefinitions": {
            "ssh": {
              "count": 1
            },
            "http": {
              "count": 2
            }
          },
          "diskType": 0,
          "diskMB": 31
        }
      }
    }
  },
  "platformSpecificParameters": {
    "amResource": {
      "cpuNumber": 1,
      "memoryMB": 1024,
      "diskType": 0,
      "diskMB": 0
    },
    "amNodeLabel": null,
    "taskNodeLabel": null,
    "queue": "default",
    "containerConnectionMaxLostCount": -2,
    "containerConnectionMaxExceedCount": 2,
    "antiaffinityAllocation": true,
    "gangAllocation": false,
    "amType": 0,
    "agentUseHeartbeat": false,
    "agentHeartbeatIntervalSec": 30,
    "agentExpiryIntervalSec": 180,
    "agentUseHealthCheck": false
  }
}

For more details, see: https://github.com/Microsoft/pai/blob/master/subprojects/frameworklauncher/yarn/doc/USERMANUAL.md#put-framework
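
As a usage sketch, a framework description like the one above is sent to the Launcher WebServer with a PUT request; the host, port, and file name below are placeholders, and the exact API is documented in the user manual linked above:

# Placeholder host/port; save the framework description above as framework.json first.
curl -X PUT "http://<launcher_webserver_host>:<port>/v1/Frameworks/HBaseOnYarn" \
     -H "Content-Type: application/json" \
     -d @framework.json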

mzmssg commented 5 years ago

@jnpr-yhzhan

It's not hard-coded; it is generated according to your hardware. Your worker node has 16G, so the available memory should be 15G+; some memory is then reserved for the OS and PAI (if you are using an old version of PAI, the reserve is 12G, so the available memory will be 15G - 12G = 3G).

refer to: https://github.com/Microsoft/pai/blob/c697d14ff675ebf6322850d0a2277e1f35262a8c/src/hadoop-node-manager/deploy/hadoop-node-manager-configuration/nodemanager-generate-script.sh#L72 and https://github.com/Microsoft/pai/blob/c697d14ff675ebf6322850d0a2277e1f35262a8c/src/hadoop-node-manager/deploy/hadoop-node-manager-configuration/yarn-site.xml#L26
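
To confirm the value actually in effect, the generated yarn-site.xml can be inspected for yarn.scheduler.maximum-allocation-mb, the standard YARN property behind the maxMemory figure in the error; the container name and config path below are assumptions, adjust them to your deployment:

# Container name and path are assumptions; run this on the resource-manager host.
docker exec <resourcemanager-container> \
  grep -A1 "yarn.scheduler.maximum-allocation-mb" /usr/local/hadoop/etc/hadoop/yarn-site.xml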

mzmssg commented 5 years ago

If you want a better experience using PAI, better hardware is necessary. If you just want a quick evaluation, reducing the reserved memory would be a workaround.

jnpr-yhzhan commented 5 years ago

The workaround works, thanks @mzmssg

fanyangCS commented 5 years ago

Closing the issue as it is resolved.