intel-analytics / ipex-llm

Accelerate local LLM inference and finetuning (LLaMA, Mistral, ChatGLM, Qwen, Mixtral, Gemma, Phi, MiniCPM, Qwen-VL, MiniCPM-V, etc.) on Intel XPU (e.g., local PC with iGPU and NPU, discrete GPU such as Arc, Flex and Max); seamlessly integrate with llama.cpp, Ollama, HuggingFace, LangChain, LlamaIndex, vLLM, GraphRAG, DeepSpeed, Axolotl, etc
Apache License 2.0

init_orca_context() fails when cluster_mode is yarn-cluster #4620

Open sgwhat opened 2 years ago

sgwhat commented 2 years ago

When I set cluster_mode to yarn-cluster in init_orca_context() and run it as a Python script, it fails with the following error:

WARN  ScriptBasedMapping:254 - Exception running /etc/hadoop/conf.cloudera.yarn/topology.py 172.16.0.173 
ExitCodeException exitCode=1: Fatal Python error: _PyMainInterpreterConfig_Read: memory allocation failed
ValueError: character U+6374652f is not in range [U+0000; U+10ffff]

Current thread 0x00007f1287deb740 (most recent call first):

        at org.apache.hadoop.util.Shell.runCommand(Shell.java:582)
        at org.apache.hadoop.util.Shell.run(Shell.java:479)
        at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:773)
        at org.apache.hadoop.net.ScriptBasedMapping$RawScriptBasedMapping.runResolveCommand(ScriptBasedMapping.java:251)
        at org.apache.hadoop.net.ScriptBasedMapping$RawScriptBasedMapping.resolve(ScriptBasedMapping.java:188)
        at org.apache.hadoop.net.CachedDNSToSwitchMapping.resolve(CachedDNSToSwitchMapping.java:119)
        at org.apache.hadoop.yarn.util.RackResolver.coreResolve(RackResolver.java:101)
        at org.apache.hadoop.yarn.util.RackResolver.resolve(RackResolver.java:81)
        at org.apache.spark.deploy.yarn.SparkRackResolver.resolve(SparkRackResolver.scala:37)
        at org.apache.spark.deploy.yarn.YarnAllocator$$anon$1$$anonfun$run$1.apply(YarnAllocator.scala:422)
        at org.apache.spark.deploy.yarn.YarnAllocator$$anon$1$$anonfun$run$1.apply(YarnAllocator.scala:421)
        at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
        at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
        at org.apache.spark.deploy.yarn.YarnAllocator$$anon$1.run(YarnAllocator.scala:421)
sgwhat commented 2 years ago

Could you please take a look? @qiuxin2012

sgwhat commented 2 years ago

This error doesn't happen when I set cluster_mode to spark-submit and run with the spark-submit command instead.
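For reference, a yarn-cluster launch via spark-submit in this kind of setup typically packs the local conda environment and ships it to the cluster, so every YARN node has a valid Python path. This is a sketch only; the archive name, environment name, and script name below are placeholders, not taken from this issue:

```shell
# Pack the local conda environment so it can be shipped to YARN nodes:
#   conda pack -n my_env -o environment.tar.gz

# Launch in yarn-cluster mode, pointing the AM and executors at the
# unpacked archive instead of a driver-local Python path:
spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --archives environment.tar.gz#environment \
  --conf spark.yarn.appMasterEnv.PYSPARK_PYTHON=environment/bin/python \
  --conf spark.executorEnv.PYSPARK_PYTHON=environment/bin/python \
  your_script.py
```

Because the Python interpreter comes from the shipped archive, this path avoids the "No such file or directory" failure seen below when a driver-local interpreter path is propagated to the cluster.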

qiuxin2012 commented 2 years ago

I found this error message in the log:

2022-05-17 15:28:15 ERROR ApplicationMaster:91 - User class threw exception: java.io.IOException: Cannot run program "/home/manfei/anaconda3/envs/master/bin/python": error=2, No such file or directory
java.io.IOException: Cannot run program "/home/manfei/anaconda3/envs/master/bin/python": error=2, No such file or directory
    at java.lang.ProcessBuilder.start(ProcessBuilder.java:1048)
    at org.apache.spark.deploy.PythonRunner$.main(PythonRunner.scala:100)
    at org.apache.spark.deploy.PythonRunner.main(PythonRunner.scala)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at org.apache.spark.deploy.yarn.ApplicationMaster$$anon$2.run(ApplicationMaster.scala:684)
Caused by: java.io.IOException: error=2, No such file or directory
    at java.lang.UNIXProcess.forkAndExec(Native Method)
    at java.lang.UNIXProcess.<init>(UNIXProcess.java:247)
    at java.lang.ProcessImpl.start(ProcessImpl.java:134)
    at java.lang.ProcessBuilder.start(ProcessBuilder.java:1029)
    ... 7 more

2022-05-17 15:29:23 INFO ApplicationMaster:54 - Waiting for spark context initialization...
2022-05-17 15:29:23 ERROR ApplicationMaster:91 - User class threw exception: java.io.IOException: Cannot run program "/home/manfei/anaconda3/envs/master/bin/python": error=13, Permission denied
java.io.IOException: Cannot run program "/home/manfei/anaconda3/envs/master/bin/python": error=13, Permission denied
    at java.lang.ProcessBuilder.start(ProcessBuilder.java:1048)
    at org.apache.spark.deploy.PythonRunner$.main(PythonRunner.scala:100)
    at org.apache.spark.deploy.PythonRunner.main(PythonRunner.scala)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at org.apache.spark.deploy.yarn.ApplicationMaster$$anon$2.run(ApplicationMaster.scala:684)
Caused by: java.io.IOException: error=13, Permission denied
    at java.lang.UNIXProcess.forkAndExec(Native Method)
    at java.lang.UNIXProcess.<init>(UNIXProcess.java:247)
    at java.lang.ProcessImpl.start(ProcessImpl.java:134)
    at java.lang.ProcessBuilder.start(ProcessBuilder.java:1029)
    ... 7 more

qiuxin2012 commented 2 years ago

This error is caused by an incorrect PYSPARK_DRIVER_PYTHON environment variable.

qiuxin2012 commented 2 years ago

Regarding the original report (the ScriptBasedMapping warning and RackResolver stack trace quoted above):

This is only a warning: Hadoop failed to resolve the rack info of the node. It is not a blocking error; the job keeps running.

qiuxin2012 commented 2 years ago

This error is caused by an incorrect PYSPARK_DRIVER_PYTHON environment variable. PYSPARK_DRIVER_PYTHON should be ignored when using yarn-cluster mode.
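As a minimal illustration of the workaround (the variable name comes from the diagnosis above; the conda path is hypothetical), unsetting PYSPARK_DRIVER_PYTHON before initializing in yarn-cluster mode avoids pointing the cluster-side driver at a Python interpreter that only exists on the local machine:

```python
import os

# Hypothetical driver-local interpreter path. In yarn-cluster mode the driver
# runs on a YARN node, where this local conda path typically does not exist,
# producing "error=2, No such file or directory" as seen in the logs above.
os.environ["PYSPARK_DRIVER_PYTHON"] = "/home/user/anaconda3/envs/master/bin/python"

# Workaround: drop the variable before initializing with
# cluster_mode="yarn-cluster", so the cluster picks its own Python.
os.environ.pop("PYSPARK_DRIVER_PYTHON", None)
print("PYSPARK_DRIVER_PYTHON" in os.environ)  # → False
```

This only removes the variable for the current process and its children; the shell's own environment is unaffected.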