intel-analytics / ipex-llm

Accelerate local LLM inference and finetuning (LLaMA, Mistral, ChatGLM, Qwen, Baichuan, Mixtral, Gemma, Phi, MiniCPM, etc.) on Intel CPU and GPU (e.g., local PC with iGPU, discrete GPU such as Arc, Flex and Max); seamlessly integrate with llama.cpp, Ollama, HuggingFace, LangChain, LlamaIndex, GraphRAG, DeepSpeed, vLLM, FastChat, Axolotl, etc.
Apache License 2.0

RuntimeError: List process with list2cmdline failed! #9218

Open SjeYinTeoIntel opened 10 months ago

SjeYinTeoIntel commented 10 months ago

WARNING:root:Some values of column volume exceeds the mean plus/minus 10 times standard deviation, please call .repair_abnormal_data() to remove abnormal values.
                       ds      y      volume   high    low   open  id
0 2013-02-08 00:00:00.000  14.75   8407500.0  15.12  14.63  15.07   0
1 2013-02-09 10:49:36.152  14.46   8882000.0  15.01  14.26  14.89   0
2 2013-02-10 21:39:12.304  14.27   8126000.0  14.51  14.10  14.45   0
3 2013-02-12 08:28:48.456  14.66  10259500.0  14.94  14.25  14.30   0
4 2013-02-13 19:18:24.608  13.99  31879900.0  14.96  13.16  14.94   0
Initializing orca context
WARNING: sun.reflect.Reflection.getCallerClass is not supported. This will impact performance.
2023-10-18 09:44:28,023 Thread-4 WARN The bufferSize is set to 4000 but bufferedIo is false: false
2023-10-18 09:44:28,026 Thread-4 WARN The bufferSize is set to 4000 but bufferedIo is false: false
2023-10-18 09:44:28,026 Thread-4 WARN The bufferSize is set to 4000 but bufferedIo is false: false
2023-10-18 09:44:28,027 Thread-4 WARN The bufferSize is set to 4000 but bufferedIo is false: false
23-10-18 09:44:28 [Thread-4] INFO Engine$:122 - Auto detect executor number and executor cores number
23-10-18 09:44:28 [Thread-4] INFO Engine$:124 - Executor number is 1 and executor cores number is 6
23-10-18 09:44:28 [Thread-4] INFO ThreadPool$:95 - Set mkl threads to 1 on thread 19
23/10/18 09:44:28 WARN SparkContext: Using an existing SparkContext; some configuration may not take effect.
23-10-18 09:44:28 [Thread-4] INFO Engine$:461 - Find existing spark context. Checking the spark conf...
23-10-18 09:44:28 [Thread-4] WARN Engine$:467 - Engine.init: Can not find spark.shuffle.reduceLocality.enabled. For details please check https://bigdl-project.github.io/master/#APIGuide/Engine/
23-10-18 09:44:28 [Thread-4] WARN Engine$:467 - Engine.init: Can not find spark.shuffle.blockTransferService. For details please check https://bigdl-project.github.io/master/#APIGuide/Engine/
23-10-18 09:44:28 [Thread-4] WARN Engine$:467 - Engine.init: Can not find spark.scheduler.minRegisteredResourcesRatio. For details please check https://bigdl-project.github.io/master/#APIGuide/Engine/
23-10-18 09:44:28 [Thread-4] WARN Engine$:467 - Engine.init: Can not find spark.scheduler.maxRegisteredResourcesWaitingTime. For details please check https://bigdl-project.github.io/master/#APIGuide/Engine/
23-10-18 09:44:28 [Thread-4] WARN Engine$:467 - Engine.init: Can not find spark.speculation. For details please check https://bigdl-project.github.io/master/#APIGuide/Engine/
23-10-18 09:44:28 [Thread-4] WARN Engine$:467 - Engine.init: Can not find spark.serializer. For details please check https://bigdl-project.github.io/master/#APIGuide/Engine/
23-10-18 09:44:28 [Thread-4] WARN Engine$:470 - Engine.init: spark.driver.extraJavaOptions should be -Dlog4j2.info, but it is -Dcom.amazonaws.services.s3.enableV4=true. For details please check https://bigdl-project.github.io/master/#APIGuide/Engine/
cls.getname: com.intel.analytics.bigdl.dllib.utils.python.api.Sample
BigDLBasePickler registering: bigdl.dllib.utils.common Sample
cls.getname: com.intel.analytics.bigdl.dllib.utils.python.api.EvaluatedResult
BigDLBasePickler registering: bigdl.dllib.utils.common EvaluatedResult
cls.getname: com.intel.analytics.bigdl.dllib.utils.python.api.JTensor
BigDLBasePickler registering: bigdl.dllib.utils.common JTensor
cls.getname: com.intel.analytics.bigdl.dllib.utils.python.api.JActivity
BigDLBasePickler registering: bigdl.dllib.utils.common JActivity
Done init_orca_context
Launching Ray on cluster with Spark barrier mode
Start to launch ray driver
Executing command: ray start --address 10.0.226.35:62449 --num-cpus 0 --node-ip-address 10.0.255.50
2023-10-18 09:44:43,021 INFO scripts.py:747 -- Local node IP: 10.0.255.50
2023-10-18 09:44:43,125 SUCC scripts.py:755 -- --------------------
2023-10-18 09:44:43,125 SUCC scripts.py:756 -- Ray runtime started.
2023-10-18 09:44:43,125 SUCC scripts.py:757 -- --------------------
2023-10-18 09:44:43,125 INFO scripts.py:759 -- To terminate the Ray runtime, run
2023-10-18 09:44:43,125 INFO scripts.py:760 -- ray stop

2023-10-18 09:44:43,036 WARNING services.py:1816 -- WARNING: The object store is using /tmp instead of /dev/shm because /dev/shm has only 67108864 bytes available. This will harm performance! You may be able to free up space by deleting files in /dev/shm. If you are inside a Docker container, you can increase /dev/shm size by passing '--shm-size=0.92gb' to 'docker run' (or add it to the run_options list in a Ray cluster config). Make sure to set this to more than 30% of available RAM.
[2023-10-18 09:44:43,124 I 205 205] global_state_accessor.cc:360: This node has an IP address of 10.0.255.50, while we can not found the matched Raylet address. This maybe come from when you connect the Ray cluster with a different IP address or connect a container.

2023-10-18 09:44:43,021 INFO scripts.py:747 -- Local node IP: 10.0.255.50
2023-10-18 09:44:43,125 SUCC scripts.py:755 -- --------------------
2023-10-18 09:44:43,125 SUCC scripts.py:756 -- Ray runtime started.
2023-10-18 09:44:43,125 SUCC scripts.py:757 -- --------------------
2023-10-18 09:44:43,125 INFO scripts.py:759 -- To terminate the Ray runtime, run
2023-10-18 09:44:43,125 INFO scripts.py:760 -- ray stop

  File "/usr/local/lib/python3.9/dist-packages/bigdl/orca/ray/ray_daemon.py", line 26
    logging.info(f"Stopping pgid {pgid} by ray_daemon.")
    ^
SyntaxError: invalid syntax
2023-10-18 09:44:44,277 INFO worker.py:842 -- Connecting to existing Ray cluster at address: 10.0.226.35:62449
{'node_ip_address': '10.0.255.50', 'raylet_ip_address': '10.0.255.50', 'redis_address': '10.0.226.35:62449', 'object_store_address': '/tmp/ray/session_2023-10-18_09-44-32_155648_187/sockets/plasma_store', 'raylet_socket_name': '/tmp/ray/session_2023-10-18_09-44-32_155648_187/sockets/raylet', 'webui_url': '10.0.226.35:8265', 'session_dir': '/tmp/ray/session_2023-10-18_09-44-32_155648_187', 'metrics_export_port': 54656, 'node_id': '63d64a2f0bf4080aa38cb1a38eace36f89d1121623f6a8a3d7dc219e'}
ERROR:bigdl.dllib.utils.log4Error:

****Usage Error****
List process with list2cmdline failed!
ERROR:bigdl.dllib.utils.log4Error:

****Call Stack***
2023-10-18 09:44:44,421 - DataTransformation - MainThread - ERROR - Exception in processing job: 284TC20-_AutoProphet_autoprophet1_5MXI4YR Exception: List process with list2cmdline failed!
Traceback (most recent call last):
  File "/usr/local/lib/python3.9/dist-packages/bigdl/orca/ray/ray_on_spark_context.py", line 66, in kill_redundant_log_monitors
    cmdline = subprocess.list2cmdline(proc.cmdline())
  File "/usr/local/lib/python3.9/dist-packages/psutil/__init__.py", line 702, in cmdline
    return self._proc.cmdline()
  File "/usr/local/lib/python3.9/dist-packages/psutil/_pslinux.py", line 1650, in wrapper
    return fun(self, *args, **kwargs)
  File "/usr/local/lib/python3.9/dist-packages/psutil/_pslinux.py", line 1788, in cmdline
    self._raise_if_zombie()
  File "/usr/local/lib/python3.9/dist-packages/psutil/_pslinux.py", line 1693, in _raise_if_zombie
    raise ZombieProcess(self.pid, self._name, self._ppid)
psutil.ZombieProcess: PID still exists but it's a zombie (pid=311, name='python')

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/opt/easydata-app/python/operation/data_transformation.py", line 1349, in run
  File "/opt/easydata-app/python/transformation_analytics/ml_model_train_test.py", line 1086, in AutoProphet_Forecaster
  File "/usr/local/lib/python3.9/dist-packages/bigdl/chronos/autots/model/auto_prophet.py", line 112, in __init__
    self.auto_est = AutoEstimator(model_builder=model_builder,
  File "/usr/local/lib/python3.9/dist-packages/bigdl/orca/automl/auto_estimator.py", line 53, in __init__
    self.searcher = SearchEngineFactory.create_engine(
  File "/usr/local/lib/python3.9/dist-packages/bigdl/orca/automl/search/__init__.py", line 25, in create_engine
    return RayTuneSearchEngine(*args, **kwargs)
  File "/usr/local/lib/python3.9/dist-packages/bigdl/orca/automl/search/ray_tune/ray_tune_search_engine.py", line 53, in __init__
    self.remote_dir = remote_dir or RayTuneSearchEngine.get_default_remote_dir(name)
  File "/usr/local/lib/python3.9/dist-packages/bigdl/orca/automl/search/ray_tune/ray_tune_search_engine.py", line 60, in get_default_remote_dir
    ray_ctx = OrcaRayContext.get()
  File "/usr/local/lib/python3.9/dist-packages/bigdl/orca/ray/raycontext.py", line 103, in get
    ray_ctx.init()
  File "/usr/local/lib/python3.9/dist-packages/bigdl/orca/ray/raycontext.py", line 77, in init
    results = self._ray_on_spark_context.init(driver_cores=driver_cores)
  File "/usr/local/lib/python3.9/dist-packages/bigdl/orca/ray/ray_on_spark_context.py", line 605, in init
    kill_redundant_log_monitors(self._address_info["redis_address"])  # type: ignore
  File "/usr/local/lib/python3.9/dist-packages/bigdl/orca/ray/ray_on_spark_context.py", line 77, in kill_redundant_log_monitors
    invalidInputError(False, "List process with list2cmdline failed!")
  File "/usr/local/lib/python3.9/dist-packages/bigdl/dllib/utils/log4Error.py", line 33, in invalidInputError
    raise RuntimeError(errMsg)
RuntimeError: List process with list2cmdline failed!
ERROR:DataTransformation:Exception in processing job: 284TC20-_AutoProphet_autoprophet1_5MXI4YR Exception: List process with list2cmdline failed!
Traceback (most recent call last):
  File "/usr/local/lib/python3.9/dist-packages/bigdl/orca/ray/ray_on_spark_context.py", line 66, in kill_redundant_log_monitors
    cmdline = subprocess.list2cmdline(proc.cmdline())
  File "/usr/local/lib/python3.9/dist-packages/psutil/__init__.py", line 702, in cmdline
    return self._proc.cmdline()
  File "/usr/local/lib/python3.9/dist-packages/psutil/_pslinux.py", line 1650, in wrapper
    return fun(self, *args, **kwargs)
  File "/usr/local/lib/python3.9/dist-packages/psutil/_pslinux.py", line 1788, in cmdline
    self._raise_if_zombie()
  File "/usr/local/lib/python3.9/dist-packages/psutil/_pslinux.py", line 1693, in _raise_if_zombie
    raise ZombieProcess(self.pid, self._name, self._ppid)
psutil.ZombieProcess: PID still exists but it's a zombie (pid=311, name='python')

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/opt/easydata-app/python/operation/data_transformation.py", line 1349, in run
  File "/opt/easydata-app/python/transformation_analytics/ml_model_train_test.py", line 1086, in AutoProphet_Forecaster
  File "/usr/local/lib/python3.9/dist-packages/bigdl/chronos/autots/model/auto_prophet.py", line 112, in __init__
    self.auto_est = AutoEstimator(model_builder=model_builder,
  File "/usr/local/lib/python3.9/dist-packages/bigdl/orca/automl/auto_estimator.py", line 53, in __init__
    self.searcher = SearchEngineFactory.create_engine(
  File "/usr/local/lib/python3.9/dist-packages/bigdl/orca/automl/search/__init__.py", line 25, in create_engine
    return RayTuneSearchEngine(*args, **kwargs)
  File "/usr/local/lib/python3.9/dist-packages/bigdl/orca/automl/search/ray_tune/ray_tune_search_engine.py", line 53, in __init__
    self.remote_dir = remote_dir or RayTuneSearchEngine.get_default_remote_dir(name)
  File "/usr/local/lib/python3.9/dist-packages/bigdl/orca/automl/search/ray_tune/ray_tune_search_engine.py", line 60, in get_default_remote_dir
    ray_ctx = OrcaRayContext.get()
  File "/usr/local/lib/python3.9/dist-packages/bigdl/orca/ray/raycontext.py", line 103, in get
    ray_ctx.init()
  File "/usr/local/lib/python3.9/dist-packages/bigdl/orca/ray/raycontext.py", line 77, in init
    results = self._ray_on_spark_context.init(driver_cores=driver_cores)
  File "/usr/local/lib/python3.9/dist-packages/bigdl/orca/ray/ray_on_spark_context.py", line 605, in init
    kill_redundant_log_monitors(self._address_info["redis_address"])  # type: ignore
  File "/usr/local/lib/python3.9/dist-packages/bigdl/orca/ray/ray_on_spark_context.py", line 77, in kill_redundant_log_monitors
    invalidInputError(False, "List process with list2cmdline failed!")
  File "/usr/local/lib/python3.9/dist-packages/bigdl/dllib/utils/log4Error.py", line 33, in invalidInputError
    raise RuntimeError(errMsg)
RuntimeError: List process with list2cmdline failed!
2023-10-18 09:44:44,422 - DataTransformation - MainThread - INFO - Spark Session is stopped
INFO:DataTransformation:Spark Session is stopped
23/10/18 09:44:44 WARN ExecutorPodsWatchSnapshotSource: Kubernetes client has been closed.
(raylet, ip=10.0.226.35) /usr/local/lib/python3.9/dist-packages/ray/dashboard/agent.py:152: DeprecationWarning: distutils Version classes are deprecated. Use packaging.version instead.
(raylet, ip=10.0.226.35)   if LooseVersion(aiohttp.__version__) < LooseVersion("4.0.0"):
2023-10-18 09:44:44,902 ERROR import_thread.py:89 -- ImportThread: Connection closed by server.
2023-10-18 09:44:44,911 ERROR worker.py:478 -- print_logs: Connection closed by server.
2023-10-18 09:44:44,911 ERROR worker.py:1247 -- listen_error_messages_raylet: Connection closed by server.
Stopping ray_orca context
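
The traceback above shows proc.cmdline() raising psutil.ZombieProcess for pid 311 while kill_redundant_log_monitors walks the process list, and that one unreadable process aborts the whole scan. A minimal sketch of the usual way to tolerate this is to skip any process whose command line cannot be read. The classes ZombieProcess and FakeProc and the function find_matching below are hypothetical stand-ins so that the example runs without psutil; this is not the BigDL code, and it is not necessarily how the fix mentioned later in this thread works.

```python
import subprocess


class ZombieProcess(Exception):
    """Hypothetical stand-in for psutil.ZombieProcess."""


class FakeProc:
    """Hypothetical stand-in for psutil.Process: exposes pid and cmdline()."""

    def __init__(self, pid, argv=None, zombie=False):
        self.pid = pid
        self._argv = argv or []
        self._zombie = zombie

    def cmdline(self):
        # A zombie has no readable command line, so reading it raises.
        if self._zombie:
            raise ZombieProcess(f"PID still exists but it's a zombie (pid={self.pid})")
        return list(self._argv)


def find_matching(procs, needle, skip=(ZombieProcess,)):
    """Return the pids whose command line contains needle.

    Processes that raise one of the exceptions in skip are ignored
    instead of aborting the scan.
    """
    matches = []
    for proc in procs:
        try:
            cmdline = subprocess.list2cmdline(proc.cmdline())
        except skip:
            continue  # zombie or otherwise unreadable: move on
        if needle in cmdline:
            matches.append(proc.pid)
    return matches


procs = [
    FakeProc(100, ["python", "log_monitor.py", "--redis-address=10.0.226.35:62449"]),
    FakeProc(311, zombie=True),
    FakeProc(312, ["python", "worker.py"]),
]
print(find_matching(procs, "log_monitor.py"))  # [100]
```

With real psutil, the exceptions to pass as skip would typically be psutil.ZombieProcess, psutil.NoSuchProcess and psutil.AccessDenied, since any of them can occur while iterating over live processes.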

plusbang commented 10 months ago

Hi @SjeYinTeoIntel, could you provide the bigdl-chronos and psutil versions you used?

We have fixed similar errors in https://github.com/intel-analytics/BigDL/pull/9208. Please try to run pip install --pre --upgrade bigdl-chronos[pytorch, distributed].
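
One way to collect the requested versions from inside the container is to query the installed package metadata with the Python standard library. This is only a sketch: the helper name installed_versions is made up for illustration, and including ray in the list is an assumption based on it appearing in the log.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_versions(names):
    """Map each distribution name to its installed version, or None if absent."""
    report = {}
    for name in names:
        try:
            report[name] = version(name)
        except PackageNotFoundError:
            report[name] = None
    return report


# Run with the same interpreter the job uses, e.g. python3.9.
for name, ver in installed_versions(["bigdl-chronos", "psutil", "ray"]).items():
    print(f"{name}: {ver or 'not installed'}")
```

Running it with python3.9 matters here, because the Dockerfile may install more than one Python and each has its own site-packages.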

SjeYinTeoIntel commented 10 months ago

Hi, below are the libraries I installed using my Dockerfile.

RUN apt-get update && \
    apt-get install --no-install-recommends python-setuptools python-dev python3 libpq-dev gcc musl-dev postgresql-server-dev-all curl unzip -y && \
    python3.9 -m pip install --upgrade pip && \
    python3.9 -m pip install psycopg2 --no-cache-dir && \
    python3.9 -m pip install python-dateutil && \
    python3.9 -m pip install boto3 && \
    python3.9 -m pip install sparknlp && \
    python3.9 -m pip install pyarrow==9.0.0 && \
    python3.9 -m pip install pystan==3.0.0 && \
    python3.9 -m pip install prophet==1.1.3 && \
    python3.9 -m pip install pyldavis && \
    python3.9 -m pip install ydata_profiling==4.0.0 && \
    python3.9 -m pip install pandas==1.4.4 && \
    python3.9 -m pip install numpy==1.24.2 && \
    python3.9 -m pip install unidecode==1.3.6 && \
    python3.9 -m pip install matplotlib==3.7.0 && \
    python3.9 -m pip install bigdl-spark3==2.4.0b20230912 && \
    python3.9 -m pip install ray[default]==1.9.2 && \
    python3.9 -m pip install protobuf && \
    python3.9 -m pip install --ignore-installed PyYAML --pre --upgrade bigdl-chronos[pytorch]==2.4.0b20230912 && \
    python3.9 -m pip install kmodes && \
    python3.9 -m pip install tabulate && \
    rm -r /root/.cache && \
    rm -rf /var/lib/apt/lists/*

I updated to bigdl-chronos[pytorch, distributed] as you suggested, but I am still facing the same issue.

RUN apt-get update && \
    apt-get install --no-install-recommends python-setuptools python-dev python3 libpq-dev gcc musl-dev postgresql-server-dev-all curl unzip -y && \
    python3.9 -m pip install --upgrade pip && \
    python3.9 -m pip install psycopg2 --no-cache-dir && \
    python3.9 -m pip install python-dateutil && \
    python3.9 -m pip install boto3 && \
    python3.9 -m pip install sparknlp && \
    python3.9 -m pip install pyarrow==9.0.0 && \
    python3.9 -m pip install pystan==3.0.0 && \
    python3.9 -m pip install prophet==1.1.3 && \
    python3.9 -m pip install pyldavis && \
    python3.9 -m pip install ydata_profiling==4.0.0 && \
    python3.9 -m pip install pandas==1.4.4 && \
    python3.9 -m pip install numpy==1.24.2 && \
    python3.9 -m pip install unidecode==1.3.6 && \
    python3.9 -m pip install matplotlib==3.7.0 && \
    python3.9 -m pip install bigdl-spark3==2.4.0b20230912 && \
    python3.9 -m pip install ray[default]==1.9.2 && \
    python3.9 -m pip install protobuf && \
    python3.9 -m pip install --ignore-installed PyYAML --pre --upgrade bigdl-chronos[pytorch, distributed]==2.4.0b20230912 && \
    python3.9 -m pip install kmodes && \
    python3.9 -m pip install tabulate && \
    rm -r /root/.cache && \
    rm -rf /var/lib/apt/lists/*

Regards, sjeyin

plusbang commented 10 months ago

Hi @SjeYinTeoIntel, based on the command you provided, I think you need to use python3.9 -m pip install --ignore-installed PyYAML --pre --upgrade bigdl-chronos[pytorch,distributed] instead of python3.9 -m pip install --ignore-installed PyYAML --pre --upgrade bigdl-chronos[pytorch, distributed]==2.4.0b20230912. Otherwise you will still get version 2.4.0b20230912, because the ==2.4.0b20230912 specifier pins the package to that exact version and --pre --upgrade cannot move past it.
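
The two commands differ in two ways: the exact == pin, and the space inside the extras list. A rough way to see both is to split each command the way a POSIX shell would and look at the requirement arguments pip receives. The helpers requirement_args and has_exact_pin are hypothetical names for this sketch, and shlex only approximates shell word splitting (it does no globbing), so treat the result as illustrative.

```python
import shlex


def requirement_args(command):
    """Return the non-option arguments that follow 'install' in a pip command."""
    tokens = shlex.split(command)
    start = tokens.index("install") + 1
    return [tok for tok in tokens[start:] if not tok.startswith("-")]


def has_exact_pin(args):
    """True if any requirement argument carries an exact '==' version pin."""
    return any("==" in arg for arg in args)


pinned = (
    "python3.9 -m pip install --ignore-installed PyYAML --pre --upgrade "
    "bigdl-chronos[pytorch, distributed]==2.4.0b20230912"
)
suggested = (
    "python3.9 -m pip install --ignore-installed PyYAML --pre --upgrade "
    "bigdl-chronos[pytorch,distributed]"
)

print(requirement_args(pinned))
# ['PyYAML', 'bigdl-chronos[pytorch,', 'distributed]==2.4.0b20230912']
print(requirement_args(suggested))
# ['PyYAML', 'bigdl-chronos[pytorch,distributed]']
print(has_exact_pin(requirement_args(pinned)), has_exact_pin(requirement_args(suggested)))
# True False
```

In the unquoted form with a space, the extras list is split across two arguments, so writing it without the space (or quoting the whole requirement) keeps it as a single argument. Whether the space was present in the actual Dockerfile or only in the pasted text is not clear from the thread.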