intel-analytics / ipex-llm


Unable to train pytorch-based model using k8s mode #3408

Open gganduu opened 2 years ago

gganduu commented 2 years ago

My az k8s mode configuration is:

model = model_creator(None)
compute_loss = loss_creator(None)
optimizer = optim_creator(model, None)
train_loader = train_loader_creator(None, batch_size)
val_loader = val_data_creator(None, batch_size)

init_orca_context(
    cluster_mode="k8s",
    master="k8s://https://172.16.212.214:6443",
    container_image="ielym/test:az-k8s-v2",
    num_nodes=2,
    memory="30g",
    cores=8,
    conf={
        "spark.driver.host": "172.16.212.214",
        "spark.driver.port": "54323",
        "spark.kubernetes.executor.volumes.persistentVolumeClaim.nfsvolumeclaim.options.claimName": "nfsvolumeclaim",
        "spark.kubernetes.executor.volumes.persistentVolumeClaim.nfsvolumeclaim.mount.path": "/zoo/",
        "spark.kubernetes.executor.label.aztest": "1"
    }
)

est = Estimator.from_torch(model=model_creator, optimizer=optim_creator, loss=loss_creator, backend="torch_distributed")
est.fit(data=train_loader_creator, epochs=epochs, batch_size=batch_size, validation_data=val_data_creator)

When I tried to run this code, it failed with the following error:

javax.net.ssl.SSLHandshakeException: sun.security.validator.ValidatorException: PKIX path building failed: sun.security.provider.certpath.SunCertPathBuilderException: unable to find valid certification path to requested target
    at sun.security.ssl.Alerts.getSSLException(Alerts.java:192)
    at sun.security.ssl.SSLSocketImpl.fatal(SSLSocketImpl.java:1946)
    at sun.security.ssl.Handshaker.fatalSE(Handshaker.java:316)
    at sun.security.ssl.Handshaker.fatalSE(Handshaker.java:310)
    at sun.security.ssl.ClientHandshaker.serverCertificate(ClientHandshaker.java:1639)
    at sun.security.ssl.ClientHandshaker.processMessage(ClientHandshaker.java:223)
    at sun.security.ssl.Handshaker.processLoop(Handshaker.java:1037)
    at sun.security.ssl.Handshaker.process_record(Handshaker.java:965)
    at sun.security.ssl.SSLSocketImpl.readRecord(SSLSocketImpl.java:1064)
    at sun.security.ssl.SSLSocketImpl.performInitialHandshake(SSLSocketImpl.java:1367)
    at sun.security.ssl.SSLSocketImpl.startHandshake(SSLSocketImpl.java:1395)
    at sun.security.ssl.SSLSocketImpl.startHandshake(SSLSocketImpl.java:1379)
    at okhttp3.internal.connection.RealConnection.connectTls(RealConnection.java:319)
    at okhttp3.internal.connection.RealConnection.establishProtocol(RealConnection.java:283)
    at okhttp3.internal.connection.RealConnection.connect(RealConnection.java:168)
    at okhttp3.internal.connection.StreamAllocation.findConnection(StreamAllocation.java:257)
    at okhttp3.internal.connection.StreamAllocation.findHealthyConnection(StreamAllocation.java:135)
    at okhttp3.internal.connection.StreamAllocation.newStream(StreamAllocation.java:114)
    at okhttp3.internal.connection.ConnectInterceptor.intercept(ConnectInterceptor.java:42)
    at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:147)
    at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:121)
    at okhttp3.internal.cache.CacheInterceptor.intercept(CacheInterceptor.java:93)
    at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:147)
    at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:121)
    at okhttp3.internal.http.BridgeInterceptor.intercept(BridgeInterceptor.java:93)
    at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:147)
    at okhttp3.internal.http.RetryAndFollowUpInterceptor.intercept(RetryAndFollowUpInterceptor.java:126)
    at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:147)
    at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:121)
    at io.fabric8.kubernetes.client.utils.BackwardsCompatibilityInterceptor.intercept(BackwardsCompatibilityInterceptor.java:119)
    at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:147)
    at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:121)
    at io.fabric8.kubernetes.client.utils.ImpersonatorInterceptor.intercept(ImpersonatorInterceptor.java:68)
    at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:147)
    at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:121)
    at io.fabric8.kubernetes.client.utils.HttpClientUtils.lambda$createHttpClient$3(HttpClientUtils.java:112)
    at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:147)
    at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:121)
    at okhttp3.RealCall.getResponseWithInterceptorChain(RealCall.java:254)
    at okhttp3.RealCall$AsyncCall.execute(RealCall.java:200)
    at okhttp3.internal.NamedRunnable.run(NamedRunnable.java:32)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)

However, with the same PyTorch YOLOv5 code, I can train successfully in az local mode:

model = model_creator(None)
compute_loss = loss_creator(None)
optimizer = optim_creator(model, None)
train_loader = train_loader_creator(None, batch_size)
val_loader = val_data_creator(None, batch_size)

init_orca_context(cluster_mode="local", cores=8, num_nodes=1, memory='30g', init_ray_on_spark=False, object_store_memory='30g')
est = Estimator.from_torch(model=model_creator, optimizer=optim_creator, loss=loss_creator, backend="torch_distributed")

est.fit(data=train_loader_creator, epochs=epochs, batch_size=batch_size)

The k8s cluster has two nodes: one controller node and one worker node. In the same k8s environment, I can train a TF-based YOLOv3 model without error.

glorysdj commented 2 years ago

Please check whether the kubeconfig is mounted/set, and check the Spark version of the driver and of the executor image. The error may be related to the kubeconfig not being loaded correctly or to a wrong version of okhttp/kubernetes-client.
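
As a rough way to run the first check, the sketch below looks for a kubeconfig on the driver machine and reports whether it names an API server and a CA certificate. The helper names (find_kubeconfig, summarize_kubeconfig, ca_cert_conf) are hypothetical and not part of Analytics Zoo or Spark. The only real setting used is Spark's client-mode property spark.kubernetes.authenticate.caCertFile; whether passing it through the conf argument of init_orca_context resolves this particular error is an assumption, and the CA path shown is a placeholder.

```python
import os
from pathlib import Path


def find_kubeconfig(env=None, home=None):
    """Return the first existing kubeconfig path, or None if none is found."""
    env = os.environ if env is None else env
    home = Path.home() if home is None else Path(home)
    # KUBECONFIG may hold several paths separated by the OS path separator.
    candidates = [p for p in env.get("KUBECONFIG", "").split(os.pathsep) if p]
    candidates.append(str(home / ".kube" / "config"))
    for candidate in candidates:
        if Path(candidate).is_file():
            return candidate
    return None


def summarize_kubeconfig(text):
    """Plain-text scan (not a YAML parser) for server and CA entries."""
    servers = []
    has_ca = False
    for line in text.splitlines():
        stripped = line.strip().lstrip("- ").strip()
        if stripped.startswith("server:"):
            servers.append(stripped.split(":", 1)[1].strip())
        elif stripped.startswith(
            ("certificate-authority:", "certificate-authority-data:")
        ):
            has_ca = True
    return {"servers": servers, "has_ca": has_ca}


def ca_cert_conf(ca_path):
    """Build an extra Spark conf entry pointing at the cluster CA file.

    Assumption: the driver runs in client mode, where Spark reads the
    API server CA from spark.kubernetes.authenticate.caCertFile.
    """
    return {"spark.kubernetes.authenticate.caCertFile": ca_path}


if __name__ == "__main__":
    path = find_kubeconfig()
    if path is None:
        print("No kubeconfig found via KUBECONFIG or ~/.kube/config")
    else:
        print(path, summarize_kubeconfig(Path(path).read_text()))
    # Placeholder path; replace with the cluster's actual CA certificate.
    print(ca_cert_conf("/path/to/ca.crt"))
```

If the kubeconfig is missing or has no CA entry, the JVM has no way to validate the API server certificate, which is consistent with the PKIX error in the trace above. The dictionary returned by ca_cert_conf could be merged into the conf passed to init_orca_context to test that hypothesis.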

glorysdj commented 2 years ago

@gganduu have you fixed this issue?