amardeepjaiman opened this issue 2 years ago
@qiuxin2012 could you help take a look at this issue?
Same issue as https://github.com/intel-analytics/BigDL/issues/4800?
@amardeepjaiman Since the notebook is written for Colab, you must have made quite a few changes to run it elsewhere. Could you tell us the detailed steps you followed, including how you set up the environment?
I am using an Azure Databricks environment and executing it in a notebook. To set up the environment I use an init shell script so that the required environment and dependency jar files are available on all the nodes. Please find the content of the shell script below:
export JAVA_HOME='/usr/lib/jvm/java-8-openjdk-amd64'
export PYTHONPATH='/databricks/python3'
/databricks/python/bin/pip install bigdl-spark3
/databricks/python/bin/pip install torch==1.11.0+cpu torchvision==0.12.0+cpu six cloudpickle argparse tqdm matplotlib tensorboard -f https://download.pytorch.org/whl/torch_stable.html
JAVA_HOME='/usr/lib/jvm/java-8-openjdk-amd64' /databricks/python/bin/pip install jep
I am trying to run the BigDL code in a Databricks notebook, not from the command line. Please let me know if you need more information.
@PatrickkZ Please try to reproduce the error, or find the right way to run the notebook.
@amardeepjaiman, hi, we have reproduced the same error on Databricks. We are looking for a way to solve this problem and will let you know when we have more information.
I am hitting the same error.
How long will it take?
Layer info: TorchModel[5d5e341e]
jep.JepException: java.util.concurrent.TimeoutException: Futures timed out after [100 seconds]
at com.intel.analytics.bigdl.orca.utils.PythonInterpreter$.threadExecute(PythonInterpreter.scala:98)
at com.intel.analytics.bigdl.orca.utils.PythonInterpreter$.exec(PythonInterpreter.scala:108)
at com.intel.analytics.bigdl.orca.net.TorchModel.updateOutput(TorchModel.scala:131)
at com.intel.analytics.bigdl.dllib.nn.abstractnn.AbstractModule.forward(AbstractModule.scala:283)
at com.intel.analytics.bigdl.dllib.optim.DistriOptimizer$$anonfun$4$$anonfun$5$$anonfun$apply$2.apply$mcI$sp(DistriOptimizer.scala:272)
at com.intel.analytics.bigdl.dllib.optim.DistriOptimizer$$anonfun$4$$anonfun$5$$anonfun$apply$2.apply(DistriOptimizer.scala:263)
at com.intel.analytics.bigdl.dllib.optim.DistriOptimizer$$anonfun$4$$anonfun$5$$anonfun$apply$2.apply(DistriOptimizer.scala:263)
at com.intel.analytics.bigdl.dllib.utils.ThreadPool$$anonfun$1$$anon$5.call(ThreadPool.scala:160)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Caused by: java.util.concurrent.TimeoutException: Futures timed out after [100 seconds]
at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:223)
at scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:227)
at scala.concurrent.Await$$anonfun$result$1.apply(package.scala:190)
at scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53)
at scala.concurrent.Await$.result(package.scala:190)
at com.intel.analytics.bigdl.orca.utils.PythonInterpreter$$anonfun$5.apply(PythonInterpreter.scala:91)
at com.intel.analytics.bigdl.orca.utils.PythonInterpreter$$anonfun$5.apply(PythonInterpreter.scala:90)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186)
at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:186)
at com.intel.analytics.bigdl.orca.utils.PythonInterpreter$.threadExecute(PythonInterpreter.scala:90)
... 11 more
at com.intel.analytics.bigdl.dllib.nn.abstractnn.AbstractModule.forward(AbstractModule.scala:289)
at com.intel.analytics.bigdl.dllib.optim.DistriOptimizer$$anonfun$4$$anonfun$5$$anonfun$apply$2.apply$mcI$sp(DistriOptimizer.scala:272)
at com.intel.analytics.bigdl.dllib.optim.DistriOptimizer$$anonfun$4$$anonfun$5$$anonfun$apply$2.apply(DistriOptimizer.scala:263)
at com.intel.analytics.bigdl.dllib.optim.DistriOptimizer$$anonfun$4$$anonfun$5$$anonfun$apply$2.apply(DistriOptimizer.scala:263)
at com.intel.analytics.bigdl.dllib.utils.ThreadPool$$anonfun$1$$anon$5.call(ThreadPool.scala:160)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
@amardeepjaiman @xbinglzh I failed to get the jep backend PyTorch estimator working, but I ran the pyspark backend PyTorch estimator successfully. See the example https://github.com/intel-analytics/BigDL/blob/v2.0.0/python/orca/example/learn/pytorch/fashion_mnist/fashion_mnist.py. You can try it with the following configuration. Databricks init script:
#!/bin/bash
apt-get install openjdk-8-jdk-headless -qq > /dev/null
python -c "import os;os.environ['JAVA_HOME'] = '/usr/lib/jvm/java-8-openjdk-amd64'"
update-alternatives --set java /usr/lib/jvm/java-8-openjdk-amd64/jre/bin/java
export JAVA_HOME='/usr/lib/jvm/java-8-openjdk-amd64'
/databricks/python/bin/pip install numpy==1.22.3
/databricks/python/bin/pip install bigdl-orca-spark3 tqdm
/databricks/python/bin/pip install torch==1.11.0+cpu torchvision==0.12.0+cpu tensorboard -f https://download.pytorch.org/whl/torch_stable.html
/databricks/python/bin/pip install cloudpickle
cp /databricks/python/lib/python3.8/site-packages/bigdl/share/*/lib/*.jar /databricks/jars
Databricks Spark conf (spark.executor.cores and spark.cores.max should match your cluster; mine is a single executor with 4 cores, and a two-worker variant is sketched after the conf below):
spark.driver.extraLibraryPath /databricks/python3/lib/
spark.cores.max 4
spark.executor.extraLibraryPath /databricks/python3/lib/
spark.executor.cores 4
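For comparison, if your cluster instead has two 4-core workers, I would expect the corresponding values to follow the same pattern (my assumption, extrapolated from the single-executor setup above, not something tested in this configuration):
spark.executor.cores 4
spark.cores.max 8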
You need to delete the argument parser in the notebook and use the following arguments instead:
cluster_mode = "spark-submit"
runtime = "spark"
address=""
backend="spark"
batch_size=4
epochs=2
data_dir="./data"
download=True
You can use the code below in your notebook directly:
from __future__ import print_function
import os
import argparse
import numpy as np
import matplotlib.pyplot as plt

import torch
import torchvision
import torchvision.transforms as transforms
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
from torch.utils.tensorboard import SummaryWriter

from bigdl.orca import init_orca_context, stop_orca_context
from bigdl.orca.learn.pytorch import Estimator
from bigdl.orca.learn.metrics import Accuracy
from bigdl.orca.learn.trigger import EveryEpoch


def train_data_creator(config={}, batch_size=4, download=True, data_dir='./data'):
    transform = transforms.Compose(
        [transforms.ToTensor(),
         transforms.Normalize((0.5,), (0.5,))])
    trainset = torchvision.datasets.FashionMNIST(root=data_dir,
                                                 download=download,
                                                 train=True,
                                                 transform=transform)
    trainloader = torch.utils.data.DataLoader(trainset, batch_size=batch_size,
                                              shuffle=True, num_workers=0)
    return trainloader


def validation_data_creator(config={}, batch_size=4, download=True, data_dir='./data'):
    transform = transforms.Compose(
        [transforms.ToTensor(),
         transforms.Normalize((0.5,), (0.5,))])
    testset = torchvision.datasets.FashionMNIST(root=data_dir, train=False,
                                                download=download, transform=transform)
    testloader = torch.utils.data.DataLoader(testset, batch_size=batch_size,
                                             shuffle=False, num_workers=0)
    return testloader


# helper function to show an image
def matplotlib_imshow(img, one_channel=False):
    if one_channel:
        img = img.mean(dim=0)
    img = img / 2 + 0.5  # unnormalize
    npimg = img.numpy()
    if one_channel:
        plt.imshow(npimg, cmap="Greys")
    else:
        plt.imshow(np.transpose(npimg, (1, 2, 0)))


class Net(nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        self.conv1 = nn.Conv2d(1, 6, 5)
        self.pool = nn.MaxPool2d(2, 2)
        self.conv2 = nn.Conv2d(6, 16, 5)
        self.fc1 = nn.Linear(16 * 4 * 4, 120)
        self.fc2 = nn.Linear(120, 84)
        self.fc3 = nn.Linear(84, 10)

    def forward(self, x):
        x = self.pool(F.relu(self.conv1(x)))
        x = self.pool(F.relu(self.conv2(x)))
        x = x.view(-1, 16 * 4 * 4)
        x = F.relu(self.fc1(x))
        x = F.relu(self.fc2(x))
        x = self.fc3(x)
        return x


def model_creator(config):
    model = Net()
    return model


def optimizer_creator(model, config):
    optimizer = optim.SGD(model.parameters(), lr=0.001, momentum=0.9)
    return optimizer


def main():
    # hard-coded arguments, replacing the argument parser used by the original script
    cluster_mode = "spark-submit"
    runtime = "spark"
    address = ""
    backend = "spark"
    batch_size = 4
    epochs = 2
    data_dir = "./data"
    download = True

    if runtime == "ray":
        init_orca_context(runtime=runtime, address=address)
    else:
        if cluster_mode == "local":
            init_orca_context()
        elif cluster_mode.startswith("yarn"):
            init_orca_context(cluster_mode=cluster_mode, cores=4, num_nodes=2)
        elif cluster_mode == "spark-submit":
            init_orca_context(cluster_mode=cluster_mode)

    tensorboard_dir = data_dir + "runs"
    # writer is used below to log the per-epoch training loss
    writer = SummaryWriter(tensorboard_dir)

    # constant for classes
    classes = ('T-shirt/top', 'Trouser', 'Pullover', 'Dress', 'Coat',
               'Sandal', 'Shirt', 'Sneaker', 'Bag', 'Ankle Boot')

    # plot some random training images
    dataiter = iter(train_data_creator(config={}, batch_size=4,
                                       download=download, data_dir=data_dir))
    images, labels = next(dataiter)
    # create grid of images
    img_grid = torchvision.utils.make_grid(images)
    # show images
    matplotlib_imshow(img_grid, one_channel=True)

    # training loss vs. epochs
    criterion = nn.CrossEntropyLoss()

    if backend == "bigdl":
        train_loader = train_data_creator(config={}, batch_size=4,
                                          download=download, data_dir=data_dir)
        test_loader = validation_data_creator(config={}, batch_size=4,
                                              download=download, data_dir=data_dir)
        net = model_creator(config={})
        optimizer = optimizer_creator(model=net, config={"lr": 0.001})
        orca_estimator = Estimator.from_torch(model=net,
                                              optimizer=optimizer,
                                              loss=criterion,
                                              metrics=[Accuracy()],
                                              backend="bigdl")
        orca_estimator.set_tensorboard(tensorboard_dir, "bigdl")
        orca_estimator.fit(data=train_loader, epochs=epochs, validation_data=test_loader,
                           checkpoint_trigger=EveryEpoch())
        res = orca_estimator.evaluate(data=test_loader)
        print("Accuracy of the network on the test images: %s" % res)
    elif backend in ["ray", "spark"]:
        orca_estimator = Estimator.from_torch(model=model_creator,
                                              optimizer=optimizer_creator,
                                              loss=criterion,
                                              metrics=[Accuracy()],
                                              model_dir=os.getcwd(),
                                              use_tqdm=True,
                                              backend=backend)
        stats = orca_estimator.fit(train_data_creator, epochs=epochs, batch_size=batch_size)
        for stat in stats:
            writer.add_scalar("training_loss", stat['train_loss'], stat['epoch'])
        print("Train stats: {}".format(stats))
        val_stats = orca_estimator.evaluate(validation_data_creator, batch_size=batch_size)
        print("Validation stats: {}".format(val_stats))
        orca_estimator.shutdown()
    else:
        raise NotImplementedError("Only bigdl, ray, and spark are supported "
                                  "as the backend, but got {}".format(backend))

    stop_orca_context()


main()
OK, let me try it and get back to you.
Hi @qiuxin2012 ,
I tried to use the init script you shared, but I am getting an init script failure while starting the Databricks cluster. Which Databricks runtime version are you using? Please check the attached error snapshot and cluster configuration.
Mine is 9.1 LTS (includes Spark 3.1.2, Scala 2.12). See the image below.
@qiuxin2012
The cluster is up with the init script. When I run the given source code with the Spark backend, training seems to start, but I get the following error about the model save directory in the save_pkl function.
Py4JJavaError Traceback (most recent call last)
Hi @qiuxin2012 ,
I was able to solve the above issue using the latest nightly build of bigdl-spark3 from the BigDL repos. Training now runs with the above configuration, where I have 1 min worker (with 4 cores) assigned and training runs on a single worker node. If I change the min workers in the Databricks cluster configuration to 2, I have 2 worker nodes with 4 cores each, so I changed the Spark configuration from spark.cores.max 4 to spark.cores.max 8. Ideally training should now run distributed across both worker nodes and use all 8 cores, but I get an exception while running this. Please find the stack trace below.
Py4JJavaError Traceback (most recent call last)
@amardeepjaiman Sorry for the late response. We have reproduced your new error; I will inform you when we find a solution. The error happens when the number of executors is >= 2.
@amardeepjaiman, hi, you can fix this by adding an environment variable, GLOO_SOCKET_IFNAME. Execute !ifconfig in the notebook and set GLOO_SOCKET_IFNAME to your first Ethernet interface (mine is eth0). This works for me when I have 2 workers.
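For reference, one minimal way to set it from the notebook before calling init_orca_context (a sketch; "eth0" is just the interface name on my cluster, use whatever !ifconfig prints on yours):

import os

# Point Gloo at the right network interface; "eth0" is an assumption from my cluster.
os.environ["GLOO_SOCKET_IFNAME"] = "eth0"

If the variable also needs to be visible on the executors, passing it through the spark.executorEnv.GLOO_SOCKET_IFNAME Spark conf should work as well (my assumption for this setup, not something verified in this thread).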
Here is my init script:
# use the latest version of orca
/databricks/python/bin/pip install --pre --upgrade bigdl-orca-spark3
/databricks/python/bin/pip install tqdm
/databricks/python/bin/pip install torch==1.11.0+cpu torchvision==0.12.0+cpu tensorboard -f https://download.pytorch.org/whl/torch_stable.html
/databricks/python/bin/pip install cloudpickle
cp /databricks/python/lib/python3.8/site-packages/bigdl/share/*/lib/*.jar /databricks/jars
As for the model_dir problem, you can just leave it as None:
elif backend in ["ray", "spark"]:
    orca_estimator = Estimator.from_torch(model=model_creator,
                                          optimizer=optimizer_creator,
                                          loss=criterion,
                                          metrics=[Accuracy()],
                                          model_dir=None,
                                          use_tqdm=True,
                                          backend=backend)
If model_dir is not None, it should be a path starting with /dbfs or dbfs:, but that won't work until this PR is merged, so you can try it later; for now, just leave it as None.
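Once that is merged, I would expect the call to look roughly like this (a sketch only; the /dbfs/FileStore/... path is a hypothetical placeholder, not something from this thread):

# Hypothetical sketch: persist the model to DBFS once dbfs paths are supported.
orca_estimator = Estimator.from_torch(model=model_creator,
                                      optimizer=optimizer_creator,
                                      loss=criterion,
                                      metrics=[Accuracy()],
                                      model_dir="/dbfs/FileStore/fashion_mnist",  # hypothetical path
                                      use_tqdm=True,
                                      backend=backend)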
Hi, I am trying to run the Fashion MNIST sample code from the BigDL repo on an Azure Databricks Spark cluster environment. The sample code is here: https://github.com/intel-analytics/BigDL/blob/main/python/orca/colab-notebook/examples/fashion_mnist_bigdl.ipynb
Cluster Configuration:
I have 1 Azure D4_v5-based driver node and 2 Azure Standard D4_v5-based worker nodes set up in my Spark cluster. Azure Databricks Runtime: 9.1 LTS ML (Scala 2.12, Spark 3.1.2)
The Spark configuration is below:
spark.executorEnv.PYTHONHOME /databricks/python3/lib/python3.8
spark.serializer org.apache.spark.serializer.JavaSerializer
spark.executorEnv.KMP_BLOCKTIME 0
spark.databricks.delta.preview.enabled true
spark.rpc.message.maxSize 2047
spark.executor.cores 3
spark.executor.memory 8g
spark.files.fetchTimeout 100000s
spark.network.timeout 100000s
spark.databricks.conda.condaMagic.enabled true
spark.driver.memory 8g
spark.scheduler.minRegisteredResourcesRatio 1.0
spark.scheduler.maxRegisteredResourcesWaitingTime 60s
spark.executor.heartbeatInterval 1000000
spark.cores.max 6
spark.default.parallelism 1000
spark.executorEnv.OMP_NUM_THREADS 1
spark.driver.cores 3
I create the estimator using:
orca_estimator = Estimator.from_torch(model=net,
                                      optimizer=optimizer,
                                      loss=criterion,
                                      metrics=[Accuracy()],
                                      backend="bigdl")
and I get an exception on the following line:
from bigdl.orca.learn.trigger import EveryEpoch
orca_estimator.fit(data=trainloader, epochs=epochs, validation_data=testloader,
                   checkpoint_trigger=EveryEpoch())
Please find below the full stack trace of the error I am getting:
jep.JepException: java.util.concurrent.TimeoutException: Futures timed out after [100 seconds]
at com.intel.analytics.bigdl.orca.utils.PythonInterpreter$.threadExecute(PythonInterpreter.scala:98)
at com.intel.analytics.bigdl.orca.utils.PythonInterpreter$.createInterpreter(PythonInterpreter.scala:82)
at com.intel.analytics.bigdl.orca.utils.PythonInterpreter$.init(PythonInterpreter.scala:63)
at com.intel.analytics.bigdl.orca.utils.PythonInterpreter$.check(PythonInterpreter.scala:56)
at com.intel.analytics.bigdl.orca.utils.PythonInterpreter$.exec(PythonInterpreter.scala:104)
at com.intel.analytics.bigdl.orca.net.PythonFeatureSet$.$anonfun$loadPythonSet$1(PythonFeatureSet.scala:90)
at org.apache.spark.rdd.RDD.$anonfun$mapPartitions$2(RDD.scala:868)
at org.apache.spark.rdd.RDD.$anonfun$mapPartitions$2$adapted(RDD.scala:868)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:60)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:380)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:344)
at org.apache.spark.scheduler.ResultTask.$anonfun$runTask$3(ResultTask.scala:75)
at com.databricks.spark.util.ExecutorFrameProfiler$.record(ExecutorFrameProfiler.scala:110)
at org.apache.spark.scheduler.ResultTask.$anonfun$runTask$1(ResultTask.scala:75)
at com.databricks.spark.util.ExecutorFrameProfiler$.record(ExecutorFrameProfiler.scala:110)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:55)
at org.apache.spark.scheduler.Task.doRunTask(Task.scala:150)
at org.apache.spark.scheduler.Task.$anonfun$run$1(Task.scala:119)
at com.databricks.spark.util.ExecutorFrameProfiler$.record(ExecutorFrameProfiler.scala:110)
at org.apache.spark.scheduler.Task.run(Task.scala:91)
at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$13(Executor.scala:813)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1657)
at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$4(Executor.scala:816)
at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
at com.databricks.spark.util.ExecutorFrameProfiler$.record(ExecutorFrameProfiler.scala:110)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:672)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Caused by: java.util.concurrent.TimeoutException: Futures timed out after [100 seconds]
at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:259)
at scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:263)
at scala.concurrent.Await$.$anonfun$result$1(package.scala:220)
at scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:57)
at scala.concurrent.Await$.result(package.scala:146)
at com.intel.analytics.bigdl.orca.utils.PythonInterpreter$.$anonfun$threadExecute$2(PythonInterpreter.scala:91)
at scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:238)
at scala.collection.IndexedSeqOptimized.foreach(IndexedSeqOptimized.scala:36)
at scala.collection.IndexedSeqOptimized.foreach$(IndexedSeqOptimized.scala:33)
at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:198)
at scala.collection.TraversableLike.map(TraversableLike.scala:238)
at scala.collection.TraversableLike.map$(TraversableLike.scala:231)
at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:198)
at com.intel.analytics.bigdl.orca.utils.PythonInterpreter$.threadExecute(PythonInterpreter.scala:90)
... 28 more
org.apache.spark.rdd.RDD.count(RDD.scala:1263)
com.intel.analytics.bigdl.orca.net.PythonFeatureSet$.loadPythonSet(PythonFeatureSet.scala:86)
com.intel.analytics.bigdl.orca.net.PythonFeatureSet.<init>(PythonFeatureSet.scala:168)
com.intel.analytics.bigdl.orca.net.PythonFeatureSet$.python(PythonFeatureSet.scala:61)
com.intel.analytics.bigdl.orca.net.python.PythonZooNet.createFeatureSetFromPyTorch(PythonZooNet.scala:283)
sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
java.lang.reflect.Method.invoke(Method.java:498)
py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:380)
py4j.Gateway.invoke(Gateway.java:295)
py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
py4j.commands.CallCommand.execute(CallCommand.java:79)
py4j.GatewayConnection.run(GatewayConnection.java:251)
java.lang.Thread.run(Thread.java:748)
Please let me know if someone has already faced this issue in the past.
I am also requesting support from the official BigDL team on this issue, as I want to use the BigDL library for distributed deep learning training on my Spark cluster.
Thanks in advance.