hamham223 opened 1 year ago
Update on error 2: arrays not equal
The two predicted arrays are not equal because the input DataFrames are not the same.
import os
import numpy as np
from pyspark.ml.linalg import DenseVector

rdd_map = rdd.map(lambda x: (DenseVector(np.random.randn(1, ).astype(np.float32)),
                             int(np.random.randint(0, 2, size=())),
                             os.getpid()))
df = rdd_map.toDF(["feature", "label", "pid"])  # this is lazy by default
With the config
conf = {"spark.python.worker.reuse": "false"}
sc = init_orca_context(conf=conf)
Every time a task runs, a new Python process is created; with the same random seed, df.collect() always returns the same values, and therefore the two predicts give the same result.
before_res = trainer.predict(df, feature_cols=["feature"]).collect()
expect_res = np.concatenate([part["prediction"] for part in before_res])
trainer.load(os.path.join(temp_dir, "cifar10_savemodel"))
# continuous predicting
after_res = trainer.predict(df, feature_cols=["feature"]).collect()
pred_res = np.concatenate([part["prediction"] for part in after_res])
assert np.array_equal(expect_res, pred_res)
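This determinism hinges on each fresh worker starting from the same random seed. One explicit way to get the same effect, regardless of whether workers are reused, would be to seed NumPy per partition inside the mapper. This is a sketch of my own (the generate_rows helper is hypothetical, not code from the test):

    # Sketch: seeding NumPy per partition makes the lazily evaluated random data
    # deterministic, so repeated df.collect() calls return identical rows.
    import numpy as np
    from pyspark.ml.linalg import DenseVector

    def generate_rows(split_index, iterator):
        np.random.seed(split_index)  # hypothetical: fixed seed per partition
        for _ in iterator:
            yield (DenseVector(np.random.randn(1).astype(np.float32)),
                   int(np.random.randint(0, 2)))

    deterministic_df = rdd.mapPartitionsWithIndex(generate_rows).toDF(["feature", "label"])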
Without the config
sc = init_orca_context()
Without the spark.python.worker.reuse config, Spark keeps reusing a fixed set of Python worker processes (i.e. the random numbers are always produced in the same processes), so the lazily evaluated DataFrame differs between the two predicts and thus the results are not the same.
So for the ray backend tests, as long as we can keep the DataFrame the same, the config is not needed? @sgwhat could you please look into those UTs?
https://github.com/intel-analytics/BigDL/pull/7948
This PR can serve as a reference.
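One way to keep the DataFrame identical across the two predicts, independent of spark.python.worker.reuse, could be to materialize it once. This is a sketch of my own, not necessarily what the PR above does:

    # Sketch: cache and force evaluation of the random DataFrame so that both
    # predict calls read exactly the same rows.
    df = rdd_map.toDF(["feature", "label", "pid"]).cache()
    df.count()  # materialize, fixing the randomly generated values

    before_res = trainer.predict(df, feature_cols=["feature"]).collect()
    # ...reload the model and predict on the same cached df, then compare as above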
Two errors only?
1. Collective ops are already configured.
2. Intra op parallelism cannot be modified after initialization.
Next step: classify which UT can raise which kind of error, and whether it is stable. (A minimal illustration of the second error is sketched below.)
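For reference, the second error message comes from TensorFlow's thread-pool settings: once the TF runtime context has been initialized in a (reused) worker process, the thread counts can no longer be changed. How exactly the Orca runner hits this is an assumption on my part; a minimal stand-alone trigger looks like this:

    import tensorflow as tf

    tf.constant(1.0) + tf.constant(1.0)  # any eager op initializes the TF context

    try:
        tf.config.threading.set_intra_op_parallelism_threads(2)
    except RuntimeError as e:
        print(e)  # "Intra op parallelism cannot be modified after initialization."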
This UT can hit the error "Collective ops are already configured.":
import numpy as np
import tensorflow as tf
from bigdl.orca import init_orca_context, stop_orca_context
from bigdl.orca.learn.tf2 import Estimator
from bigdl.orca import OrcaContext


def simple_model(config):
    model = tf.keras.models.Sequential([tf.keras.layers.Dense(10, input_shape=(1,)),
                                        tf.keras.layers.Dense(1)])
    return model


def multi_output_model(config):
    image_input_1 = tf.keras.Input(shape=(32, 32, 3), name="input_1")
    image_input_2 = tf.keras.Input(shape=(32, 32, 3), name="input_2")
    x1 = tf.keras.layers.Conv2D(3, 3)(image_input_1)
    x1 = tf.keras.layers.GlobalMaxPooling2D()(x1)
    x2 = tf.keras.layers.Conv2D(3, 3)(image_input_2)
    x2 = tf.keras.layers.GlobalMaxPooling2D()(x2)
    x = tf.keras.layers.concatenate([x1, x2])
    score_output = tf.keras.layers.Dense(5, name="score_output")(x)
    class_output = tf.keras.layers.Dense(5, name="class_output")(x)
    model = tf.keras.Model(
        inputs=[image_input_1, image_input_2], outputs=[score_output, class_output]
    )
    return model


def compile_args(config):
    import tensorflow as tf
    if "lr" in config:
        lr = config["lr"]
    else:
        lr = 1e-3
    args = {
        "optimizer": tf.keras.optimizers.SGD(lr),
        "loss": "mean_squared_error",
        "metrics": ["mean_squared_error"]
    }
    return args


def model_creator(config):
    model = simple_model(config)
    model.compile(**compile_args(config))
    return model


def test_dataframe_different_train_val():
    sc = OrcaContext.get_spark_context()
    rdd = sc.range(0, 100, numSlices=10)
    spark = OrcaContext.get_spark_session()

    from pyspark.ml.linalg import DenseVector
    df = rdd.map(lambda x: (DenseVector(np.random.randn(1, ).astype(np.float32)),
                            int(np.random.randint(0, 2, size=())))).toDF(["feature", "label"])
    val_rdd = sc.range(0, 20, numSlices=6)
    val_df = val_rdd.map(lambda x: (DenseVector(np.random.randn(1, ).astype(np.float32)),
                                    int(np.random.randint(0, 2, size=())))).toDF(["feature", "label"])

    config = {
        "lr": 0.2
    }
    trainer = Estimator.from_keras(
        model_creator=model_creator,
        verbose=True,
        config=config,
        workers_per_node=2,
        backend="spark")

    res = trainer.fit(df, epochs=1, batch_size=4, steps_per_epoch=25,
                      validation_data=val_df,
                      validation_steps=2,
                      feature_cols=["feature"],
                      label_cols=["label"])
    res = trainer.evaluate(val_df, batch_size=4, num_steps=25, feature_cols=["feature"],
                           label_cols=["label"])
    print("validation result: ", res)
    res = trainer.predict(df, feature_cols=["feature"]).collect()
    print("predict result: ", res)
    trainer.shutdown()


sc = init_orca_context()  # note: no extra spark.python.worker.reuse config here
test_dataframe_different_train_val()
stop_orca_context()
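For comparison, a sketch of running the same test with the extra config that the issue says these UTs currently rely on:

    # With the extra config, every Spark task gets a fresh Python worker process.
    sc = init_orca_context(conf={"spark.python.worker.reuse": "false"})
    test_dataframe_different_train_val()
    stop_orca_context()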
Oops! The ray estimator is re-using the runner.
Update:
The error "Collective ops are already configured" seems to be caused by the task_id and pid not being consistent between calls.
Traceback:
spark_runner.py, Ln 332:
print("config tf worker in pid: " + str(os.getpid()))
print(os.environ["TF_CONFIG"])
self.strategy = tf.distribute.MultiWorkerMirroredStrategy()  # get task_id from "TF_CONFIG"
⬇️
collective_all_reduce_strategy.py, Ln 464:
context.context().configure_collective_ops(
    collective_leader=multi_worker_util.collective_leader(
        cluster_spec, task_type, task_id),
    scoped_allocator_enabled_ops=("CollectiveReduce",),
    device_filters=("/job:%s/task:%d" % (task_type, task_id),))
⬇️
context.py, Line 876:
print("pid: " + str(os.getpid()))  # these prints were added by me
print("device filter to config: " + str(device_filters))
print("device filter to be configed: " + str(self._collective_device_filters))
if self._collective_leader is not None:
    if (self._collective_leader != collective_leader or
            self._collective_scoped_allocator_enabled_ops !=
            scoped_allocator_enabled_ops or
            self._collective_use_nccl_communication != use_nccl_communication or
            self._collective_device_filters != device_filters):
        print(device_filters)
        print(self._collective_device_filters)
        raise ValueError("Collective ops are already configured.")
    else:
        return
First time (called by trainer.fit):
config tf worker in pid: 2957
{"cluster": {"worker": ["ip:59429", "ip:41471"]}, "task": {"type": "worker", "index": 0}}
pid: 2957 device filter to config: ('/job:worker/task:0',)
pid: 2957 device filter to be configed: None
config tf worker in pid: 2953
{"cluster": {"worker": ["ip:59429", "ip:41471"]}, "task": {"type": "worker", "index": 1}}
pid: 2953 device filter to config: ('/job:worker/task:1',)
pid: 2953 device filter to be configed: None
Second time (called by trainer.validate):
config tf worker in pid: 2957
{"cluster": {"worker": ["ip:41973", "ip:58869"]}, "task": {"type": "worker", "index": 1}}
pid: 2957 device filter to config: ('/job:worker/task:1',)
pid: 2957 device filter to be configed: ('/job:worker/task:0',)
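Reading the two logs together: the same worker process (pid 2957) was given task index 0 by trainer.fit but task index 1 by the second call, so the equality check in context.py trips. Schematically, with the values from the logs (variable names are illustrative only):

    configured_filters = ("/job:worker/task:0",)  # set in pid 2957 during the first fit
    requested_filters = ("/job:worker/task:1",)   # requested in the same pid the second time
    if configured_filters != requested_filters:   # this is what trips in context.py
        raise ValueError("Collective ops are already configured.")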
Problem
Some TensorFlow training unit tests require an extra config in conftest.py, especially for the spark backend; some ray backend UTs also use this config.
Typical Errors
Removing the above config may cause the following kinds of errors (the two discussed above): "Collective ops are already configured." and "Intra op parallelism cannot be modified after initialization."
How-to reproduce the problem
Remove the above config in conftest.py and run the Orca pytests.
Current Solutions
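Based on the discussion above, the current workaround appears to be passing the extra Spark config when initializing the Orca context (a hedged sketch; the actual conftest.py content is not shown here):

    from bigdl.orca import init_orca_context

    conf = {"spark.python.worker.reuse": "false"}
    sc = init_orca_context(conf=conf)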