Closed mzhaoshuai closed 4 years ago
The repo https://github.com/rishizek/tensorflow-deeplab-v3 also uses `tf.dynamic_partition`.
@mzhaoshuai When I run deeplabv3's evaluate.py, I get the following error: tensorflow.python.framework.errors_impl.InvalidArgumentError: Number of ways to split should evenly divide the split dimension, but got split_dim 2 (size=1) and num_split 3 [[{{node split}} = Split[T=DT_FLOAT, num_split=3, _device="/device:CPU:0"](split/split_dim, ToFloat)]] [{{node IteratorGetNext}} = IteratorGetNextoutput_shapes=[[?,?,?,3],[?,?,?,1]], output_types=[DT_FLOAT,DT_INT32], _device="/job:localhost/replica:0/task:0/device:CPU:0"]] Could you help me figure out what is going on? Thanks.
@chenhust1995
If you want to communicate on GitHub, please write in English.
I do not know what is wrong with your program. If you did not change the source code, it should work fine.
From the error message, my guess is that you wrote a multi-GPU version of the code and added a split op somewhere in the input handling. You may want to check the corresponding code.
There are also many similar questions; you can search for the error message. See, for example, https://github.com/tensorflow/tensor2tensor/issues/266
I thought you were Chinese. I am also from Huazhong University of Science and Technology, so I wrote to you in Chinese. I used this model with the COCO-Stuff dataset. Creating the TFRecord dataset and training both work, but the error occurs when I evaluate the model.
@chenhust1995 Sorry for the late reply; I was busy today. I can tell from your nickname that we are alumni of the same university.
Have you solved the problem?
[[{{node split}} = Split[T=DT_FLOAT, num_split=3, _device="/device:CPU:0"](split/split_dim, ToFloat)]]
[{{node IteratorGetNext}} = IteratorGetNextoutput_shapes=[[?,?,?,3],[?,?,?,1]], output_types=[DT_FLOAT,DT_INT32], _device="/job:localhost/replica:0/task:0/device:CPU:0"]]
The message suggests that a split op is placed on the CPU device and that it is related to the iterator.
The iterator is created when you feed data to the model.
Can you check the input pipeline code?
You can also check whether there are any dirty records in your test TFRecord.
I do not have your code, so these suggestions are all I can give.
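The error text says that dimension 2, which has size 1, was asked to split into 3 parts. The following is a minimal pure-Python sketch of that divisibility rule, useful for scanning decoded image shapes before they reach a split op. The helper names `can_split_evenly` and `find_bad_shapes` are hypothetical, the shapes are made up, and the idea that a single-channel image reached a 3-way channel split is only an assumption; nothing in this thread confirms it.

```python
def can_split_evenly(shape, axis, num_split):
    """Return True if dimension `axis` of `shape` divides evenly into `num_split` parts."""
    return shape[axis] % num_split == 0


def find_bad_shapes(shapes, axis=2, num_split=3):
    """Return (index, shape) pairs that would fail an even split along `axis`."""
    return [(i, s) for i, s in enumerate(shapes)
            if not can_split_evenly(s, axis, num_split)]


# Hypothetical decoded image shapes: (height, width, channels).
shapes = [(321, 321, 3), (480, 640, 1), (513, 513, 3)]

bad = find_bad_shapes(shapes)
print(bad)  # lists every record whose channel dimension cannot be split 3 ways
```

If such a scan over the evaluation records reports any entries, those records are candidates for the "dirty records" mentioned above.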
I tried to implement a multi-GPU version of your code and found that `tf.dynamic_partition` may cause a NaN loss when used with multiple GPUs, although it behaves normally with a single GPU. I reported this bug to TensorFlow; a reproducible example can be found at tensorflow/tensorflow#23918.
So if anyone tries to use multiple GPUs, remember to replace `tf.dynamic_partition` with `tf.boolean_mask`.
Finally, thanks to the owner for the work!
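As a rough illustration of why this swap keeps the same data, the sketch below re-implements the documented semantics of the two ops on plain Python lists: partition 1 of a two-way dynamic partition holds the same rows, in the same order, as a boolean mask built from the same condition. The function names are list-based stand-ins, and `num_classes = 21` and the ignore label 255 are assumed values. This is not TensorFlow code and it does not reproduce the multi-GPU NaN itself.

```python
def dynamic_partition(data, partitions, num_partitions):
    """List-based stand-in for tf.dynamic_partition: route each row to its partition."""
    out = [[] for _ in range(num_partitions)]
    for item, p in zip(data, partitions):
        out[p].append(item)
    return out


def boolean_mask(data, mask):
    """List-based stand-in for tf.boolean_mask: keep rows where mask is True."""
    return [item for item, keep in zip(data, mask) if keep]


num_classes = 21                      # assumed value
labels = [0, 255, 3, 20, 255, 7]      # 255 marks pixels to ignore (assumed)
logits = [[0.1], [0.2], [0.3], [0.4], [0.5], [0.6]]  # one row per pixel

# 1 for pixels with a real class label, 0 for ignored pixels.
valid = [int(label <= num_classes - 1) for label in labels]

via_partition = dynamic_partition(logits, valid, num_partitions=2)[1]
via_mask = boolean_mask(logits, [bool(v) for v in valid])
print(via_partition == via_mask)
```

In TensorFlow terms, this corresponds to replacing `tf.dynamic_partition(x, valid, num_partitions=2)[1]` with `tf.boolean_mask(x, tf.cast(valid, tf.bool))`.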
Hello, I am about to run this code on two GPUs, but it is quite hard for me. Could I have your code? Thanks a lot.
@SkeletonOne Sorry, I cannot share my code right now, but I can describe how I implemented it.
One simple way is to use the `tf.contrib.estimator.replicate_model_fn` function. You can find some help in
https://stackoverflow.com/questions/47223766/how-to-run-tensorflow-estimator-on-multiple-gpus-with-data-parallelism/47599805
https://www.tensorflow.org/api_docs/python/tf/contrib/estimator/replicate_model_fn
You need to replace `tf.dynamic_partition` with `tf.boolean_mask`.
The other way is to place the variables on devices manually. You can find some help in
https://github.com/tensorflow/models/blob/master/tutorials/image/cifar10_estimator/cifar10_main.py
https://github.com/tensorflow/models/blob/master/official/resnet/imagenet_main.py
https://www.tensorflow.org/api_docs/python/tf/Variable
A code snippet can be found in https://github.com/tensorflow/tensorflow/issues/23918 ; you can structure your model function like the snippet in that issue.
Try the first way first; it only requires changing a few lines of code. If it does not work, try the second way.
Also, the distribution strategy is not compatible with the `slim` library.
See the issue https://github.com/tensorflow/tensorflow/issues/23770 .
I sincerely appreciate your detailed reply. However, after using the `tf.contrib.estimator.replicate_model_fn` function and replacing `tf.dynamic_partition` with `tf.boolean_mask`, it still does not work. I think the problem is that I did not change the Estimator part in train.py. I reproduced the snippet you submitted to TensorFlow, and with your training part:

import time

import tensorflow as tf  # TensorFlow 1.x

# num_gpus and tower_model_fn are defined elsewhere (not shown here).


def train(api_sel=0):
    variable_strategy = 'CPU'
    input_device = '/cpu:0'
    # var_device = '/gpu:0'
    var_device = '/cpu:0'
    num_devices = num_gpus
    device_type = 'gpu'

    with tf.Graph().as_default() as graph:
        with tf.device(var_device):
            # Collectors shared by all towers.
            tower_ce_loss = []
            global_step = tf.train.get_or_create_global_step()
            name_scopes = ['tower_%d' % i for i in range(num_devices)]

        for i in range(num_devices):
            # Reuse the variables created by the first tower on all later towers.
            with tf.variable_scope(tf.get_variable_scope(), reuse=bool(i > 0)):
                worker_device = '/{0}:{1}'.format(device_type, i)
                with tf.name_scope(name_scopes[i]) as name_scope:
                    with tf.device(worker_device):
                        images = tf.ones([8, 321, 321, 3])
                        labels = tf.zeros([8, 321, 321, 1], dtype=tf.int32)
                        tower_model_fn(images, labels, api_sel=api_sel)
                        ce_now = tf.get_collection(tf.GraphKeys.LOSSES, scope=name_scope)
                        tower_ce_loss.append(tf.add_n(ce_now))

        with tf.device(var_device):
            update_ops = tf.get_collection(tf.GraphKeys.UPDATE_OPS, name_scopes[0])
            ce_loss = tf.add_n(tower_ce_loss)

        session_config = tf.ConfigProto(allow_soft_placement=True,
                                        log_device_placement=False)
        # Build an initialization operation to run below.
        init_op = tf.global_variables_initializer()
        max_steps = 30
        with tf.Session(config=session_config) as sess:
            sess.run(init_op)
            # Start the timer once, then restart it after every report,
            # so that the elapsed time really covers 10 steps.
            step_gap_init_time = time.time()
            for steps in range(1, max_steps + 1, 1):
                c_l = sess.run([ce_loss])
                if steps % 10 == 0:
                    gap_time = (time.time() - step_gap_init_time) / 10
                    print("ce loss{0}, {1:1.4f}s per steps".format(c_l, gap_time))
                    step_gap_init_time = time.time()
    tf.keras.backend.clear_session()
    tf.reset_default_graph()

it works well, but I have no idea how to change the code in the main(unused_argv) part of train.py. Could you give me a detailed guide? Thank you very much.
@SkeletonOne Did you wrap the optimizer like this?
optimizer = replicate_model_fn.TowerOptimizer(optimizer)
`replicate_model_fn` is compatible with `tf.estimator`.
The official DeepLab code also uses it:
https://github.com/tensorflow/models/blob/master/research/deeplab/train.py
My snippet works, but it is not easy to rewrite the code that way.
I still suggest that you try `replicate_model_fn`.
Another example I suggest is https://github.com/tensorflow/models/blob/master/tutorials/image/cifar10_estimator/cifar10_main.py
The main process is as follows:
1. Split the data and place one part on each GPU.
2. Place the model on each GPU. Be careful about the `reuse` argument. See `cifar10_main.py` for details.
3. Average the losses, or sum the losses and average the gradients. Be careful about the BN layers and their `update_ops`. See `cifar10_main.py` for details.
4. Update the variables.
5. Repeat steps 1-4.
I have a deadline coming up, so you know =_=||
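The five steps listed above can be sketched without TensorFlow on a toy one-variable model, which may help when checking a real multi-GPU implementation against them. Everything here is an illustrative assumption: the model `y = w * x`, the helper names, the learning rate, and the use of plain Python lists in place of GPUs. It only shows the data flow (split, per-tower gradient on shared variables, average, update, repeat) and is not the code from `cifar10_main.py`.

```python
def split_batch(xs, ys, num_towers):
    """Step 1: give each tower an equal slice of the batch."""
    size = len(xs) // num_towers
    return [(xs[i * size:(i + 1) * size], ys[i * size:(i + 1) * size])
            for i in range(num_towers)]


def tower_loss_and_grad(w, xs, ys):
    """Step 2: mean squared error of y = w * x on one tower, and d(loss)/dw."""
    n = len(xs)
    loss = sum((w * x - y) ** 2 for x, y in zip(xs, ys)) / n
    grad = sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / n
    return loss, grad


def train(xs, ys, num_towers=2, lr=0.05, steps=200):
    w = 0.0  # one shared variable, which is what `reuse` provides in TF
    for _ in range(steps):                       # step 5: repeat
        shards = split_batch(xs, ys, num_towers)
        results = [tower_loss_and_grad(w, sx, sy) for sx, sy in shards]
        avg_grad = sum(g for _, g in results) / num_towers  # step 3
        w -= lr * avg_grad                       # step 4: update the variable
    return w


xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]   # generated by y = 2 * x
w = train(xs, ys)
print(round(w, 3))
```

With equal-sized shards, the average of the per-tower gradients equals the gradient over the whole batch, which is why step 3 allows either averaging the losses or averaging the gradients.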