rishizek / tensorflow-deeplab-v3-plus

DeepLabv3+ built in TensorFlow
MIT License

tf.dynamic_partition may cause NaN loss when used with multiple GPUs, but performs normally with a single GPU #36

Closed mzhaoshuai closed 4 years ago

mzhaoshuai commented 5 years ago

I tried to implement a multi-GPU version of your code, and I found that tf.dynamic_partition may cause NaN loss when used with multiple GPUs, while it performs normally with a single GPU.

I reported this bug to TensorFlow; a reproducible example can be found here: https://github.com/tensorflow/tensorflow/issues/23918

So if you want to use multiple GPUs, remember to replace tf.dynamic_partition with tf.boolean_mask.
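
For example, a minimal sketch of the substitution in the loss-masking code (the variable names are my own and only illustrative, not the repo's exact code):

```python
import tensorflow as tf

def masked_softmax_loss(logits, labels, num_classes):
    """Mask out ignore-label pixels with tf.boolean_mask instead of tf.dynamic_partition."""
    labels_flat = tf.reshape(labels, [-1])
    logits_flat = tf.reshape(logits, [-1, num_classes])

    # The tf.dynamic_partition pattern that can give NaN loss on multiple GPUs:
    #   valid_indices = tf.to_int32(labels_flat <= num_classes - 1)
    #   valid_labels = tf.dynamic_partition(labels_flat, valid_indices, num_partitions=2)[1]
    #   valid_logits = tf.dynamic_partition(logits_flat, valid_indices, num_partitions=2)[1]

    # Replacement with tf.boolean_mask:
    valid_mask = labels_flat <= num_classes - 1
    valid_labels = tf.boolean_mask(labels_flat, valid_mask)
    valid_logits = tf.boolean_mask(logits_flat, valid_mask)

    return tf.losses.sparse_softmax_cross_entropy(labels=valid_labels, logits=valid_logits)
```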

Finally, thanks for the owner's work!

mzhaoshuai commented 5 years ago

The repo https://github.com/rishizek/tensorflow-deeplab-v3 also uses tf.dynamic_partition.

HanChen-HUST commented 5 years ago

@mzhaoshuai When I run deeplabv3's evaluate.py, I get the following error: tensorflow.python.framework.errors_impl.InvalidArgumentError: Number of ways to split should evenly divide the split dimension, but got split_dim 2 (size=1) and num_split 3 [[{{node split}} = Split[T=DT_FLOAT, num_split=3, _device="/device:CPU:0"](split/split_dim, ToFloat)]] [{{node IteratorGetNext}} = IteratorGetNextoutput_shapes=[[?,?,?,3],[?,?,?,1]], output_types=[DT_FLOAT,DT_INT32], _device="/job:localhost/replica:0/task:0/device:CPU:0"]] Could you help me figure out what's going on? Thanks.

mzhaoshuai commented 5 years ago

@chenhust1995

If you want to communicate on GitHub, please use English...

I do not know what's wrong with your program. If you do not change the source code, it should work fine.

From the error information, I guess you wrote a multi-GPU version of the code and may have added a split op when handling the input data. Maybe you can check the corresponding code.

There are also many similar questions; you can Google the error info. https://github.com/tensorflow/tensor2tensor/issues/266

HanChen-HUST commented 5 years ago

I thought you were Chinese; I also come from Huazhong University of Science and Technology, so I used Chinese to communicate with you. I used this model with the COCO-Stuff dataset; it can create the TFRecord dataset and can be trained, but when I evaluate the model, this error happens.

mzhaoshuai commented 5 years ago

@chenhust1995 Sorry for the late reply; I had some things to deal with today. I can tell from your nickname that you are my alumnus.

Did you solve the problem?

[[{{node split}} = Split[T=DT_FLOAT, num_split=3, _device="/device:CPU:0"](split/split_dim, ToFloat)]]
[{{node IteratorGetNext}} = IteratorGetNextoutput_shapes=[[?,?,?,3],[?,?,?,1]], output_types=[DT_FLOAT,DT_INT32], _device="/job:localhost/replica:0/task:0/device:CPU:0"]]

The info shows that some split ops may be placed on the CPU device, and that they are related to the iterator. The iterator is created when you feed the data to the model. Can you check the code of the input pipeline?

You can also check whether there are some dirty records in your test TFRecord.
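
If it helps, here is a rough sketch of how one could scan the eval TFRecord for records whose image does not decode to 3 channels. The record path and the 'image/encoded' feature key are my guesses about how your record was written; adjust them to your conversion script:

```python
import tensorflow as tf

def scan_tfrecord(path):
    """Print any record whose encoded image does not decode to H x W x 3."""
    encoded_ph = tf.placeholder(tf.string)
    decoded = tf.image.decode_image(encoded_ph)
    with tf.Session() as sess:
        for i, record in enumerate(tf.python_io.tf_record_iterator(path)):
            example = tf.train.Example()
            example.ParseFromString(record)
            # 'image/encoded' is an assumed feature key; match it to your conversion script.
            encoded = example.features.feature['image/encoded'].bytes_list.value[0]
            shape = sess.run(decoded, feed_dict={encoded_ph: encoded}).shape
            if len(shape) != 3 or shape[-1] != 3:
                print('record %d decodes to shape %s' % (i, shape))

scan_tfrecord('path/to/your_val.record')
```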

I do not have your code, so I can only give these suggestions.

SkeletonOne commented 5 years ago


Hello, I'm now about to implement this code on two GPUs, but it is quite hard for me, so may I have your code? Thanks a lot.

mzhaoshuai commented 5 years ago


@SkeletonOne Sorry to tell you that I cannot give my code to you now. But I can show you how I implemented it:

One simple way is to use the tf.contrib.estimator.replicate_model_fn function; you can find some help here:
https://stackoverflow.com/questions/47223766/how-to-run-tensorflow-estimator-on-multiple-gpus-with-data-parallelism/47599805
https://www.tensorflow.org/api_docs/python/tf/contrib/estimator/replicate_model_fn
You need to replace tf.dynamic_partition with tf.boolean_mask.
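
As a rough sketch of this first way (not my actual patch; deeplab_model.deeplabv3_plus_model_fn is this repo's model_fn name as far as I remember, so double-check it):

```python
import tensorflow as tf
import deeplab_model  # this repo's model definition

def build_multi_gpu_estimator(model_dir, run_config, params):
    # Wrap the existing model_fn so each GPU tower receives a shard of the batch.
    distributed_model_fn = tf.contrib.estimator.replicate_model_fn(
        deeplab_model.deeplabv3_plus_model_fn)
    return tf.estimator.Estimator(
        model_fn=distributed_model_fn,
        model_dir=model_dir,
        config=run_config,
        params=params)  # keep whatever params dict train.py already builds

# Inside the model_fn, the optimizer must also be wrapped:
#   optimizer = tf.contrib.estimator.TowerOptimizer(optimizer)
```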

The other way is to place the variables on the devices manually. You can find some help here:
https://github.com/tensorflow/models/blob/master/tutorials/image/cifar10_estimator/cifar10_main.py
https://github.com/tensorflow/models/blob/master/official/resnet/imagenet_main.py
https://www.tensorflow.org/api_docs/python/tf/Variable
A code snippet can be found in this issue: https://github.com/tensorflow/tensorflow/issues/23918. You can set up your model function like the snippet in that issue.

You can try the first way first... it only needs a few code changes. If it does not work, try the second way.

Besides, the distribution strategy is not compatible with the slim library. See this issue: https://github.com/tensorflow/tensorflow/issues/23770

SkeletonOne commented 5 years ago


I sincerely appreciate your detailed reply. However, after using the tf.contrib.estimator.replicate_model_fn function and replacing tf.dynamic_partition with tf.boolean_mask, it still doesn't work. I think the problem is that I didn't change the Estimator part in train.py. I re-implemented your submission to TensorFlow, and with your training part:

```python
# num_gpus and tower_model_fn are defined elsewhere in my script
import time
import tensorflow as tf

def train(api_sel=0):
    # variable strategy
    variable_strategy = 'CPU'
    input_device = '/cpu:0'
    # var_device = '/gpu:0'
    var_device = '/cpu:0'
    num_devices = num_gpus
    device_type = 'gpu'

    with tf.Graph().as_default() as graph:
        with tf.device(var_device):
            ## some collector
            tower_ce_loss = []

            global_step = tf.train.get_or_create_global_step()

            name_scopes = ['tower_%d' % i for i in range(num_devices)]
            for i in range(num_devices):
                with tf.variable_scope(tf.get_variable_scope(), reuse=bool(i > 0)):
                    worker_device = '/{0}:{1}'.format(device_type, i)
                    with tf.name_scope(name_scopes[i]) as name_scope:
                        with tf.device(worker_device):

                            images = tf.ones([8, 321, 321, 3])
                            labels = tf.zeros([8, 321, 321, 1], dtype=tf.int32)
                            tower_model_fn(images, labels, api_sel=api_sel)
                            ce_now = tf.get_collection(tf.GraphKeys.LOSSES, scope=name_scope)
                            tower_ce_loss.append(tf.add_n(ce_now))

        with tf.device(var_device):
            update_ops = tf.get_collection(tf.GraphKeys.UPDATE_OPS, name_scopes[0])

            ce_loss = tf.add_n(tower_ce_loss)
            session_config = tf.ConfigProto(allow_soft_placement=True,
                                            log_device_placement=False)
            # Build an initialization operation to run below.
            init_op = tf.global_variables_initializer()
            max_steps = 30
            with tf.Session(config=session_config) as sess:
                sess.run(init_op)
                step_gap_init_time = time.time()
                for steps in range(1, max_steps + 1, 1):
                    c_l = sess.run([ce_loss])
                    if steps % 10 == 0:
                        gap_time = (time.time() - step_gap_init_time) / 10
                        print("ce loss {0}, {1:1.4f}s per step".format(c_l, gap_time))
                        step_gap_init_time = time.time()

    tf.keras.backend.clear_session()
    tf.reset_default_graph()
```

it works well, but I have no idea how to change the code in the main(unused_argv) part of train.py. Could you give me a detailed guide? Thanks very much.

mzhaoshuai commented 5 years ago

@SkeletonOne Did you replace the optimizer with this code?

optimizer = replicate_model_fn.TowerOptimizer(optimizer)

replicate_model_fn is compatible with tf.estimator. The official DeepLab code also uses it: https://github.com/tensorflow/models/blob/master/research/deeplab/train.py
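
For reference, a sketch of where that wrap would sit inside the model_fn. The MomentumOptimizer and params['momentum'] are assumptions based on how I remember deeplab_model.py, so treat this as a fragment to adapt, not a drop-in patch:

```python
optimizer = tf.train.MomentumOptimizer(
    learning_rate=learning_rate, momentum=params['momentum'])
# Wrap the optimizer so replicate_model_fn can aggregate gradients across towers.
optimizer = tf.contrib.estimator.TowerOptimizer(optimizer)

# Keep the batch-norm update ops as a dependency of the train op.
update_ops = tf.get_collection(tf.GraphKeys.UPDATE_OPS)
with tf.control_dependencies(update_ops):
    train_op = optimizer.minimize(loss, global_step=tf.train.get_or_create_global_step())
```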

My snippet code works, but it is not easy to rewrite the code that way. I still suggest you try replicate_model_fn...

Another example I suggest is https://github.com/tensorflow/models/blob/master/tutorials/image/cifar10_estimator/cifar10_main.py

The main process is as follows (a rough sketch of steps 3 and 4 is given after the list):

1. The data needs to be split and placed on each GPU.
2. Place the model on each GPU. Be careful with the `reuse` argument; see `cifar10_main.py` for details.
3. You need to average the loss, or sum the loss and average the gradients. Be careful with the BN layers and their `update_ops`. See `cifar10_main.py` for details.
4. Update the variables. 
5. Loop 1-4.
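
Here is a hedged sketch of steps 3 and 4, in the style of cifar10_main.py: it averages the per-tower gradients and runs the BN update ops from the first tower. tower_losses and name_scopes come from a per-GPU model-building loop like the one you posted; the MomentumOptimizer is just an example choice:

```python
import tensorflow as tf

def build_train_op(tower_losses, name_scopes, learning_rate=0.007):
    """Average gradients over the towers and attach the BN update ops (steps 3-4)."""
    optimizer = tf.train.MomentumOptimizer(learning_rate, momentum=0.9)
    tower_grads = [optimizer.compute_gradients(loss) for loss in tower_losses]

    averaged = []
    for grads_and_vars in zip(*tower_grads):  # the same variable across all towers
        grads = [g for g, _ in grads_and_vars if g is not None]
        # Assumes dense gradients; IndexedSlices would need extra handling.
        averaged.append((tf.reduce_mean(tf.stack(grads, axis=0), axis=0),
                         grads_and_vars[0][1]))

    # Run the batch-norm moving-average updates from the first tower only,
    # as cifar10_main.py does.
    update_ops = tf.get_collection(tf.GraphKeys.UPDATE_OPS, name_scopes[0])
    with tf.control_dependencies(update_ops):
        return optimizer.apply_gradients(
            averaged, global_step=tf.train.get_or_create_global_step())
```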

Recently, I have a deadline... so you know=_=||