microsoft / CNTK

Microsoft Cognitive Toolkit (CNTK), an open source deep-learning toolkit
https://docs.microsoft.com/cognitive-toolkit/

'RuntimeError: AddNodeToNet: Duplicated name for Plus2618 Plus operation' when reloading/retraining model #3385

Open dmagee opened 6 years ago

dmagee commented 6 years ago

This is a strange error: it only appears when a) a saved model is reloaded and retrained, and b) one particular extra layer (two lines) is added to the model.

Error:

Traceback (most recent call last):
  File "train_cntk_unet.py", line 125, in <module>
    t=train(True)
  File "train_cntk_unet.py", line 91, in train
    trainer.train_minibatch(data)
  File "c:\Python27\lib\site-packages\cntk\train\trainer.py", line 181, in train_minibatch
    arguments, device)
  File "c:\Python27\lib\site-packages\cntk\cntk_py.py", line 3024, in train_minibatch_overload_for_minibatchdata
    return _cntk_py.Trainer_train_minibatch_overload_for_minibatchdata(self, *args)
RuntimeError: AddNodeToNet: Duplicated name for Plus2618 Plus operation.

[CALL STACK]
    > std::enable_shared_from_this<Microsoft::MSR::CNTK::MatrixBase>::  shared_from_this
    - std::enable_shared_from_this<Microsoft::MSR::CNTK::MatrixBase>::  shared_from_this
    - CNTK::Internal::  UseSparseGradientAggregationInDataParallelSGD (x14)

model (snippet):

    conv1 = ConvReLULayerTanCA(left_norm, ks, nfmaps1,150000,name="c1a")
    conv1b = ConvReLULayerTanCA(conv1,ks,nfmaps1,nfmaps1,name="c1b")
    conv1 = C.plus(conv1,conv1b,name='p1a')
    conv1b = ConvReLULayerTanCA(conv1,ks,nfmaps1,nfmaps1,name="c1c")
    conv1 = C.plus(conv1,conv1b,name='p1b')    
    conv1b = ConvReLULayerTanCA(conv1,ks,nfmaps1,nfmaps1,name="c1d")
    conv1 = C.plus(conv1,conv1b,name='p1c')  
    conv1b = ConvReLULayerTanCA(conv1,ks,nfmaps1,nfmaps1,name="c1e")
    conv1 = C.plus(conv1,conv1b,name='p1d')      

    pool1 = MaxPooling((2,2), strides=(2,2))(conv1)

    conv2 = ConvReLULayerTanCA(pool1, ks, nfmaps2,nfmaps1,name="c2a")
    conv2b = ConvReLULayerTanCA(conv2,ks,nfmaps2,nfmaps2,name="c2b")
    conv2 = C.plus(conv2,conv2b,name='p2a')
    conv2b = ConvReLULayerTanCA(conv2,ks,nfmaps2,nfmaps2,name="c2c")
    conv2 = C.plus(conv2,conv2b,name='p2b')
    conv2b = ConvReLULayerTanCA(conv2,ks,nfmaps2,nfmaps2,name="c2d")
    conv2 = C.plus(conv2,conv2b,name='p2c')  
    conv2b = ConvReLULayerTanCA(conv2,ks,nfmaps2,nfmaps2,name="c2e")
    conv2 = C.plus(conv2,conv2b,name='p2d')  

    pool2 = MaxPooling((2,2), strides=(2,2))(conv2)

    conv3 = ConvReLULayerTanCA(pool2, ks, nfmaps3,nfmaps2,name="c3a")
    conv3b = ConvReLULayerTanCA(conv3,ks,nfmaps3,nfmaps3,name="c3b")
    conv3 = C.plus(conv3,conv3b,name='p3a')
    conv3b = ConvReLULayerTanCA(conv3,ks,nfmaps3,nfmaps3,name="c3c")
    conv3 = C.plus(conv3,conv3b,name='p3b')
    conv3b = ConvReLULayerTanCA(conv3,ks,nfmaps3,nfmaps3,name="c3d")
    conv3 = C.plus(conv3,conv3b,name='p3c') 
    conv3b = ConvReLULayerTanCA(conv3,ks,nfmaps3,nfmaps3,name="c3e")
    conv3 = C.plus(conv3,conv3b,name='p3d')        

    pool3 = MaxPooling((2,2), strides=(2,2))(conv3)

    conv4 = ConvReLULayerTanCA(pool3, ks, nfmaps4,nfmaps3,name="c4a")
    conv4b = ConvReLULayerTanCA(conv4,ks,nfmaps4,nfmaps4,name="c4b")
    conv4 = C.plus(conv4,conv4b,name='p4a')
    # Uncommenting the following 2 lines gives error on reload/retrain:
    # File "c:\Python27\lib\site-packages\cntk\cntk_py.py", line 3024, in train_minibatch_overload_for_minibatchdata
    #  return _cntk_py.Trainer_train_minibatch_overload_for_minibatchdata(self, *args)
    #    RuntimeError: AddNodeToNet: Duplicated name for Plus2618 Plus operation.
    #conv4b = ConvReLULayerTanCA(conv4,ks,nfmaps4,nfmaps4,name="c4c")
    #conv4 = C.plus(conv4,conv4b,name='p4b')       

    #conv4b = ConvReLULayerTanCA(conv4,ks,nfmaps4,nfmaps4,name="c4d")
    #conv4 = C.plus(conv4,conv4b,name='p4c')    
    #conv4b = ConvReLULayerTanCA(conv4,ks,nfmaps4,nfmaps4,name="c4e")
    #conv4 = C.plus(conv4,conv4b,name='p4d')       

    pool4 = MaxPooling((2,2), strides=(2,2))(conv4)

    conv5 = ConvReLULayerTanCA(pool4, ks, nfmaps5,nfmaps4,name="c5a")
    conv5b = ConvReLULayerTanCA(conv5,ks,nfmaps5,nfmaps5,name="c5b")
    conv5 = C.plus(conv5,conv5b,name='p5a')
    conv5b = ConvReLULayerTanCA(conv5,ks,nfmaps5,nfmaps5,name="c5c")
    conv5 = C.plus(conv5,conv5b,name='p5b')
    conv5b = ConvReLULayerTanCA(conv5,ks,nfmaps5,nfmaps5,name="c5d")
    conv5 = C.plus(conv5,conv5b,name='p5c')   
    conv5b = ConvReLULayerTanCA(conv5,ks,nfmaps5,nfmaps5,name="c5e")
    conv5 = C.plus(conv5,conv5b,name='p5d')        

What's really odd is: a) the code above works fine both on initial model creation and on reload/retrain; b) uncommenting the two lines to add one more ResNet-style layer (exactly the same as the other layers) produces the error; c) I've named all layers explicitly, so why does it still seem to be using arbitrary names (Plus2618)?

The model is loaded or created as follows:

    if use_existing:
        model=C.load_model("final_model")
        net_output = model(model.find_by_name('features'))
        features = model.find_by_name('features')
    else:     
        features = C.input_variable(shape,np.float32, name = 'features')
        model = cntk_unet.create_model(features,6)    
        net_output = model(features)

It trains fine when the model is created from scratch, but only the simpler version works if use_existing=True.
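Since the error mentions an auto-generated name (Plus2618) rather than any of the explicit ones, one way to narrow this down might be to list the node names of the loaded graph and look for repeats. Below is a minimal sketch of just the duplicate check, assuming the names have already been collected into a plain list by whatever means. The helper `find_duplicate_names` and the example names are hypothetical and not part of CNTK; no CNTK call is made here.

```python
from collections import Counter


def find_duplicate_names(names):
    """Map each non-empty name that occurs more than once to its count.

    Hypothetical helper: `names` is any iterable of strings, for example
    node names gathered from a loaded graph. Empty names are skipped.
    """
    counts = Counter(name for name in names if name)
    return {name: n for name, n in counts.items() if n > 1}


# Made-up names standing in for what a loaded graph might contain.
example_names = ["c4a", "p4a", "Plus2618", "c4b", "Plus2618", ""]
duplicates = find_duplicate_names(example_names)
print(duplicates)
```

If a name shows up twice in the reloaded model but not in the freshly built one, that would at least point at which part of the graph is being added a second time.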

Environment: cntk 2.5.1 (latest installed via pip today), python 2.7.14, windows 10.

Thanks!

dmagee commented 6 years ago

The error function is:

def dice_coefficient(x, y):
    # average of per-channel dice coefficient
    # global dice coefficient doesn't work, as the class with the larger region dominates the metric
    # https://en.wikipedia.org/wiki/S%C3%B8rensen%E2%80%93Dice_coefficient
    intersection = C.reduce_sum(x * y, axis=(1,2))

    denominator = C.plus(C.reduce_sum(C.relu(x), axis=(1,2)),C.reduce_sum(C.relu(y), axis=(1,2)),name="dcp")
    denominator1 = C.plus(denominator,1,name="dcp2")

    return C.reduce_mean(2.0 * intersection / (denominator1))

This function is built from scratch whether or not the model is loaded.
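For reference, the arithmetic in that function can be checked without CNTK. The sketch below is a plain-Python version of the same formula, assuming x and y are nested lists shaped [channel][row][col] and that axis=(1,2) refers to the two spatial axes; the batch axis is ignored. The name `dice_reference` and the sample values are made up for illustration.

```python
def dice_reference(x, y):
    """Per-channel dice with a +1 in the denominator, averaged over channels.

    Assumption: x and y are nested lists shaped [channel][row][col].
    """
    per_channel = []
    for cx, cy in zip(x, y):
        flat_x = [v for row in cx for v in row]
        flat_y = [v for row in cy for v in row]
        # Sum of elementwise products over the spatial axes.
        intersection = sum(a * b for a, b in zip(flat_x, flat_y))
        # ReLU is applied before summing, as in the CNTK version.
        denominator = (sum(max(a, 0.0) for a in flat_x)
                       + sum(max(b, 0.0) for b in flat_y))
        per_channel.append(2.0 * intersection / (denominator + 1.0))
    return sum(per_channel) / len(per_channel)


# Two channels of 2x2 masks: partial overlap, then full overlap.
x = [[[1.0, 0.0], [1.0, 0.0]], [[1.0, 1.0], [1.0, 1.0]]]
y = [[[1.0, 1.0], [0.0, 0.0]], [[1.0, 1.0], [1.0, 1.0]]]
score = dice_reference(x, y)
print(score)
```

Note that because of the +1 in the denominator, even the fully overlapping channel scores 8/9 rather than 1.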