pumpikano / tf-dann

Domain-Adversarial Neural Network in Tensorflow
MIT License
628 stars 224 forks

Keras implementation? #14

Open michetonu opened 7 years ago

michetonu commented 7 years ago

Hey,

First of all thanks a lot for this. I was wondering whether there is an easy way to make the gradient flipping work in Keras. Someone has done it for the Theano backend, but not for the Tensorflow. Would it be feasible to combine the two?

Thanks!

pumpikano commented 7 years ago

I see two options, but I'm not knowledgeable enough with Keras to judge how easy they would be.

First, you could define a GradientReversal layer in Keras - this PR attempted that in the Keras 1.0 API, but was never finished. I believe this is feasible and would be best in terms of usability by other Keras users.

Second, I believe that Keras 2.0 and TF can interoperate relatively seamlessly, so you could use Keras to define most of your model and use TF for the gradient reversal portion. Here is a simple example of mixing the two. Again, I haven't used this myself, but it seems like a feasible option.

kilickaya commented 7 years ago

Adding to pumpikano's answer, you can implement gradient reversal layer as 'maximizing' the objective of interest rather than minimization. This can be implemented as minimizing the negative of the cost function in Tensorflow, and could be very easy with Keras also. So your objective may look like this:

min (object_cost - domain_cost)

where you favor good object recognition (object_cost) and hurt your model's ability to differentiate images from two different domains (domain_cost).

This leaves no need for explicitly implementing Gradient Reversal Layer, as it also achieves the same objective after all.

michetonu commented 7 years ago

@pumpikano Thanks, that looks feasible - I'll take a look and keep you updated.

@kilickaya That's true! Should I then basically just define that as a custom loss function for the domain adaptation part? Would that produce the same output as the paper?

kilickaya commented 7 years ago

In the paper, the Gradient Reversal Layer is used to move in the reverse direction of the gradients. This can be achieved in two ways. You can either reverse the sign of the gradients and minimize the same objective, or you can reverse the loss function, as I mentioned above. Implementation-wise they are slightly different, but they serve the same purpose (optimizing the same objective).

Either the authors did not recognize, while writing the paper, that this idea is that simple to implement, or they wanted to make it look more complex than necessary.

In their later manuscript, they also state that this can be implemented by maximization.

I don't know about Keras, but you can implement in Tensorflow like this (assume you use Adam optimizer):

optimizer = tf.train.AdamOptimizer(learning_rate=learningrate).minimize(object_cost - domain_cost, var_list= vars)

pumpikano commented 7 years ago

Just to clarify, you will need two alternating optimizations with this approach

min_D(domain_cost)
min_P(object_cost - domain_cost)

where D is the domain prediction subnetwork, P is the class prediction subnetwork, and min_X means "minimize w.r.t. function X". This is a classic GAN setup.

It is worth noting that it also gives you the freedom to allow the domain cost to differ between the two optimizations. For instance, many GAN implementations actually minimize the discriminator objective with flipped labels rather than maximizing it with correct labels. This amounts to a different objective and usually works better in practice in GANs.

michetonu commented 7 years ago

Could I just define a custom loss such as:

def custom_categorical_crossentropy(y_true, y_pred):
    return -K.categorical_crossentropy(y_pred, y_true)

?

@pumpikano what exactly do you mean by 'alternating'? I would just use one cost function in the label branch and the other one in the domain subbranch, correct?

pumpikano commented 7 years ago

I just mean that usually you will take a step with one loss function and then a step with the other for each batch, usually updating the discriminator first.
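
A toy sketch of that per-batch alternation, under loud assumptions: the "networks" are single scalars (`w_f` for the predictor side, `w_d` for the domain discriminator), the data and labels are made up, a squared error stands in for the object cost, and gradients are taken by finite differences. None of these names come from the tf-dann code.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def domain_cost(w_f, w_d, xs, ds):
    """Mean binary cross-entropy of the toy domain classifier."""
    total = 0.0
    for x, d in zip(xs, ds):
        p = sigmoid(w_d * w_f * x)
        total -= d * math.log(p) + (1 - d) * math.log(1 - p)
    return total / len(xs)


def object_cost(w_f, xs, ys):
    """Mean squared error standing in for the label-prediction loss."""
    return sum((w_f * x - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


def numeric_grad(fn, w, eps=1e-6):
    """Central finite-difference derivative of fn at w."""
    return (fn(w + eps) - fn(w - eps)) / (2 * eps)


xs = [1.0, 2.0, -1.0, -2.0]   # made-up inputs
ys = [0.5, 1.0, -0.5, -1.0]   # made-up regression targets
ds = [1, 1, 0, 0]             # made-up domain labels
w_f, w_d, lr = 1.0, 0.1, 0.1

for step in range(3):
    # 1) Discriminator step: minimise domain_cost w.r.t. w_d only.
    w_d -= lr * numeric_grad(lambda w: domain_cost(w_f, w, xs, ds), w_d)
    # 2) Predictor step: minimise object_cost - domain_cost w.r.t. w_f only.
    w_f -= lr * numeric_grad(
        lambda w: object_cost(w, xs, ys) - domain_cost(w, w_d, xs, ds), w_f)
    print(step, round(w_f, 4), round(w_d, 4))
```

Each step updates only one set of parameters, which is what distinguishes this from simply minimising `object_cost - domain_cost` over everything at once.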

michetonu commented 7 years ago

@pumpikano So I managed to implement the layer for the TF backend expanding on the link you provided, you can find it here: https://github.com/michetonu/gradient_inversion_keras_tf/blob/master/flipGradientTF.py

Regarding the loss function steps, Keras works with multi-output models, and I think the loss functions are just additive. Correct me if I'm wrong, but I think I can just create one model (it seems to work). The way I'm doing it is I'm feeding both distributions as input, but setting the target samples' loss weights to zero in the classifier, so that they don't get considered in the backprop update, while removing the need to alternate inputs. I'll happily share my code once I'm sure it works properly :)

pumpikano commented 7 years ago

Cool! Yeah that seems like a reasonable approach.

ghost commented 6 years ago

@michetonu Did you get your code working properly? Would love a Keras DANN implementation example. (Especially how you set the target samples' loss weights to zero in the classifier)

michetonu commented 6 years ago

@Wojova yes! Here is the gradient inversion layer: https://github.com/michetonu/gradient_reversal_keras_tf

For the sample weights you just need to create your input batches by alternating samples from the source and target domain, and pass a sample_weights array of 1s and 0s accordingly.

ghost commented 6 years ago

Thanks!

erlendd commented 6 years ago

Responding to some of the previous comments on here, I don't think that simply reversing the loss function (on the domain part of the network) is a replacement for the gradient reversing layer. The reason is that simply maximising the domain loss won't necessarily lead to domain-invariant features in the shared feature layer, as the weights of the domain-classifier layer will be able to give a bad loss even if the shared feature layer isn't domain invariant.

michetonu commented 6 years ago

@erlendd if you read the original paper by Ganin et al. (and the few other papers implementing the same domain adversarial training approach), that's exactly what the gradient inversion layer does. It does nothing on the forward pass, and it multiplies the gradient by a negative constant during backprop.
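
For readers who want the forward/backward behaviour spelled out, here is a minimal framework-free sketch. The class `GradientReversal` and its `forward`/`backward` methods are hypothetical names used only for illustration; this is not the Keras or TensorFlow API, and not the code in the linked repositories.

```python
class GradientReversal:
    """Toy gradient reversal layer (hypothetical helper, not a Keras/TF API)."""

    def __init__(self, lam=1.0):
        self.lam = lam  # strength of the reversal

    def forward(self, x):
        # Identity on the forward pass: activations pass through unchanged.
        return list(x)

    def backward(self, upstream_grad):
        # On the backward pass the incoming gradient is multiplied by -lam.
        return [-self.lam * g for g in upstream_grad]


grl = GradientReversal(lam=1.0)
features = [0.5, -1.0, 2.0]
print(grl.forward(features))           # same values as the input
print(grl.backward([0.1, 0.2, -0.3]))  # sign-flipped gradient
```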

erlendd commented 6 years ago

@michetonu yeah I know, I was replying to @kilickaya 's comment above.

Am I correct in saying that the implementation of DANN here is using the target labels during training? I.e. it isn't unsupervised?

michetonu commented 6 years ago

@erlendd it is unsupervised if you set the target's loss_weights to 0. That way they will still contribute to the accuracy (or whatever measure you choose) shown by Keras, but they won't contribute to the loss of the first classifier, which is what you want. Basically the labels are there but don't matter. You can actually check that this works by randomising the target's labels during training and getting the same performance (using the real labels at test time).
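
The label-randomisation check can be illustrated without Keras. The sketch below assumes a simple weighted mean of per-sample cross-entropy (the helper `weighted_crossentropy` is hypothetical, and Keras's exact normalisation may differ); the point is only that zero-weight terms drop out of the loss.

```python
import math
import random


def weighted_crossentropy(y_true, y_prob, weights):
    """Weighted mean of per-sample categorical cross-entropy.

    y_true: class indices; y_prob: per-sample probability lists;
    weights: per-sample weights (1 = source, 0 = target).
    """
    per_sample = [-math.log(p[t]) for t, p in zip(y_true, y_prob)]
    total = sum(w * l for w, l in zip(weights, per_sample))
    return total / sum(weights)


# Two source samples followed by two target samples (made-up predictions).
y_prob = [[0.7, 0.3], [0.2, 0.8], [0.6, 0.4], [0.5, 0.5]]
weights = [1, 1, 0, 0]
real_labels = [0, 1, 1, 0]

rng = random.Random(0)
# Randomise only the target labels (the last two entries).
random_labels = real_labels[:2] + [rng.randrange(2) for _ in range(2)]

loss_real = weighted_crossentropy(real_labels, y_prob, weights)
loss_random = weighted_crossentropy(random_labels, y_prob, weights)
print(loss_real == loss_random)
```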

erlendd commented 6 years ago

@michetonu sorry I was referring to the TF code here: https://github.com/pumpikano/tf-dann/blob/master/Blobs-DANN.ipynb. I don't see a loss_weights variable.

michetonu commented 6 years ago

@erlendd ah sorry, the trick is how they create the batches and run the network. They alternate the branches rather than doing forward and back prop simultaneously summing the losses. On the top branch only samples from the source domain are used for training, while the bottom branch receives both. The code is not super clear at first sight (took me a while as well) but that's how their generator works.

erlendd commented 6 years ago

Thanks, I get how it works now, but isn't there an issue with the batches not being fully shuffled? For example the first half of each batch is always the "source" set and the second half is the "target" set. I think the only way to get around this would be to use a Boolean mask, but I'm unsure.

michetonu commented 6 years ago

@erlendd just shuffle the domains separately before you create the batches! It shouldn't matter if they are split half-half once they go through the net, right? But in any case, you could even shuffle the batches one by one.

erlendd commented 6 years ago

@michetonu well the domains are already shuffled inside the batch generator code, so that's not an issue. The issue I think is that the training batch is constructed such that the first half is from the source domain and the second half is from the target domain, so with respect to the domain classifier the data has not been shuffled. Probably it won't matter if you use a smaller batch size, but if you used a much larger batch size there could be an issue, I guess.

Also I don't think it's possible to simply shuffle the training batches, as the tensorflow model here assumes that the first half of the batch is from the source domain and the second half is from the target domain. That's why I was suggesting that it could possibly be fixed using a mask in place of tf.cond in the code.

michetonu commented 6 years ago

@erlendd you are right that they are already shuffled! I wasn't looking at the code. It does not matter whether one half is always first - the model does not "remember" the labels of previous samples, so the order of training samples in each batch doesn't really matter IMHO.

erlendd commented 6 years ago

@michetonu I could be wrong, but I believe it does matter if the first half of samples always come from one distribution and the second half from another distribution - otherwise why would there be any necessity to shuffle in a simpler neural network model? I'm currently re-writing using a boolean mask so will check if this makes any sort of a difference.

michetonu commented 6 years ago

@erlendd my understanding was that shuffling prevents having the same train-test split every time, but I might be wrong!

erlendd commented 6 years ago

@michetonu you're right - order of training samples within a batch doesn't matter.

pumpikano commented 6 years ago

The reason that the order of examples within a batch does not matter is that the loss of the batch is the sum (really, a normalized sum, i.e. mean) of the losses of the examples. The gradient of a sum is the sum of the gradients of the terms, and addition is commutative, so order doesn't matter.
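
A quick numeric check of this argument, as a sketch with made-up data and a one-parameter linear model (`batch_gradient` is a hypothetical helper): the gradient of the mean loss is the same before and after shuffling the batch, up to floating-point rounding.

```python
import math
import random


def batch_gradient(w, xs, ys):
    """Gradient of the mean squared error of y ~ w * x with respect to w."""
    return sum(2.0 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)


xs = [0.5, -1.0, 2.0, 1.5, -0.5, 3.0]
ys = [1.0, -2.5, 4.0, 2.0, -1.0, 6.5]
w = 0.3

# Shuffle the (x, y) pairs together so each example keeps its label.
pairs = list(zip(xs, ys))
random.Random(0).shuffle(pairs)
xs_shuffled, ys_shuffled = zip(*pairs)

g_original = batch_gradient(w, xs, ys)
g_shuffled = batch_gradient(w, xs_shuffled, ys_shuffled)
print(math.isclose(g_original, g_shuffled, rel_tol=1e-12))
```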

Of course, the batches need to be fair samples in order for the gradient of each minibatch loss to approximate the gradient of the full training set loss. An easy algorithm for getting unbiased minibatches and ensuring that all examples are used is to shuffle the training examples and take minibatches sequentially until all examples are used, then repeat. Order matters in the sense that the shuffle is the reason this algorithm creates correctly sampled minibatches. Hope this helps!
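
The shuffle-then-slice procedure described above could look roughly like this. It is a sketch only: `minibatches` is a hypothetical helper, not the generator used in the tf-dann notebooks.

```python
import random


def minibatches(examples, batch_size, rng):
    """Yield minibatches covering every example exactly once per epoch."""
    order = list(range(len(examples)))
    rng.shuffle(order)  # a fresh shuffle for each epoch
    for start in range(0, len(order), batch_size):
        yield [examples[i] for i in order[start:start + batch_size]]


data = list(range(10))
rng = random.Random(0)
for epoch in range(2):
    batches = list(minibatches(data, batch_size=4, rng=rng))
    seen = sorted(x for batch in batches for x in batch)
    print(epoch, [len(b) for b in batches], seen == data)
```

The last batch of an epoch is smaller when the dataset size is not a multiple of the batch size; whether to keep or drop it is a separate choice.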

michetonu commented 6 years ago

Great explanation!

qianyizhang commented 6 years ago

@pumpikano I'm a bit confused after erlendd's comment... Just to be clear, you don't necessarily need the gradient reversal layer as long as you reverse the domain_cost in the final objective (total_cost), correct?

erlendd commented 6 years ago

@qianyizhang reversing domain_cost at the final objective does do the same thing as a GRL. If you just maximise the domain cost you won't know if it's because the shared feature layer has invariance or if the final layer of the domain classifier is bad at separating the domains.

pumpikano commented 6 years ago

@erlendd I think you meant to say "does not do the same thing as a GRL"?

In any case, looking at the diagram from the paper might be helpful. Descent on the parameters of G_d (the domain classifier) is minimizing domain classification loss, whereas descent on the parameters of G_f (the feature extractor) is maximizing domain classification loss because the gradients were reversed. If you removed the GRL and inverted the domain cost, descent on parameters of G_d and parameters of G_f would both be maximizing domain cost.
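
The difference can be seen in the signs of the gradients on a toy model. This is a sketch under stated assumptions: G_f and G_d are each reduced to one scalar weight (`w_f`, `w_d`), the sample and its domain label are made up, and lambda is taken as 1.

```python
import math


def domain_grads(w_f, w_d, x, d):
    """Gradients of the binary cross-entropy domain loss for one sample.

    Feature f = w_f * x, domain probability p = sigmoid(w_d * f).
    Returns (dL/dw_f, dL/dw_d).
    """
    f = w_f * x
    p = 1.0 / (1.0 + math.exp(-w_d * f))
    dz = p - d  # dL/d(logit)
    return dz * w_d * x, dz * f


g_f, g_d = domain_grads(w_f=0.8, w_d=1.5, x=2.0, d=1.0)

# With a GRL (lambda = 1): only the gradient flowing into the feature
# extractor is reversed; the domain classifier still sees the true gradient.
grl_grads = {"G_f": -g_f, "G_d": g_d}

# Without a GRL, minimising -domain_cost: both gradients are negated, so
# gradient descent pushes the domain classifier to get worse as well.
negated_grads = {"G_f": -g_f, "G_d": -g_d}

print(grl_grads)
print(negated_grads)
```

The two setups agree on G_f and disagree in sign on G_d, which is why negating the whole domain cost only matches the GRL if G_d is trained separately on the un-negated cost.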

qianyizhang commented 6 years ago

@erlendd
@pumpikano I see now. It's tricky to understand that the feature extractor part and domain extractor part are optimizing for the opposite objective, thanks for answering :-)

tmullen93 commented 6 years ago

@michetonu

I see how setting the sample_weights to zero for the target will prevent the classifier from updating, but doesn't it also keep the domain classifier from updating? Can you not use different sample_weights for different outputs?

michetonu commented 6 years ago

@tmullen93 if your model has two outputs you can define sample_weights as a 2D array such as [[1,1,1...0,0,0], None], one for each output.

tmullen93 commented 6 years ago

@michetonu Ahh that makes sense thank you!

What do you mean by make your batch alternate through source and target? Does this mean making a generator and forcing it to build a batch that is equal? Could this be why when I take your suggestion for the sample weight but don't make a generator to make the even batch my loss becomes NaN?

I should also note that I'm trying to do multivariate regression instead of classification.

michetonu commented 6 years ago

@tmullen93 just create mini-batches manually (with a generator or otherwise) in which target and source observations are alternated 50/50 (make sure to do the same for the labels, and be careful if you re-shuffle), and then set the training batch_size to a multiple of your mini-batch size. So for example if you alternate target and source observations 1 by 1, you can choose any power of 2 for your batch size, as it will have the same proportion of target/source samples.
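
A minimal sketch of the 1-by-1 alternation, using plain lists and placeholder sample names. `interleave` is a hypothetical helper; with Keras the resulting lists would presumably be converted to NumPy arrays before being passed as inputs, domain labels and classifier sample weights.

```python
def interleave(source, target):
    """Alternate source and target samples one by one.

    Returns the mixed samples, domain labels (0 = source, 1 = target) and
    classifier sample weights (1 = source, 0 = target). Extra samples in
    the longer list are dropped by zip.
    """
    samples, domains, weights = [], [], []
    for s, t in zip(source, target):
        samples += [s, t]
        domains += [0, 1]
        weights += [1, 0]
    return samples, domains, weights


source = ["s0", "s1", "s2", "s3"]
target = ["t0", "t1", "t2", "t3"]
samples, domains, weights = interleave(source, target)
print(samples)
print(weights)

# Any even-sized batch starting at an even offset is half source, half target.
batch_size = 4
for start in range(0, len(samples), batch_size):
    print(sum(weights[start:start + batch_size]))
```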

GlastonburyC commented 6 years ago

What can I do if I have multiple domains I'd like to account for? How would I set sample_weights? Arbitrarily choose a source domain and set everything else to one?

erlendd commented 6 years ago

@GlastonburyC I tried this method on data with about 100+ domains, and a single target domain. Trying to use many source domains at once with this method doesn't really work (the network can't learn very well). You might be able to use an unsupervised method (e.g. dAE, TCA) to bring the source domains closer together, but I guess it's still an open question.

GlastonburyC commented 6 years ago

Thanks for the response! Could you show me an example of how you set the sample weights for multiple domains?

erlendd commented 6 years ago

@GlastonburyC you mean if you wanted to append a 'weight' column to your data, where that weight might be different for each domain?

I don't have the code in front of me but I did it using Pandas. If the training covariates are stored in a dataframe called df you can add a new column doing:

df['name_of_col'] = values

If you want to make it conditional on the domain, you can use this:

mask = df['domain'] == 'domain1'
df.loc[mask, 'name_of_col'] = values

GlastonburyC commented 6 years ago

@michetonu I'm performing image segmentation, so I've bolted on a domain classifier with a gradient reversal layer.

If my batch is 50/50 'source' and 'target' images and I don't set any sample_weights, is the gradient reversal layer sitting between the domain classifier and my segmentation network equivalent to forcing my segmentation network to learn domain-invariant features? I don't necessarily have a 'source' and 'target'; I just want the segmentation network to work on a test set that may be from a different distribution than my input.

michetonu commented 6 years ago

@GlastonburyC if you don't 'hide' the target samples from the classifier, you are not performing unsupervised domain adaptation anymore. It will probably still help, but I think there are better ways to do this if you have the labels for your 'target' dataset.

GlastonburyC commented 6 years ago

My thought process is that sometimes you don't know your test data's target domain. Therefore, if you negate the gradient of the domain classifier for both source and target samples (where the source and target are merely two distributions from a possible population of distributions), the classifier network is forced to learn domain-invariant features. Is that logic correct?

Would appreciate it if you pointed me to a paper on a better method! :) Cheers for the help.

puntopasta commented 6 years ago

Thanks for this contribution. Previously we have implemented this in a very difficult way by having two different models with tied encoder weights but separate loss functions, this is way more elegant and less error-prone.

However I'm not sure what the hp_lambda stands for. What are we supposed to pass there? Mustafa

GlastonburyC commented 6 years ago

It's a scaling factor for how much of the reversed domain gradient is added to the classifier's weight updates (1 adds the full gradient, 0.5 adds half of it, etc.).
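
As a rough numeric illustration of that scaling: the gradient values below are made up and `shared_grad` is a hypothetical helper, while `hp_lambda` is the parameter being asked about. The gradient reaching the shared layers is the label-loss gradient plus the domain-loss gradient multiplied by `-hp_lambda`.

```python
def shared_grad(label_grad, domain_grad, hp_lambda):
    """Gradient seen by the shared feature weights when a gradient reversal
    layer with strength hp_lambda sits in front of the domain classifier."""
    return [gl - hp_lambda * gd for gl, gd in zip(label_grad, domain_grad)]


label_grad = [0.4, -0.2]   # made-up gradient of the label loss
domain_grad = [0.1, 0.3]   # made-up gradient of the domain loss

for hp_lambda in (0.0, 0.5, 1.0):
    print(hp_lambda, shared_grad(label_grad, domain_grad, hp_lambda))
```

With `hp_lambda = 0` the domain branch has no influence on the shared weights; larger values push harder towards features the domain classifier cannot separate.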

puntopasta commented 6 years ago

Right, thanks for the explanation (and quick response).

I suspected something like this since it was multiplied with the gradient, good to know for sure. Going to try it out now :)

puntopasta commented 6 years ago

By the way, as an alternative to passing custom sample weights to do the iterative training of the two targets, you could probably just compile two models.

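i = Input(shape=input_shape)  # input_shape depends on your data
e = encoder(i)
c = classifier(e)
a = adversarial(e)  # this includes the gradient reversal

# The optimizer and losses below are placeholders; substitute your own.
classifier_trainer = Model(i, c)
classifier_trainer.compile(optimizer='sgd', loss='categorical_crossentropy')
adversarial_trainer = Model(i, a)
adversarial_trainer.compile(optimizer='sgd', loss='binary_crossentropy')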

A simple minimax training would then look something like this:

for b in range(batches):
    x, class_labels, domain_labels = batch(b)  # `class` is a reserved word in Python
    classifier_trainer.train_on_batch(x, class_labels)
    adversarial_trainer.train_on_batch(x, domain_labels)

Or am I overlooking something?

EDIT: seems to be working. The convenience is that when you're done you can just take the classifier_trainer model and use that as the "final model" for evaluation

erlendd commented 6 years ago

I just wondered how one would modify the code to do the multiple-source variant of this. I.e. following this paper: https://arxiv.org/abs/1705.09684. It isn't clearly explained how the batches are constructed with multiple source domains.

anamaqueda commented 5 years ago

@michetonu I have followed the same strategy as you regarding the network architecture: one model with two outputs.

> Regarding the loss function steps, Keras works with multi-output models, and I think the loss functions are just additive. Correct me if I'm wrong, but I think I can just create one model (it seems to work). The way I'm doing it is I'm feeding both distributions as input, but setting the target samples' loss weights to zero in the classifier, so that they don't get considered in the backprop update, while removing the need to alternate inputs. I'll happily share my code once I'm sure it works properly :)

Since I am working with the fit_generator function, I have implemented the following custom loss instead of using sample weights:

def custom_categorical_crossentropy(y_true, y_pred):
    source_true = y_true[:FLAGS.batch_size]
    source_pred = y_pred[:FLAGS.batch_size]
    loss = K.categorical_crossentropy(source_true, source_pred)
    return K.mean(loss)

That makes sense to me because I create mini-batches manually alternating 50-50 source and target samples. However, I am not getting domain adaptation, even if I change hyperparameters. That's why I wonder if this custom loss implementation is correct.

I'd really appreciate any help. Thanks!

michetonu commented 5 years ago

@amn-gti-upm Sorry for the late reply! How are you using this loss? This makes sense to me as the loss for the classifier part of the network, but you also need to sum the inverted loss of the domain classifier. Are you doing that? If so, can you share the code?

anamaqueda commented 5 years ago

@michetonu As I pointed out in my previous comment, I used one model with two outputs: the label predictor (LP) and the domain classifier (DC). I used the custom_categorical_crossentropy loss for the former, and binary cross-entropy loss for the latter, like this:

model.compile(loss=[custom_categorical_crossentropy, 'binary_crossentropy'],
                  optimizer=SGD(momentum=0.9))

These two losses are additive. In order to maximize the domain classifier loss, i.e. the binary cross-entropy, I used your Gradient Reversal Layer (GRL) implementation. I finally got domain adaptation with this implementation, but fixing the lambda parameter at 1.